A distributed system is a collection of independent computers that appears to its users as a single coherent system. The nodes in a distributed system communicate and coordinate their actions by passing messages.
The goal is to make a collection of independent computers appear to its users as a single coherent system.
A system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another in order to achieve a common goal.
Key Characteristics
- Concurrency of components: Multiple machines execute simultaneously.
- Lack of a global clock: It's difficult to establish a true ordering of events.
- Independent failure of components: One machine can fail while others continue running.
We build distributed systems to scale beyond the limits of a single machine, to provide high availability and fault tolerance, and to reduce latency by placing data closer to users geographically.
When engineers first start building distributed systems, they often make assumptions that are true for a single machine but false for a network of machines. These are known as the 8 Fallacies of Distributed Computing.
| Assumption | Reality | Impact |
|---|---|---|
| The network is reliable | Packets drop, connections break | Need retries and timeouts |
| Latency is zero | Speed of light limits communication | Need batching and caching |
| Bandwidth is infinite | Network pipes get congested | Need data compression |
| The network is secure | Data can be intercepted | Need encryption (TLS/SSL) |
| Topology doesn't change | Nodes join and leave constantly | Need dynamic routing/discovery |
| There is one administrator | Different teams manage different parts | Need strict API contracts |
| Transport cost is zero | Moving data costs money | Need data locality optimization |
| The network is homogeneous | Different hardware and OSs | Need standardized protocols |
Because the network is unreliable, any RPC (Remote Procedure Call) can fail. You must design your system expecting that any external call might time out or return an error.
In the next section, we will explore the profound implications of the lack of a global clock.