In a dynamic distributed system, nodes can join or leave the cluster at any time (e.g., auto-scaling, hardware failures). For the system to function, all nodes need to know who is currently in the group and their status (Alive or Dead).
A component that maintains a list of currently active nodes and notifies the rest of the system when the list changes.
In a distributed system, a node "failing" is often indistinguishable from a "slow network" or "long GC pause". We use Heartbeating to detect failures.
- Each node sends a
HEARTBEATmessage periodically. - If a node fails to send a heartbeat within a certain timeout, other nodes suspect it's dead.
Setting a timeout too short leads to false positives (detecting healthy nodes as dead). Setting it too long slows down failure recovery.
For a large cluster (thousands of nodes), a central membership service becomes a bottleneck. Gossip Protocols provide a decentralized way to spread information.
- The Algorithm: Every few seconds, a node picks a random set of other nodes and sends them its knowledge of the cluster state.
- The Outcome: Information spreads exponentially, like a virus or a rumor.
Gossip: Decentralized Information Spread
- Scalable: No single point of failure.
- Robust: Extremely resilient to packet loss and node failures.
- Low Load: Each node only communicates with a few others at a time.
A piece of information spreads to nodes in rounds of gossip. In a cluster of 1,000 nodes, a rumor reaches everyone in about 10 rounds.
In our final chapter, we will wrap up by looking at how asynchronous communication through message streams can decouple complex systems.