For very large datasets or very high throughput, replication is not enough. We need to break the data into partitions (also called shards). Each piece of data belongs to exactly one partition.
The main goal of partitioning is scalability. Different partitions can be stored on different nodes, allowing a large dataset to be spread across many disks and query load to be spread across many processors.
Partitioned Database Architecture
How do we decide which record goes to which partition?
1. Key Range Partitioning
Assign a continuous range of keys (from some minimum to some maximum) to each partition.
- Pro: Efficient range queries (e.g., "get all users with ID between 100 and 200").
- Con: Certain access patterns can lead to hotspots. If keys are timestamps, all writes for today go to the same partition.
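Key range partitioning can be sketched with a sorted list of boundary keys and a binary search. This is a minimal illustration, not any particular database's implementation; the boundary values and function names are made up for the example.

```python
import bisect

# Hypothetical boundaries: partition i holds keys in [boundaries[i-1], boundaries[i]).
# Four partitions here: keys < 1000, < 2000, < 3000, and >= 3000.
boundaries = [1000, 2000, 3000]

def partition_for(key: int) -> int:
    """Return the index of the partition responsible for `key`."""
    return bisect.bisect_right(boundaries, key)

def partitions_for_range(lo: int, hi: int) -> range:
    """A range query touches only the contiguous partitions covering [lo, hi]."""
    return range(partition_for(lo), partition_for(hi) + 1)
```

Because adjacent keys live in the same (or adjacent) partitions, a range scan contacts only a handful of nodes; that locality is exactly what hash partitioning gives up.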
2. Hash Partitioning
Use a hash function on the key to determine the partition.
- Pro: Distributes data uniformly, avoiding hotspots.
- Con: Range queries are inefficient because adjacent keys are scattered across partitions.
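A hash partitioner can be sketched in a few lines. Note the use of a stable hash (MD5 here, chosen only for illustration) rather than Python's built-in `hash()`, which is randomly salted per process and so would route the same key differently on different nodes.

```python
import hashlib

N_PARTITIONS = 8  # illustrative partition count

def hash_partition(key: str) -> int:
    """Map a key to a partition by hashing it with a stable hash function."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_PARTITIONS
```

Sequential keys like `user:100` and `user:101` hash to unrelated values, which spreads load well but forces a range query to fan out to every partition.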
A partition with disproportionately high load is called skewed. A node that receives significantly more requests than others is a hotspot.
In a dynamic system where nodes are frequently added or removed, simple modulo hashing (`partition = hash(key) mod N`, where N is the number of nodes) is disastrous because changing N causes almost all keys to be remapped.
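A quick experiment makes the problem concrete: when the node count grows from 10 to 11, a key stays put only if `hash(key) mod 10 == hash(key) mod 11`, which happens for roughly 1 in 11 keys. The hash function and key names below are illustrative.

```python
import hashlib

def stable_hash(key: str) -> int:
    """A stable 64-bit hash (built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"key-{i}" for i in range(10_000)]
# Count keys whose partition changes when N goes from 10 to 11.
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 10/11, i.e. about 91%
```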
Consistent Hashing minimizes this by mapping both keys and nodes onto a circular "ring".
When a node is added to a consistent hashing ring, only the keys that fall into the new node's range need to be moved. On average, only K/N of the keys are remapped, where K is the number of keys and N is the number of nodes.
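The ring idea can be sketched as a sorted list of node hashes searched with binary search: a key is owned by the first node clockwise from its position on the ring. This minimal version omits virtual nodes (which real systems add to smooth out load); all names are illustrative.

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    """Stable 64-bit position on the ring for a key or node name."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal consistent hashing ring without virtual nodes."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (hash, node_name)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (ring_hash(node), node))

    def remove_node(self, node: str) -> None:
        self._ring.remove((ring_hash(node), node))

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping past the end.
        i = bisect.bisect_right(self._ring, (ring_hash(key), ""))
        return self._ring[i % len(self._ring)][1]
```

Adding a node only reassigns the keys between that node's ring position and its predecessor's; every key that moves, moves to the new node, and the rest stay where they are.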
| Strategy | Range Queries | Load Balancing | Scaling Ease |
|---|---|---|---|
| Key Range | Excellent | Fair (Skew risk) | Difficult |
| Hash-based | Poor | Excellent | Difficult (if naive) |
| Consistent Hash | Poor | Excellent | Excellent |
In the next chapter, we will discuss how to maintain atomicity when a single operation spans multiple partitions.