Why Partition?

For very large datasets or very high throughput, replication is not enough. We need to break the data into partitions (also called shards). Each piece of data belongs to exactly one partition.

The main goal of partitioning is scalability. Different partitions can be stored on different nodes, allowing a large dataset to be spread across many disks and query load to be spread across many processors.

Partitioned Database Architecture

[Figure: a router/load balancer directs each request by key to the owning partition, e.g. "Alice" to Partition 1 (A-M) and "Zach" to Partition 2 (N-Z).]

Partitioning Strategies

How do we decide which record goes to which partition?

1. Key Range Partitioning

Assign a continuous range of keys (from some minimum to some maximum) to each partition.

  • Pro: Efficient range queries (e.g., "get all users with ID between 100 and 200").
  • Con: Certain access patterns can lead to hotspots. If keys are timestamps, all writes for today go to the same partition.
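
Key range partitioning can be sketched in a few lines. This is a minimal illustration, not a production scheme: the partition boundaries and key names below are hypothetical, and a real system would store boundary metadata in a coordination service and rebalance ranges over time.

```python
import bisect

# Hypothetical upper bounds: partition i owns keys up to and including
# BOUNDARIES[i]; the last partition takes everything beyond the final bound.
BOUNDARIES = ["g", "m", "t"]
NUM_PARTITIONS = len(BOUNDARIES) + 1

def range_partition(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_left(BOUNDARIES, key)

print(range_partition("alice"))  # partition 0 (keys <= "g")
print(range_partition("zach"))   # partition 3 (keys > "t")
```

Because adjacent keys land in the same partition, a range scan such as "all keys from 'a' to 'f'" touches only one partition; that locality is exactly what hash partitioning gives up.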

2. Hash Partitioning

Use a hash function on the key to determine the partition.

  • Pro: Distributes data uniformly, avoiding hotspots.
  • Con: Range queries are inefficient because adjacent keys are scattered across partitions.
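
A hash partitioner can be sketched similarly. The choice of MD5 and a fixed partition count of 8 are assumptions for illustration; any stable hash works, but note that Python's built-in hash() is randomized per process and therefore unsuitable for routing.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed fixed partition count, for illustration

def hash_partition(key: str) -> int:
    """Map a key to a partition using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Adjacent keys scatter across partitions, which is why a range query
# must fan out to every partition.
for key in ["user100", "user101", "user102"]:
    print(key, "->", hash_partition(key))
```
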
Definition 6.1
Skew and Hotspots

A partition with disproportionately high load is called skewed. A node that receives significantly more requests than others is a hotspot.

Consistent Hashing

In a dynamic system where nodes are frequently added or removed, simple modulo hashing (hash(key) mod N, where N is the number of nodes) is disastrous: changing N causes almost all keys to be remapped.
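
The scale of the remapping is easy to demonstrate with a small simulation (the user keys and node counts below are hypothetical). Going from 4 to 5 nodes under modulo hashing moves roughly 80% of keys, since a key stays put only when its hash gives the same residue mod 4 and mod 5.

```python
import hashlib

def stable_hash(key: str) -> int:
    """A stable 64-bit hash (Python's built-in hash() is per-process randomized)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user{i}" for i in range(10_000)]

# Partition assignments before and after growing from 4 to 5 nodes.
before = {k: stable_hash(k) % 4 for k in keys}
after = {k: stable_hash(k) % 5 for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # typically around 80%
```
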

Consistent Hashing minimizes this by mapping both keys and nodes onto a circular "ring".

[Figure: Consistent Hashing Ring. Visualizing nodes and keys on a logical ring to minimize remapping during scaling.]

Intuition: The Benefit

When a node is added to a consistent hashing ring, only the keys that fall into the new node's range need to be moved. With N nodes, on average only 1/N of the keys are remapped.
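
A minimal ring can be sketched as follows. This version deliberately omits virtual nodes (real systems place many replica points per node on the ring to even out load), and the node and key names are hypothetical. The key invariant: after adding a node, the only keys that change owner are those the new node now serves.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Position a node name or key on the ring via a stable 64-bit hash."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self):
        self._points = []  # sorted hash positions on the ring
        self._nodes = {}   # position -> node name

    def add_node(self, name: str) -> None:
        pos = ring_hash(name)
        bisect.insort(self._points, pos)
        self._nodes[pos] = name

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._points)
        return self._nodes[self._points[idx]]

ring = ConsistentHashRing()
for n in ["node-a", "node-b", "node-c", "node-d"]:
    ring.add_node(n)

keys = [f"user{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-e")
moved = sum(1 for k in keys if ring.get_node(k) != before[k])
print(f"{moved / len(keys):.0%} of keys moved")  # expected fraction ~1/N
```

Without virtual nodes the moved fraction varies with where "node-e" happens to land, but it is bounded by a single arc of the ring rather than the near-total reshuffle that modulo hashing causes.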

| Strategy        | Range Queries | Load Balancing   | Scaling Ease         |
| --------------- | ------------- | ---------------- | -------------------- |
| Key Range       | Excellent     | Fair (skew risk) | Difficult            |
| Hash-based      | Poor          | Excellent        | Difficult (if naive) |
| Consistent Hash | Poor          | Excellent        | Excellent            |

In the next chapter, we will discuss how to maintain atomicity when a single operation spans multiple partitions.