Rebalancing & hot shards
The problem
Data grows unevenly; one shard becomes a hotspot while others idle.
The idea
Detect skew and move data to even it out — ideally without downtime.
How it works
Skew is inevitable: data and traffic grow unevenly, so one shard turns hot while others idle. The robust fix is to pre-split into a fixed, large number of partitions (say 1000 partitions over 10 nodes) and map each key to a partition once. Rebalancing then only reassigns whole partitions to nodes and never re-hashes keys. When a shard itself must be split, online resharding runs in stages: dual-write to the old and new shards, backfill historical data in the background, verify parity, then cut reads over and decommission the old shard. A single hot key needs its own fix, because every request for that key lands on the same partition no matter how partitions are arranged: split it into sub-keys or replicate it.
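The fixed-partition idea can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: the MD5-based hash, the round-robin starting layout, and the names partition_for, initial_assignment and add_node are all assumptions made for the example.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 1000  # fixed for the life of the cluster (assumed value)


def partition_for(key: str) -> int:
    """Key -> partition. Depends only on the key, never on the node count."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def initial_assignment(nodes: list) -> dict:
    """Spread partitions round-robin over the starting nodes."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(assignment: dict, new_node: str) -> list:
    """Hand whole partitions to new_node until it holds a fair share.

    Mutates assignment in place and returns the partitions that moved.
    """
    nodes = set(assignment.values()) | {new_node}
    target = NUM_PARTITIONS // len(nodes)
    counts = Counter(assignment.values())
    moved = []
    while counts[new_node] < target:
        donor = max(counts, key=counts.get)  # most loaded node gives one up
        part = next(p for p, n in assignment.items() if n == donor)
        assignment[part] = new_node
        counts[donor] -= 1
        counts[new_node] += 1
        moved.append(part)
    return moved
```

Going from 10 to 11 nodes this way moves roughly one eleventh of the partitions, and partition_for returns the same value for every key before and after, which is the sense in which keys are never re-hashed.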
The tradeoff
Rebalancing is I/O-heavy and risky: moving live data competes with serving traffic, so you throttle it and accept temporary imbalance rather than risk a migration storm. Fixed-partition schemes make moves cheap but cap the node count at the partition count, since whole partitions are the unit of placement. Auto-rebalancing is convenient but can cascade (a move adds load, which triggers more moves), so many systems gate it behind human approval.
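One way to picture throttling and gating is a planner that proposes at most a few moves per round and proposes nothing while the cluster is within a tolerance band. The sketch below is an assumption-laden illustration: plan_moves, max_moves and tolerance are hypothetical names, and load is a single abstract number per partition.

```python
from statistics import mean


def plan_moves(node_parts, max_moves=2, tolerance=1.25):
    """Propose (partition, src, dst) moves without executing them.

    node_parts maps node -> {partition_id: load}. A round proposes at most
    max_moves moves, and none at all while the hottest node is within
    tolerance times the mean node load.
    """
    if not node_parts:
        return []
    state = {n: dict(parts) for n, parts in node_parts.items()}  # work on a copy
    plan = []
    while len(plan) < max_moves:
        totals = {n: sum(parts.values()) for n, parts in state.items()}
        hot = max(totals, key=totals.get)
        cold = min(totals, key=totals.get)
        if totals[hot] <= tolerance * mean(totals.values()):
            break  # close enough: accept the remaining imbalance
        gap = totals[hot] - totals[cold]
        # Only consider partitions whose move narrows the hot/cold gap.
        candidates = [(load, p) for p, load in state[hot].items() if 0 < load < gap]
        if not candidates:
            break
        # Prefer the partition that leaves the two nodes closest to equal.
        load, part = min(candidates, key=lambda c: abs(c[0] - gap / 2))
        state[cold][part] = state[hot].pop(part)
        plan.append((part, hot, cold))
    return plan
```

Returning a plan instead of moving data leaves room for the human-approval gate: an operator, or a rate limiter, decides whether and when each proposed move actually runs.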
In the wild
Vitess (YouTube/Slack) automates online resharding of MySQL.
Interview deep dive
Flow
- Pre-split into many fixed partitions across fewer nodes.
- Detect skew via per-shard load/size metrics.
- Reassign whole partitions, or split via dual-write → backfill → verify → cutover.
- Handle a single hot key separately: split or replicate it.
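For the last step in the flow, splitting a hot key usually means spreading its writes over several sub-keys that can land on different partitions. The sketch below assumes a counter-style workload where sub-key values can be summed on read; HOT_KEYS, SPLIT_FACTOR and the in-memory store are hypothetical stand-ins.

```python
import random
from collections import defaultdict

SPLIT_FACTOR = 8              # number of sub-keys per hot key (assumed)
HOT_KEYS = {"celebrity:123"}  # keys flagged as hot by skew detection (assumed)

store = defaultdict(int)      # stands in for the sharded counter store


def physical_key(key, rng=random):
    """Hot keys get a random suffix so their writes spread over sub-keys."""
    if key in HOT_KEYS:
        return f"{key}#{rng.randrange(SPLIT_FACTOR)}"
    return key


def increment(key, amount=1):
    store[physical_key(key)] += amount


def read(key):
    """Reads of a split key must fan out to every sub-key and combine them."""
    if key in HOT_KEYS:
        return sum(store.get(f"{key}#{i}", 0) for i in range(SPLIT_FACTOR))
    return store.get(key, 0)
```

The cost shows up on the read path: every read of a split key touches all of its sub-keys, so this suits write-hot keys. For read-hot keys, replicating the value to several nodes is the more natural fit.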
Watch for
- Re-hashing all keys on every node change (for example, hash mod N) is the anti-pattern; move whole partitions instead.
- A hot single key defeats any partitioning — split/replicate it.
- Auto-rebalance can storm: a move adds load that triggers more moves.
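The re-hashing anti-pattern is easiest to see by counting how many keys change owner when one node is added. The comparison below is a rough sketch under assumptions: an MD5-based hash, a round-robin starting layout, and the hypothetical helper names shown.

```python
import hashlib


def h(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def moved_fraction_mod_n(keys, old_n, new_n):
    """Fraction of keys whose node changes under node = hash mod N."""
    moved = sum(1 for k in keys if h(k) % old_n != h(k) % new_n)
    return moved / len(keys)


def moved_fraction_fixed(keys, partitions, old_n):
    """Fraction of keys that move when one node joins a fixed-partition
    layout and takes over its fair share of whole partitions."""
    new_n = old_n + 1
    old = {p: p % old_n for p in range(partitions)}  # round-robin layout
    new = dict(old)
    # The first partitions are spread evenly over the old nodes, so taking
    # them removes about the same number from each.
    for p in range(partitions // new_n):
        new[p] = old_n  # index of the added node
    moved = sum(1 for k in keys if old[h(k) % partitions] != new[h(k) % partitions])
    return moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
mod_n = moved_fraction_mod_n(keys, 10, 11)
fixed = moved_fraction_fixed(keys, 1000, 10)
```

With 10 nodes growing to 11, hash mod N is expected to relocate about ten elevenths of the keys, while the fixed-partition layout relocates about one eleventh, which is the minimum needed to give the new node a fair share.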
Interviewer trap
Describe online resharding as dual-write → backfill → verify → cutover, not as a single instant flip.
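A toy, single-process model makes the staged sequence concrete. Everything here is an assumption for illustration: the Resharder class, its in-memory dicts, and CRC32 routing to two target shards. It ignores deletes, concurrency and failures, which a real migration has to handle.

```python
import zlib
from enum import Enum, auto


class Stage(Enum):
    OLD_ONLY = auto()
    DUAL_WRITE = auto()
    CUT_OVER = auto()


class Resharder:
    def __init__(self, old, num_targets=2):
        self.old = old                                   # the shard being split
        self.targets = [{} for _ in range(num_targets)]  # the new shards
        self.stage = Stage.OLD_ONLY

    def _target(self, key):
        return self.targets[zlib.crc32(key.encode()) % len(self.targets)]

    def write(self, key, value):
        if self.stage is not Stage.CUT_OVER:
            self.old[key] = value
        if self.stage is not Stage.OLD_ONLY:
            self._target(key)[key] = value

    def read(self, key):
        if self.stage is Stage.CUT_OVER:
            return self._target(key).get(key)
        return self.old.get(key)  # old shard stays authoritative until cutover

    def start_dual_write(self):
        self.stage = Stage.DUAL_WRITE

    def backfill(self):
        # Copy rows that predate dual-write; never clobber a dual-written value.
        for key, value in list(self.old.items()):
            self._target(key).setdefault(key, value)

    def verify(self):
        merged = {}
        for shard in self.targets:
            merged.update(shard)
        return merged == self.old

    def cut_over(self):
        if self.stage is not Stage.DUAL_WRITE or not self.verify():
            raise RuntimeError("refusing cutover: parity not verified")
        self.stage = Stage.CUT_OVER
```

The point of the ordering is that reads stay on the old shard until parity is verified, so every step before cut_over can be abandoned without affecting what clients see.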