Academy · Partitioning & Sharding

Rebalancing & hot shards

Data grows unevenly; one shard becomes a hotspot while others idle.

Open the interactive version → diagrams, practice & more

The problem

Data grows unevenly; one shard becomes a hotspot while others idle.

The idea

Detect skew and move data to even it out — ideally without downtime.

How it works

Skew is inevitable — data and traffic grow unevenly, so one shard turns hot while others idle. The robust fix is a fixed large partition count (say 1000 partitions over 10 nodes); rebalancing just reassigns whole partitions to nodes, never re-hashing keys. Online resharding for splits runs in stages: dual-write to old and new, backfill historical data in the background, verify parity, then cut reads over and decommission. A single hot key needs its own fix — split it or replicate it, since no partition scheme helps one key.

The tradeoff

Rebalancing is IO-heavy and risky — moving live data competes with serving traffic, so you throttle it and accept temporary imbalance over a migration storm. Fixed-partition schemes make moves cheap but cap maximum nodes at the partition count. Auto-rebalancing is convenient but can cascade (a move adds load, triggering more moves), so many systems gate it behind human approval.

In the wild

Vitess (YouTube/Slack) automates online resharding of MySQL.

Interview deep dive

Flow

  1. Pre-split into many fixed partitions across fewer nodes.
  2. Detect skew via per-shard load/size metrics.
  3. Reassign whole partitions, or split via dual-write → backfill → cutover.
  4. Handle a single hot key separately: split or replicate it.

Watch for

Interviewer trap

Describe online resharding as dual-write → backfill → verify → cutover, not a flip.

Related Academy

Part of Academy on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →