System Design Library

Globally-Consistent DB (Spanner)

A horizontally-scalable SQL database with external (strong) consistency across regions.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • SQL + transactions
  • Global distribution
  • External consistency
  • Horizontal scale

Non-functional

  • Strong consistency at scale
  • Survives region loss

Scale

Global

The approach

Data sharded into splits, each a Paxos group across regions; cross-split transactions use 2PC over Paxos groups; TrueTime (GPS+atomic clocks with bounded uncertainty) lets it assign globally-ordered commit timestamps and wait out the uncertainty.

Key components

Splits = Paxos groups · 2PC coordinator (also Paxos) · TrueTime

Numbers that matter

Senior deep-dive

TrueTime is the architectural breakthrough — by bounding clock uncertainty (±7ms typical) and waiting it out before committing, Spanner assigns globally monotonic timestamps without a central coordinator, enabling external consistency (a causally stronger guarantee than serializability) at planetary scale.

Every cross-shard write goes through 2PC over Paxos — Spanner is not magic, it is careful layering of well-understood primitives (Paxos replication + 2PC coordination) with TrueTime as the glue.

Read-only transactions are free: they use a snapshot timestamp and read from the nearest replica with no locks and no 2PC, making them the performance escape hatch for reporting and analytics.

TrueTime: uncertainty as a first-class primitive

Every Spanner datacenter has GPS receivers and atomic clocks feeding a TrueTime daemon; the daemon exposes `TT.now()` returning an interval `[earliest, latest]` bounding true wall-clock time. Before committing, a Spanner leader waits until `TT.now().earliest > s` where `s` is the commit timestamp — this commit wait ensures any future transaction reads a timestamp strictly greater than `s`, giving external consistency. The uncertainty ±ε means commit wait adds at most `2ε` (~14ms) to write latency.

Paxos per split: replication before coordination

Each split (a contiguous key range, like a tablet) is its own Paxos group: a set of replicas (typically 5 across 3 regions) that agree on a log of mutations. The split leader handles reads and writes; leader leases (10s) eliminate redundant Paxos rounds for reads within the lease. Split migrations (load balancing) require moving Paxos leadership — the process involves leasing the old leader's followers to the new leader, which is why Spanner split moves are measured in seconds, not milliseconds.

Cross-shard transactions: 2PC over Paxos

A transaction touching multiple splits uses 2PC with one split designated as the transaction coordinator. The prepare phase replicates the prepare record via each participant's Paxos group; the commit phase replicates the commit record via the coordinator's Paxos group. This means a cross-shard write requires at least 3 Paxos rounds (prepare across all participants + coordinator commit) — each round is a cross-replica RTT, making cross-region transactions naturally slow.

Read-only transactions: snapshot reads at any timestamp

Snapshot reads bypass all locks and 2PC: Spanner picks a safe read timestamp (the current TrueTime minus epsilon) and routes the read to any replica with data at that timestamp. Replicas track their safe time (latest applied log entry timestamp); a read can proceed from a replica if its safe time exceeds the requested timestamp. This makes global read-only queries (analytics, reporting) lock-free and linearly scalable across replicas — the performance escape hatch most Spanner users underuse.

Schema and hotspot design

Spanner's interleaved tables physically co-locate child rows with their parent row in the same split — `INTERLEAVE IN PARENT User ON DELETE CASCADE` ensures a user and all their orders live in the same Paxos group, turning what would be cross-shard joins into local reads. Sequential primary keys (auto-increment, timestamp prefixes) create hotspot splits where all writes land on one split leader; the fix is hash-prefix or UUID keys, or using `SEQUENCE` with a large step to distribute monotonic IDs across the key space.

What breaks at scale

Commit wait latency is irreducible: if your workload needs sub-10ms global write commits, Spanner is architecturally the wrong choice — the TrueTime uncertainty wait is physics, not an implementation bug. Split hotspots under bulk loads (batch inserts with sequential keys) overload one split leader and are throttled by Spanner's admission control, causing dramatic latency spikes. Long-running transactions hold read locks on splits, blocking schema changes — Spanner schema changes are online but serialized, so a schema change on a hot table can queue behind long transactions for minutes.

In production

Google Spanner (released as Cloud Spanner in 2017) is the only production system achieving external consistency at global scale — CockroachDB and YugabyteDB are inspired by it but use hybrid logical clocks (HLC) instead of atomic clocks, trading a small consistency window for simpler hardware. Vitess (YouTube's MySQL sharding layer) solves horizontal scale without global consistency — the tradeoff is cross-shard transactions require application-level sagas. The real challenge with Spanner is hotspot prevention: lexicographically-ordered primary keys cause all new inserts to hit the same split; UUID-based or hash-prefixed keys distribute writes across splits.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →