Globally-Consistent DB (Spanner)
A horizontally-scalable SQL database with external (strong) consistency across regions.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- SQL + transactions
- Global distribution
- External consistency
- Horizontal scale
Non-functional
- Strong consistency at scale
- Survives region loss
Scale
Global
The approach
Data sharded into splits, each a Paxos group across regions; cross-split transactions use 2PC over Paxos groups; TrueTime (GPS+atomic clocks with bounded uncertainty) lets it assign globally-ordered commit timestamps and wait out the uncertainty.
Key components
Splits = Paxos groups · 2PC coordinator (also Paxos) · TrueTime
Numbers that matter
- Spanner's TrueTime uncertainty is typically ±1–7ms; commits must wait out this uncertainty window, adding up to 14ms of latency that is fundamental and cannot be optimized away.
- Cross-region write latency in Spanner is dominated by speed-of-light RTT: US East ↔ US West is ~70ms, US ↔ Europe ~140ms — the Paxos round-trip, not TrueTime, is usually the bottleneck.
- Spanner shards data into splits (like tablets) that auto-split and move; a single Spanner instance can handle millions of QPS across thousands of splits with automatic load balancing.
- Google F1 (Ads database) migrated to Spanner and serves hundreds of millions of queries per second globally; 99th-percentile commit latency is ~150ms for cross-region transactions.
Senior deep-dive
TrueTime is the architectural breakthrough — by bounding clock uncertainty (±7ms typical) and waiting it out before committing, Spanner assigns globally monotonic timestamps without a central coordinator, enabling external consistency (a causally stronger guarantee than serializability) at planetary scale.
Every cross-shard write goes through 2PC over Paxos — Spanner is not magic, it is careful layering of well-understood primitives (Paxos replication + 2PC coordination) with TrueTime as the glue.
Read-only transactions are free: they use a snapshot timestamp and read from the nearest replica with no locks and no 2PC, making them the performance escape hatch for reporting and analytics.
TrueTime: uncertainty as a first-class primitive
Every Spanner datacenter has GPS receivers and atomic clocks feeding a TrueTime daemon; the daemon exposes `TT.now()` returning an interval `[earliest, latest]` bounding true wall-clock time. Before committing, a Spanner leader waits until `TT.now().earliest > s` where `s` is the commit timestamp — this commit wait ensures any future transaction reads a timestamp strictly greater than `s`, giving external consistency. The uncertainty ±ε means commit wait adds at most `2ε` (~14ms) to write latency.
Paxos per split: replication before coordination
Each split (a contiguous key range, like a tablet) is its own Paxos group: a set of replicas (typically 5 across 3 regions) that agree on a log of mutations. The split leader handles reads and writes; leader leases (10s) eliminate redundant Paxos rounds for reads within the lease. Split migrations (load balancing) require moving Paxos leadership — the process involves leasing the old leader's followers to the new leader, which is why Spanner split moves are measured in seconds, not milliseconds.
Cross-shard transactions: 2PC over Paxos
A transaction touching multiple splits uses 2PC with one split designated as the transaction coordinator. The prepare phase replicates the prepare record via each participant's Paxos group; the commit phase replicates the commit record via the coordinator's Paxos group. This means a cross-shard write requires at least 3 Paxos rounds (prepare across all participants + coordinator commit) — each round is a cross-replica RTT, making cross-region transactions naturally slow.
Read-only transactions: snapshot reads at any timestamp
Snapshot reads bypass all locks and 2PC: Spanner picks a safe read timestamp (the current TrueTime minus epsilon) and routes the read to any replica with data at that timestamp. Replicas track their safe time (latest applied log entry timestamp); a read can proceed from a replica if its safe time exceeds the requested timestamp. This makes global read-only queries (analytics, reporting) lock-free and linearly scalable across replicas — the performance escape hatch most Spanner users underuse.
Schema and hotspot design
Spanner's interleaved tables physically co-locate child rows with their parent row in the same split — `INTERLEAVE IN PARENT User ON DELETE CASCADE` ensures a user and all their orders live in the same Paxos group, turning what would be cross-shard joins into local reads. Sequential primary keys (auto-increment, timestamp prefixes) create hotspot splits where all writes land on one split leader; the fix is hash-prefix or UUID keys, or using `SEQUENCE` with a large step to distribute monotonic IDs across the key space.
What breaks at scale
Commit wait latency is irreducible: if your workload needs sub-10ms global write commits, Spanner is architecturally the wrong choice — the TrueTime uncertainty wait is physics, not an implementation bug. Split hotspots under bulk loads (batch inserts with sequential keys) overload one split leader and are throttled by Spanner's admission control, causing dramatic latency spikes. Long-running transactions hold read locks on splits, blocking schema changes — Spanner schema changes are online but serialized, so a schema change on a hot table can queue behind long transactions for minutes.
In production
Google Spanner (released as Cloud Spanner in 2017) is the only production system achieving external consistency at global scale — CockroachDB and YugabyteDB are inspired by it but use hybrid logical clocks (HLC) instead of atomic clocks, trading a small consistency window for simpler hardware. Vitess (YouTube's MySQL sharding layer) solves horizontal scale without global consistency — the tradeoff is cross-shard transactions require application-level sagas. The real challenge with Spanner is hotspot prevention: lexicographically-ordered primary keys cause all new inserts to hit the same split; UUID-based or hash-prefixed keys distribute writes across splits.
Common mistakes
- Assuming you can get global strong consistency for free
- Ignoring commit-wait latency
- Treating clocks as exact (they have uncertainty)