Multi-region & DR
The problem
An entire region can go down (power loss, fiber cut, natural disaster). Run in only one region and a single outage takes the whole service down.
The idea
Run in multiple geographic regions with a disaster-recovery plan.
How it works
Run in multiple regions and geo-route users to the nearest healthy one. The hard part is data. Synchronous cross-region replication stays consistent but pays a speed-of-light tax on every write (roughly tens of ms per round trip); asynchronous replication is fast, but losing a region drops the unreplicated tail of writes. RPO (how much data you can afford to lose) and RTO (how long recovery may take) are the targets that pick the topology: active-passive (standby region, RTO in minutes), active-active (both regions serve, near-instant failover, needs conflict handling), or geo-partitioned (each region owns its users' data).
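To put a number on the speed-of-light tax, here is a rough back-of-envelope sketch. It assumes light in fiber travels at about 200,000 km/s (roughly two thirds of c) and ignores routing detours, queuing, and processing time, so real round trips will be slower. The function names and the `round_trips` parameter are illustrative, not from any library.

```python
# Assumption: signal speed in optical fiber is roughly 2/3 of c, about 200,000 km/s.
FIBER_KM_PER_S = 200_000.0


def fiber_rtt_ms(distance_km: float) -> float:
    """Lower bound on one round trip over a fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_S * 1000


def sync_write_latency_ms(local_ms: float, distance_km: float, round_trips: int = 1) -> float:
    """Local commit time plus the cross-region round trips a sync write waits on."""
    return local_ms + round_trips * fiber_rtt_ms(distance_km)
```

Under these assumptions a 4,000 km path costs about 40 ms per round trip, so a write that commits locally in 2 ms takes about 42 ms once it waits for one remote acknowledgement. Protocols that need more than one round trip multiply the tax.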
The tradeoff
Cross-region strong consistency is expensive enough that most systems go active-passive or accept eventual consistency between regions; only clock-coordinated stores (Spanner) keep strong consistency globally, and they pay for it in commit latency. Data-residency laws can force geo-partitioning regardless. And DR you don't test doesn't work: failover paths rot silently, so game-day drills are mandatory.
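A minimal sketch of what geo-partitioning under a residency rule might look like: each user has exactly one owning region, and any other region forwards writes instead of storing them. The country-to-region map, the region names, and the `User` type are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical residency map: which region must hold data for users of each country.
HOME_REGION = {
    "DE": "eu-central",
    "FR": "eu-central",
    "US": "us-east",
    "JP": "ap-northeast",
}


@dataclass(frozen=True)
class User:
    user_id: str
    country: str


def owning_region(user: User) -> str:
    """Return the single region allowed to store this user's data."""
    if user.country not in HOME_REGION:
        raise ValueError(f"no residency rule for country {user.country!r}")
    return HOME_REGION[user.country]


def handle_write(user: User, serving_region: str) -> str:
    """Commit locally if this region owns the user, otherwise forward to the owner."""
    owner = owning_region(user)
    if owner == serving_region:
        return f"commit in {owner}"
    return f"forward to {owner}"
```

Failing loudly on an unknown country is a deliberate choice in this sketch: silently defaulting to some region is how residency violations happen.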
In the wild
Spanner spans regions with strong consistency using TrueTime, its time API backed by GPS and atomic clocks; most other systems settle for eventual consistency between regions.
Interview deep dive
Flow
- Geo-route users to the nearest healthy region.
- Set RPO/RTO targets; pick sync vs async replication to match.
- Choose active-passive, active-active, or geo-partitioned.
- Drill failover regularly — an untested DR plan is fiction.
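The first step of the flow, geo-routing, can be sketched as picking the lowest-latency region among those passing health checks. This is a simplified illustration: real systems usually do this in DNS or anycast, and the `latency_ms` and `healthy` inputs here are assumed to come from probes that are not shown.

```python
from typing import Mapping, Optional, Set


def pick_region(latency_ms: Mapping[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the lowest-latency region that is currently healthy, or None if none is."""
    candidates = [region for region in latency_ms if region in healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])
```

When the nearest region drops out of `healthy`, the same call returns the next-nearest one, which is the failover path that needs drilling.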
Watch for
- Sync cross-region writes pay a speed-of-light latency tax.
- Async replication loses the unreplicated tail on region loss.
- Untested failover paths rot — run game days.
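The async-tail loss is easy to see in a toy model. The sketch below assumes a primary that acknowledges writes as soon as it has them locally and ships them to a replica later; the class and method names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AsyncPair:
    """A primary that acks writes locally and replicates them later."""

    primary_log: List[str] = field(default_factory=list)
    replica_log: List[str] = field(default_factory=list)

    def write(self, record: str) -> None:
        # Acknowledged once the primary has it; the replica is not consulted.
        self.primary_log.append(record)

    def replicate(self, max_records: int) -> None:
        # Ship up to max_records of the backlog to the replica, in order.
        start = len(self.replica_log)
        self.replica_log.extend(self.primary_log[start:start + max_records])

    def fail_over(self) -> List[str]:
        """Lose the primary; return the acknowledged writes the replica never saw."""
        lost = self.primary_log[len(self.replica_log):]
        self.primary_log = list(self.replica_log)
        return lost
```

Writing five records, replicating three, then calling `fail_over()` returns the last two: they were acknowledged to clients but are gone. A game day that runs this against the real system is how you learn the actual size of that tail.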
Interviewer trap
Don't name a topology before stating targets. Lead with RPO/RTO numbers; they justify active-passive vs active-active.
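One way to rehearse that argument is to write the decision down as a rule. The sketch below is only an illustration: the 60-second RTO cutoff is an assumed threshold, not a standard, and real decisions also weigh cost and conflict-handling complexity.

```python
from typing import Tuple

# Assumed cutoff: below this RTO a cold or warm standby is unlikely to be fast enough.
ACTIVE_ACTIVE_RTO_S = 60.0


def suggest_design(rpo_s: float, rto_s: float, residency_required: bool = False) -> Tuple[str, str]:
    """Map RPO/RTO targets (in seconds) to a (topology, replication) suggestion."""
    # An RPO of zero means no acknowledged write may be lost, which rules out async.
    replication = "sync" if rpo_s == 0 else "async"
    if residency_required:
        return ("geo-partitioned", replication)
    if rto_s < ACTIVE_ACTIVE_RTO_S:
        return ("active-active", replication)
    return ("active-passive", replication)
```

For example, targets of RPO 30 s and RTO 10 minutes come out as active-passive with async replication, while RPO 0 and RTO 5 s come out as active-active with sync replication.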