Academy · Reliability & Observability

Multi-region & DR

The problem

An entire region can go down (power loss, a fiber cut, a natural disaster). If you run in a single region, one regional outage is a total failure.

The idea

Run in multiple geographic regions with a disaster-recovery plan.

How it works

Run in multiple regions and geo-route users to the nearest healthy one. The hard part is data. Synchronous cross-region replication keeps regions consistent but pays a speed-of-light tax on every write (tens of milliseconds or more per round trip, depending on distance). Asynchronous replication is fast, but losing a region drops the tail of writes that had not yet replicated. RPO (recovery point objective: how much data you can afford to lose) and RTO (recovery time objective: how long recovery may take) are the targets that pick the topology: active-passive (a standby region, RTO in minutes), active-active (both regions serve traffic, near-instant failover, needs conflict handling), or geo-partitioned (each region owns its users' data).
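To put rough numbers on the sync versus async choice, here is a back-of-the-envelope sketch. It assumes light in fiber covers about 200 km per millisecond (roughly two thirds of c), and the distance, write rate, and replication lag are hypothetical values chosen only for illustration.

```python
# Assumption: light in optical fiber travels at roughly 2/3 of c,
# i.e. about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on one round trip; real routes add switching and queuing delay."""
    return 2 * distance_km / FIBER_KM_PER_MS


def async_unreplicated_writes(writes_per_second: float, replication_lag_s: float) -> float:
    """Writes acknowledged locally that the other region has not received yet."""
    return writes_per_second * replication_lag_s


# Hypothetical deployment: regions 6,000 km apart, 500 writes/s, 2 s of async lag.
sync_penalty_ms = min_round_trip_ms(6000)           # added to every synchronous write
tail_at_risk = async_unreplicated_writes(500, 2.0)  # lost if the region dies right now

print(f"sync: at least {sync_penalty_ms:.0f} ms extra per write")
print(f"async: up to {tail_at_risk:.0f} writes lost on region failure")
```

The first number is a floor on write latency that no amount of tuning removes; the second is the exposure an RPO target has to tolerate if replication is asynchronous.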

The tradeoff

Cross-region strong consistency is expensive enough that most teams go active-passive or accept eventual consistency between regions; a few stores designed for it, such as the clock-coordinated Spanner, keep strong consistency globally and pay for it in commit latency. Data-residency laws can force geo-partitioning regardless. And DR you don't test doesn't work: failover paths rot silently, so regular game-day drills are mandatory.
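One way to make a game-day drill more than a checkbox is to score it against the stated targets. The sketch below is a minimal illustration: the `DrillResult` fields, the helper names, and the timestamps are all hypothetical, and a real drill would pull these values from monitoring rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillResult:
    """Timestamps (seconds) recorded during a failover drill. Hypothetical schema."""
    outage_start_s: float           # when the primary region was taken down
    last_replicated_write_s: float  # newest write that reached the standby
    service_restored_s: float       # when the standby began serving traffic


def achieved_rpo_s(drill: DrillResult) -> float:
    """Window of writes that were lost: outage time minus last replicated write."""
    return max(0.0, drill.outage_start_s - drill.last_replicated_write_s)


def achieved_rto_s(drill: DrillResult) -> float:
    """How long users were without service."""
    return drill.service_restored_s - drill.outage_start_s


def drill_passed(drill: DrillResult, rpo_target_s: float, rto_target_s: float) -> bool:
    """A drill passes only if both measured values are within their targets."""
    return (achieved_rpo_s(drill) <= rpo_target_s
            and achieved_rto_s(drill) <= rto_target_s)


# Illustrative drill: 5 s of writes lost, 240 s to restore service.
drill = DrillResult(outage_start_s=1000.0,
                    last_replicated_write_s=995.0,
                    service_restored_s=1240.0)
print(achieved_rpo_s(drill), achieved_rto_s(drill))
print(drill_passed(drill, rpo_target_s=10.0, rto_target_s=300.0))
```

Tracking these two measurements across drills shows whether the failover path is drifting away from its targets before a real outage does.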

In the wild

Spanner spans regions with strong consistency using TrueTime, a clock service backed by GPS receivers and atomic clocks; most other systems settle for eventual consistency across regions.

Interview deep dive

Flow

  1. Geo-route users to the nearest healthy region.
  2. Set RPO/RTO targets; pick sync vs async replication to match.
  3. Choose active-passive, active-active, or geo-partitioned.
  4. Drill failover regularly — an untested DR plan is fiction.
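The first three steps of the flow can be sketched in a few lines. This is an illustration under stated assumptions, not a production router: the `Region` type, the function names, and especially the 60-second RTO cutoff are hypothetical, picked only to mirror the "minutes RTO" versus "instant failover" distinction above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float  # measured from the user's location
    healthy: bool      # result of the latest health check


def route(regions: Sequence[Region]) -> Optional[Region]:
    """Step 1: pick the lowest-latency region that is currently healthy."""
    healthy = [r for r in regions if r.healthy]
    return min(healthy, key=lambda r: r.latency_ms, default=None)


def pick_replication(rpo_s: float) -> str:
    """Step 2: zero tolerated data loss requires synchronous replication."""
    return "sync" if rpo_s == 0 else "async"


def pick_topology(rto_s: float, residency_required: bool) -> str:
    """Step 3: residency rules win; otherwise RTO decides (cutoff is an assumption)."""
    if residency_required:
        return "geo-partitioned"
    return "active-active" if rto_s < 60 else "active-passive"


regions = [
    Region("us-east", latency_ms=20, healthy=False),  # nearest, but down
    Region("eu-west", latency_ms=90, healthy=True),
    Region("ap-south", latency_ms=180, healthy=True),
]
chosen = route(regions)
print(chosen.name if chosen else "no healthy region")
print(pick_replication(rpo_s=0), pick_topology(rto_s=30, residency_required=False))
```

Note that the router skips the nearest region when it is unhealthy, which is the failover behavior step 4 exists to verify.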

Watch for

Interviewer trap

Lead with RPO/RTO numbers; they are what justifies choosing active-passive over active-active, or the reverse.

Part of Academy on SystemLore.