Academy · Reliability & Observability

Redundancy & failover

The problem

Anything can fail: a disk, a server, a rack, a whole region. A single copy of anything is a time bomb.

The idea

Eliminate single points of failure with redundant components and automatic failover.

How it works

Run N+1 instances across failure domains (zones, racks, power) and fail over automatically: health-check the primary, then promote a standby when it fails. Availability is series/parallel math: components in series multiply availabilities, so every added dependency lowers the total (three 99.9% components in a chain ≈ 99.7%), while parallel redundancy multiplies failure probabilities, which multiplies the nines (two 99% paths: 1 − 0.01² = 99.99%). But that math assumes independent failures; the real killer is correlated failure, such as a shared dependency, a bad deploy, or a config push, that takes out all "redundant" copies at once.
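The series/parallel arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes every component fails independently; the helper names `series` and `parallel` are illustrative and not taken from any library.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """One surviving path is enough, so failure probabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


two_paths = parallel(0.99, 0.99)           # two 99% paths: about 0.9999
three_hops = series(0.999, 0.999, 0.999)   # three 99.9% hops: about 0.997
```

Each extra hop in a chain pulls availability down, while each extra parallel path pushes it up, but only under the independence assumption.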

The tradeoff

Redundancy costs money and adds coordination: failover must be fast and correct, and a botched promotion causes split-brain (two nodes both acting as primary). Active-active keeps every node serving traffic and fails over almost instantly, but it needs conflict handling for concurrent writes; active-passive is simpler, but the standby sits idle and the cutover is longer and riskier. And redundancy only helps against independent faults; cell/blast-radius isolation is what limits correlated ones.

In the wild

Multi-AZ databases, active-passive and active-active deployments.

Interview deep dive

Flow

Run N+1 instances across independent failure domains.
Health-check continuously; detect a failed primary.
Promote a standby with fencing to avoid split-brain.
Re-replicate to restore the redundancy you just spent.
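
The first three steps of the flow above can be sketched as a small in-memory model. Everything here is hypothetical and for illustration only: `FencedStore`, `FailoverController`, the `is_healthy` probe, and the node names are assumptions, and fencing is modeled as an epoch number that shared storage checks on every write. Re-replication (step four) is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FencedStore:
    """Shared storage that rejects writes from a superseded primary."""
    current_epoch: int = 1
    data: Dict[str, str] = field(default_factory=dict)

    def advance_epoch(self, epoch: int) -> None:
        # Called during promotion, before the new primary takes writes.
        self.current_epoch = max(self.current_epoch, epoch)

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.current_epoch:
            return False  # stale primary is fenced off
        self.data[key] = value
        return True


@dataclass
class FailoverController:
    nodes: List[str]                    # nodes[0] is the current primary
    is_healthy: Callable[[str], bool]   # hypothetical health probe
    store: FencedStore
    failure_threshold: int = 3          # consecutive failed checks before failover
    epoch: int = 1
    misses: int = 0

    def tick(self) -> Optional[str]:
        """Run one health-check round; return the new primary if one was promoted."""
        primary = self.nodes[0]
        if self.is_healthy(primary):
            self.misses = 0
            return None
        self.misses += 1
        if self.misses < self.failure_threshold:
            return None  # not enough evidence yet; avoids flapping
        for candidate in self.nodes[1:]:
            if self.is_healthy(candidate):
                self.epoch += 1
                self.store.advance_epoch(self.epoch)  # fence first, then promote
                self.nodes.remove(candidate)
                self.nodes.insert(0, candidate)
                self.misses = 0
                return candidate
        return None  # no healthy standby: redundancy is exhausted


# Usage: db-a is down, so the third failed check promotes db-b.
down = {"db-a"}
store = FencedStore()
controller = FailoverController(["db-a", "db-b"], lambda n: n not in down, store)
results = [controller.tick() for _ in range(3)]
stale_write = store.write(1, "k", "from old primary")                 # rejected
fresh_write = store.write(controller.epoch, "k", "from new primary")  # accepted
```

Advancing the epoch at the store before promoting means a primary that was only partitioned, not dead, cannot keep writing after it has been replaced. That is the property that prevents split-brain in this sketch.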

Watch for

Parallel-redundancy math assumes independent failures.
Correlated failure (shared dep, bad deploy) defeats redundant copies.
Failover time = detection + promotion — both must be fast.
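
The failover-time sum can be made concrete with a rough budget calculation. This sketch assumes fixed-interval health checks, a primary that dies just after a passing check, and a timeout no longer than the interval; the 5 s interval, 3-miss threshold, 2 s timeout, and 30 s promotion time are made-up illustrative numbers.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def detection_seconds(interval_s: float, threshold: int, timeout_s: float) -> float:
    """Worst-case time until `threshold` consecutive checks have failed."""
    return threshold * interval_s + timeout_s


def failover_seconds(detection_s: float, promotion_s: float) -> float:
    """Failover time = detection + promotion."""
    return detection_s + promotion_s


def downtime_budget_seconds(availability: float) -> float:
    """Downtime allowed per year at a given availability target."""
    return (1 - availability) * SECONDS_PER_YEAR


detect = detection_seconds(interval_s=5, threshold=3, timeout_s=2)  # 17 s
total = failover_seconds(detect, promotion_s=30)                    # 47 s
budget = downtime_budget_seconds(0.9999)                            # roughly 3154 s per year
failovers_in_budget = int(budget // total)                          # 67
```

Under these assumptions a 99.99% target leaves room for only a few dozen failovers a year, which is why both terms of the sum matter. Lowering the threshold shortens detection but raises the risk of promoting on a false alarm.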

Interviewer trap

Raise correlated failure — redundancy multiplies nines only if faults are independent.
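
A short worked example shows how much a correlated fault costs. It is a simplified common-cause model, not a measurement: each copy is down independently with probability `p_independent`, and a shared cause (bad deploy, shared dependency) takes both down with probability `p_common`. The 0.1% common-cause figure is an assumed value for illustration.

```python
def pair_availability(p_independent: float, p_common: float = 0.0) -> float:
    """Availability of two redundant copies under a simple common-cause model."""
    # Both copies are down if the shared cause hits, or if it does not
    # and each copy happens to fail on its own.
    both_down = p_common + (1 - p_common) * p_independent ** 2
    return 1 - both_down


ideal = pair_availability(0.01)             # independent faults only: about 0.9999
realistic = pair_availability(0.01, 0.001)  # plus 0.1% shared cause: about 0.9989
```

In this model a shared cause that strikes only 0.1% of the time removes a full nine from the pair, and it dominates the total no matter how many copies are added.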
