Redundancy & failover
The problem
Anything can fail: a disk, a server, a rack, a whole region. A single copy of anything is a time bomb.
The idea
Eliminate single points of failure with redundant components and automatic failover.
How it works
Run N+1 instances across independent failure domains (zones, racks, power) and fail over automatically: health-check the primary, then promote a standby when it fails. Availability is series/parallel math: components in series multiply availabilities (two 99% components in a chain ≈ 98%), while parallel redundancy multiplies failure probabilities and so adds nines (two 99% paths ≈ 99.99%). But that math assumes independent failures; the real killer is correlated failure (a shared dependency, a bad deploy, a config push) that takes out all "redundant" copies at once.
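The series/parallel arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes independent failures unless a shared component is modeled explicitly; the helper names `series` and `parallel` and the 99.9% shared dependency are illustrative assumptions, not figures from a real system.

```python
from math import prod

def series(*availabilities):
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)

def parallel(*availabilities):
    """Any one path is enough, so failure probabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)

chain = series(0.99, 0.99)        # two required hops: about 0.9801
two_paths = parallel(0.99, 0.99)  # two independent replicas: about 0.9999

# Hypothetical shared dependency (99.9%) in front of both replicas.
# It sits in series with the redundant pair, so it caps the result.
shared = series(0.999, parallel(0.99, 0.99))  # about 0.9989

print(f"series of two:       {chain:.4f}")
print(f"parallel of two:     {two_paths:.4f}")
print(f"with shared dep:     {shared:.4f}")
```

Under these assumptions the shared dependency pulls the pair back below three nines, which is the correlated-failure point in numbers: the redundant copies are only as available as whatever they all depend on.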
The tradeoff
Redundancy costs money and adds coordination: failover must be fast and correct, and a botched promotion causes split-brain (two primaries). Active-active uses all capacity and needs no promotion step, since every node is already serving, but it needs conflict handling; active-passive is simpler but leaves the standby idle and has a longer, riskier cutover. And redundancy only helps against independent faults; cell/blast-radius isolation is what limits correlated ones.
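One common guard against split-brain is a fencing token: each promotion issues a higher epoch number, and shared storage rejects writes that carry an older one. The sketch below is an in-memory illustration only; `FencedStore` and its methods are hypothetical names, not the API of any particular database.

```python
class FencedStore:
    """Toy storage that rejects writes from an epoch older than the newest seen."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            return False  # stale primary: write is fenced off
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = FencedStore()
OLD_EPOCH, NEW_EPOCH = 1, 2  # NEW_EPOCH is issued at promotion

accepted_before = store.write(OLD_EPOCH, "x", "from old primary")
accepted_new = store.write(NEW_EPOCH, "x", "from new primary")
# The old primary wakes up after a pause and still believes it is in charge.
accepted_stale = store.write(OLD_EPOCH, "x", "late write from old primary")

print(accepted_before, accepted_new, accepted_stale)
print(store.data["x"])
```

The design choice here is that the storage layer, not the old primary, enforces the cutover, so correctness does not depend on a failed node noticing that it was demoted.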
In the wild
Multi-AZ databases, active-passive and active-active deployments.
Interview deep dive
Flow
- Run N+1 instances across independent failure domains.
- Health-check continuously; detect a failed primary.
- Promote a standby with fencing to avoid split-brain.
- Re-replicate to restore the redundancy you just spent.
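The four steps above can be sketched as a small simulation. It assumes a single controller, a threshold of three consecutive failed probes, and an epoch counter as the fencing mechanism; `Node`, `Cluster`, and the node and zone names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    zone: str
    healthy: bool = True

@dataclass
class Cluster:
    primary: Node
    standbys: list
    epoch: int = 1
    failures: int = 0
    threshold: int = 3  # consecutive failed probes before failing over

    def health_check(self):
        """Run one probe; return True if this probe triggered a failover."""
        if self.primary.healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.threshold:
            return False
        self.promote()
        return True

    def promote(self):
        candidates = [n for n in self.standbys if n.healthy]
        if not candidates:
            raise RuntimeError("no healthy standby to promote")
        self.epoch += 1  # fencing: the old primary's epoch is now stale
        new_primary = candidates[0]
        self.standbys.remove(new_primary)
        self.primary = new_primary
        self.failures = 0

    def re_replicate(self, node):
        """Add a fresh standby to restore the redundancy just spent."""
        self.standbys.append(node)

cluster = Cluster(Node("db-a", "zone-1"), [Node("db-b", "zone-2")])
cluster.primary.healthy = False
results = [cluster.health_check() for _ in range(3)]
cluster.re_replicate(Node("db-c", "zone-3"))

print(results, cluster.primary.name, cluster.epoch)
```

Requiring several consecutive failures trades slower detection for fewer false promotions. Note that between promotion and `re_replicate` the cluster has no standby at all, which is why the last step in the flow matters.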
Watch for
- Parallel-redundancy math assumes independent failures.
- Correlated failure (shared dep, bad deploy) defeats redundant copies.
- Failover time = detection + promotion — both must be fast.
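To make the failover-time point concrete, here is a rough worked example. The model (worst-case detection is the failure threshold times the sum of probe interval and probe timeout) and every number in it are illustrative assumptions, not measurements.

```python
def failover_seconds(check_interval, check_timeout, failure_threshold, promotion):
    """Worst-case detection time plus promotion time, in seconds."""
    detection = failure_threshold * (check_interval + check_timeout)
    return detection + promotion

# Assumed: probe every 5 s, 2 s timeout, 3 failed probes, 30 s promotion.
per_failover = failover_seconds(5, 2, 3, 30)

# Yearly downtime allowed by a 99.99% availability target, in seconds.
budget_seconds = 365 * 24 * 60 * 60 * (1 - 0.9999)

print(f"one failover:  {per_failover} s")
print(f"yearly budget: {budget_seconds / 60:.2f} min")
print(f"failovers that fit in the budget: {int(budget_seconds // per_failover)}")
```

With these numbers, promotion dominates the total, but tightening the probe settings shrinks detection at the cost of more false positives, so both terms need attention.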
Interviewer trap
Raise correlated failure: redundancy adds nines only if faults are independent.