System Design Library

Distributed Lock / Coordination (Chubby/ZK)

Provide distributed locks, leader election and config — correctly, through failures.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Locks/leases
  • Leader election
  • Watches
  • Small consistent config store

Non-functional

  • Strong consistency
  • Partition-safe
  • Highly available

Scale

Coordination for thousands of services

The approach

A small replicated state machine over a consensus protocol (Paxos/Raft); a majority quorum agrees on every change; clients hold time-bounded leases (not eternal locks) so a crashed client's lock expires; watches notify on change.

Key components

5-node consensus group (Raft/Paxos) · lease-based locks · clients

Numbers that matter

Senior deep-dive

Leases with bounded TTLs, not eternal locks: a client that holds a lock and then dies must not hold it forever — time-bounded leases (the client must renew before expiry) mean the system automatically recovers without explicit unlock messages.

Consensus (Paxos/Raft) underlies correctness: the lock service is only as reliable as its ability to agree on state through failures — the replicated state machine is the design, not an implementation detail.

Locks should be thin and rare: Chubby/ZK are designed for coarse-grained, infrequent locks (leader election, config distribution) — using them for per-row locking at high QPS is the canonical anti-pattern that kills clusters.

Leases beat locks: the key insight for distributed safety

An eternal lock in a distributed system is a liveness problem: if the holder crashes, the lock is held forever. Time-bounded leases solve this — the holder must renew before expiry or the lease is automatically released. The critical implication: a client must stop doing protected work before its lease expires, even if it can't contact the lock service to confirm expiry. This is why ZooKeeper ephemeral nodes (deleted when the client session dies) and Chubby leases both require clients to respect the lease duration boundary. Fencing tokens (an incrementing lock generation number) catch the edge case where a slow client resumes work after its lease expired but before it was told.

Paxos vs Raft: same correctness, different operability

Both guarantee linearizable consensus in the presence of minority failures. Paxos (Chubby) allows any node to propose, leading to Proposer conflicts (dueling proposers) that slow convergence — the Multi-Paxos optimization elects a distinguished proposer (de facto leader) to avoid this. Raft makes leader election explicit and separates it from log replication, making the algorithm easier to understand, implement correctly, and debug in production. The practical difference: Raft clusters (etcd, CockroachDB) are operationally simpler to reason about during partial failures because the leader/follower state is explicit and observable.

ZooKeeper ephemeral nodes: the distributed watchdog pattern

ZooKeeper ephemeral znodes (deleted when the client session expires) enable a pattern: every service instance creates `/services/my-service/instance-{id}` on startup. If the instance dies (or its network partitions long enough), ZooKeeper deletes the node, and watchers (load balancers, coordinators) receive a notification. This is service discovery and health checking built on top of a consistency primitive. The operational gotcha: ZooKeeper session expiry is not instantaneous — the default session timeout is 30 seconds, meaning a dead node stays in ZooKeeper's view for up to 30 seconds, which can cause load balancers to route to a dead instance.

The thundering herd on leader failover

When the Chubby/ZK master dies, all watching clients receive simultaneous notifications and attempt to reconnect, re-acquire locks, and re-elect leaders in parallel. The new leader may receive 10,000+ simultaneous reconnection requests before it has finished replaying the log and building in-memory state — causing it to also appear slow or dead, triggering another election. Production mitigations: jittered backoff on client reconnect (spread the storm over seconds), session resumption (clients reconnect to the same session ID if the timeout hasn't expired), and ensemble sizing (5 nodes not 3, so two can fail without losing quorum, reducing failover frequency).

Fencing tokens: when a client ignores its expired lease

The fencing token pattern solves the "slow client" problem: a process holds a lease, then GC pauses for 45 seconds (longer than the lease TTL), then resumes and tries to write to the protected resource — thinking it still holds the lock. The resource (e.g., a storage service) rejects any write with a token lower than the highest it has seen — because the new lock holder has a higher token. Without fencing tokens, two clients can both believe they hold the lock after one's lease expired unbeknownst to it. This is the difference between safety (fencing tokens) and liveness (leases) in distributed locking.

What breaks at scale

Chatty clients that create/delete ZooKeeper nodes at high frequency (per-request locking) overload the leader's write throughput (~5K writes/sec) and cause cascading delays — ZooKeeper is for coarse-grained coordination, not row-level locking. Watch notification storms after large cluster reconfigurations (e.g., rolling restart of 1,000 services) can saturate the ZooKeeper leader's notification dispatch, making it appear down to clients waiting for their own watches. Split-brain during network partition: if the minority partition has the old leader and it still has clients with active sessions that haven't expired, those clients may proceed with stale or conflicting state — this is why fencing tokens and lease boundaries must be enforced by the resource, not trusted client-side.

In production

Google Chubby locks the BigTable master election, Kubernetes uses etcd (Raft) for all cluster state and leader election, and Apache ZooKeeper has served Kafka's controller election, HBase's master, and Hadoop's NameNode HA for over a decade. The real engineering challenge is the thundering herd on leader transition: when a Chubby or ZooKeeper leader dies, all clients that were watching nodes get notified simultaneously and may all try to re-acquire locks or re-register — the watch notification storm can overwhelm the new leader before it has fully initialized. Production systems mitigate this with jittered reconnect delays and watch callback debouncing. etcd's design explicitly limits watch fan-out overhead by streaming incremental updates (revision-based) rather than delivering full state snapshots.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →