Distributed Lock / Coordination (Chubby/ZK)
Provide distributed locks, leader election and config — correctly, through failures.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Locks/leases
- Leader election
- Watches
- Small consistent config store
Non-functional
- Strong consistency
- Partition-safe
- Highly available
Scale
Coordination for thousands of services
The approach
A small replicated state machine over a consensus protocol (Paxos/Raft); a majority quorum agrees on every change; clients hold time-bounded leases (not eternal locks) so a crashed client's lock expires; watches notify on change.
Key components
5-node consensus group (Raft/Paxos) · lease-based locks · clients
Numbers that matter
- Chubby is designed for ~90,000 clients per cell (cluster of 5 replicas); it handles ~10,000 read requests/sec but is explicitly not designed for high-throughput lock/unlock cycles.
- Chubby's lock lease duration is ~12 seconds by default; clients renew every ~7 seconds — leaving a safety margin. A client network partition lasting >12 seconds loses its lock, a deliberate correctness choice over availability.
- ZooKeeper achieves ~10,000–100,000 reads/sec per ensemble with local reads (reads served by any node, potentially stale) and ~5,000–10,000 writes/sec (linearizable through leader).
- Raft leader election completes in ~150–300 ms with default timeouts (election timeout randomized between 150–300 ms); etcd clusters typically settle new terms in under 1 second in real deployments.
Senior deep-dive
Leases with bounded TTLs, not eternal locks: a client that holds a lock and then dies must not hold it forever — time-bounded leases (the client must renew before expiry) mean the system automatically recovers without explicit unlock messages.
Consensus (Paxos/Raft) underlies correctness: the lock service is only as reliable as its ability to agree on state through failures — the replicated state machine is the design, not an implementation detail.
Locks should be thin and rare: Chubby/ZK are designed for coarse-grained, infrequent locks (leader election, config distribution) — using them for per-row locking at high QPS is the canonical anti-pattern that kills clusters.
Leases beat locks: the key insight for distributed safety
An eternal lock in a distributed system is a liveness problem: if the holder crashes, the lock is held forever. Time-bounded leases solve this — the holder must renew before expiry or the lease is automatically released. The critical implication: a client must stop doing protected work before its lease expires, even if it can't contact the lock service to confirm expiry. This is why ZooKeeper ephemeral nodes (deleted when the client session dies) and Chubby leases both require clients to respect the lease duration boundary. Fencing tokens (an incrementing lock generation number) catch the edge case where a slow client resumes work after its lease expired but before it was told.
Paxos vs Raft: same correctness, different operability
Both guarantee linearizable consensus in the presence of minority failures. Paxos (Chubby) allows any node to propose, leading to Proposer conflicts (dueling proposers) that slow convergence — the Multi-Paxos optimization elects a distinguished proposer (de facto leader) to avoid this. Raft makes leader election explicit and separates it from log replication, making the algorithm easier to understand, implement correctly, and debug in production. The practical difference: Raft clusters (etcd, CockroachDB) are operationally simpler to reason about during partial failures because the leader/follower state is explicit and observable.
ZooKeeper ephemeral nodes: the distributed watchdog pattern
ZooKeeper ephemeral znodes (deleted when the client session expires) enable a pattern: every service instance creates `/services/my-service/instance-{id}` on startup. If the instance dies (or its network partitions long enough), ZooKeeper deletes the node, and watchers (load balancers, coordinators) receive a notification. This is service discovery and health checking built on top of a consistency primitive. The operational gotcha: ZooKeeper session expiry is not instantaneous — the default session timeout is 30 seconds, meaning a dead node stays in ZooKeeper's view for up to 30 seconds, which can cause load balancers to route to a dead instance.
The thundering herd on leader failover
When the Chubby/ZK master dies, all watching clients receive simultaneous notifications and attempt to reconnect, re-acquire locks, and re-elect leaders in parallel. The new leader may receive 10,000+ simultaneous reconnection requests before it has finished replaying the log and building in-memory state — causing it to also appear slow or dead, triggering another election. Production mitigations: jittered backoff on client reconnect (spread the storm over seconds), session resumption (clients reconnect to the same session ID if the timeout hasn't expired), and ensemble sizing (5 nodes not 3, so two can fail without losing quorum, reducing failover frequency).
Fencing tokens: when a client ignores its expired lease
The fencing token pattern solves the "slow client" problem: a process holds a lease, then GC pauses for 45 seconds (longer than the lease TTL), then resumes and tries to write to the protected resource — thinking it still holds the lock. The resource (e.g., a storage service) rejects any write with a token lower than the highest it has seen — because the new lock holder has a higher token. Without fencing tokens, two clients can both believe they hold the lock after one's lease expired unbeknownst to it. This is the difference between safety (fencing tokens) and liveness (leases) in distributed locking.
What breaks at scale
Chatty clients that create/delete ZooKeeper nodes at high frequency (per-request locking) overload the leader's write throughput (~5K writes/sec) and cause cascading delays — ZooKeeper is for coarse-grained coordination, not row-level locking. Watch notification storms after large cluster reconfigurations (e.g., rolling restart of 1,000 services) can saturate the ZooKeeper leader's notification dispatch, making it appear down to clients waiting for their own watches. Split-brain during network partition: if the minority partition has the old leader and it still has clients with active sessions that haven't expired, those clients may proceed with stale or conflicting state — this is why fencing tokens and lease boundaries must be enforced by the resource, not trusted client-side.
In production
Google Chubby locks the BigTable master election, Kubernetes uses etcd (Raft) for all cluster state and leader election, and Apache ZooKeeper has served Kafka's controller election, HBase's master, and Hadoop's NameNode HA for over a decade. The real engineering challenge is the thundering herd on leader transition: when a Chubby or ZooKeeper leader dies, all clients that were watching nodes get notified simultaneously and may all try to re-acquire locks or re-register — the watch notification storm can overwhelm the new leader before it has fully initialized. Production systems mitigate this with jittered reconnect delays and watch callback debouncing. etcd's design explicitly limits watch fan-out overhead by streaming incremental updates (revision-based) rather than delivering full state snapshots.
Common mistakes
- Locks without lease timeouts (deadlock on crash)
- Using it as a general database (it's tiny & slow on purpose)
- Even-sized clusters (no clean majority)