System Design Library

Raft Key-Value Store (etcd)

A small, strongly-consistent KV store for config, locks and coordination.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Linearizable get/put
  • Watches
  • Leases/locks
  • Cluster membership

Non-functional

  • Strong consistency
  • Partition-safe
  • Highly available

Scale

Coordination layer

The approach

A replicated state machine over Raft: a leader appends entries to a majority before commit; reads are linearizable (via leader/lease); watches stream changes; leases expire locks on client death.

Key components

Raft cluster (odd size) · leader · replicated log

Numbers that matter

Senior deep-dive

Raft consensus is etcd's entire value proposition: every write goes through a leader that replicates to a quorum of followers before acknowledging — the client gets linearizability, not eventual consistency, which is why etcd is used for Kubernetes control plane state and not a general-purpose cache.

etcd is optimized for small, infrequent writes and many reads — it is not a data store for large payloads or high-throughput writes; values should be <1MB and write rates should be <<1000/sec or you will saturate the Raft log.

Lease-based locks expire automatically — a client that crashes or is network-partitioned loses its lease within the TTL, preventing the indefinite lock-hold that plagues 2PC.

Raft: why a single leader is the right choice for coordination

Raft's leader is the sole writer to the log — all writes go to the leader, which eliminates write conflicts but creates a fan-out bottleneck: every write replicates to a quorum before returning to the client. For coordination workloads (leader election, config, locks) this is fine — write volume is low. The non-obvious benefit: reads from the leader are linearizable by default (the leader checks its commit index before serving), which is what clients of a coordination service actually need.

Watches: the subscription primitive

etcd's Watch API streams change events for a key or key prefix from a given revision, enabling controllers to react to state changes without polling. Watches are held as long-lived gRPC streams on the server and use the MVCC revision as the resume point — if a client disconnects and reconnects, it resumes from the last-seen revision, missing no events. The Kubernetes controller pattern depends entirely on this: every controller is a watch loop that reconciles desired vs actual state.

Leases: bounded distributed locks

A lease is a time-bounded grant from the etcd cluster; keys can be attached to a lease and are automatically deleted when the lease expires. This is how Kubernetes implements pod leaders and how distributed locks work: acquire a lease, write your identity under a leased key, and refresh the lease TTL heartbeat. A client that crashes or is partitioned fails to refresh, the TTL expires, and the key is deleted — no coordinator intervention needed. The failure mode is too-short TTLs causing spurious lock loss under GC pauses or network hiccups.

MVCC and compaction: mandatory maintenance

etcd stores every revision of every key in a BoltDB B-tree (the `db` file on disk). Without compaction, this file grows until disk exhaustion — every `Put` adds a new revision, and old revisions pile up. Compaction deletes revisions older than a specified revision number; defragmentation reclaims the free pages in the BoltDB file (compaction marks pages free but doesn't shrink the file). Defragmentation is an online operation that briefly holds a write lock — it must be scheduled during low-traffic windows.

Cluster sizing: 3 vs 5 nodes

A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. More nodes means more quorum latency (write must replicate to 3 of 5 vs 2 of 3) and lower write throughput. The production pattern for Kubernetes: 3 nodes for small clusters, 5 nodes for large/critical clusters, never 4 or 6 (even numbers are strictly worse — same fault tolerance as N-1 with more overhead). etcd is a CP system: under partition, the majority partition continues and the minority partition rejects all writes.

What breaks at scale

Disk I/O latency is the most common etcd production problem: etcd measures Raft log fsync latency; if the underlying disk is too slow (>10ms p99) or shared with other workloads, leader elections become frequent and the cluster becomes unstable. Kubernetes event storms (many failing pods emitting events) overwhelm a shared etcd with write volume — the fix is a dedicated events etcd cluster. Large value writes (e.g., storing a large ConfigMap) cause Raft log entries that take hundreds of milliseconds to replicate, blocking all other writes behind them in the single leader queue.

In production

Kubernetes stores all cluster state (pods, services, configmaps, secrets) in etcd — every `kubectl apply` is a write to etcd via the API server, and every controller watch is an etcd watch stream. Consul (HashiCorp) provides a similar Raft-backed KV plus service mesh and health checking in one product. ZooKeeper predates etcd and uses ZAB (ZooKeeper Atomic Broadcast) instead of Raft — functionally equivalent but operationally harder and Java-based. The real challenge in production is etcd compaction and defragmentation: the MVCC store keeps old revisions forever until compacted, causing the db file to grow until disk fills; scheduled compaction + defrag is mandatory maintenance.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →