Raft Key-Value Store (etcd)
A small, strongly-consistent KV store for config, locks and coordination.
Requirements
Functional
- Linearizable get/put
- Watches
- Leases/locks
- Cluster membership
Non-functional
- Strong consistency
- Partition-safe
- Highly available
Scale
Coordination layer
The approach
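A replicated state machine over Raft: the leader replicates each log entry to a majority before committing it; reads are linearizable by default (served only after the leader has confirmed its leadership with a quorum, so a deposed leader cannot return stale data); watches stream changes; leases release locks when a client dies.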
Key components
Raft cluster (odd size) · leader · replicated log
Numbers that matter
- etcd's default maximum request size is 1.5 MiB (`--max-request-bytes`), which also caps the size of any single value; storing large blobs violates the design contract and degrades Raft log performance.
- A well-tuned 3-node etcd cluster can handle ~10,000 writes/sec sustained; a 5-node cluster handles fewer writes (more quorum overhead) but tolerates 2 failures instead of 1.
- etcd's Raft heartbeat interval defaults to 100ms and election timeout to 1000ms; in a network with >200ms RTT between nodes you must tune these up or spurious leader elections destabilize the cluster.
- Kubernetes clusters with >5,000 nodes typically require a dedicated etcd cluster for events (which generate high write volume) separate from the main cluster state to avoid Raft log backpressure.
Senior deep-dive
Raft consensus is etcd's entire value proposition: every write goes through a leader that replicates to a quorum of followers before acknowledging — the client gets linearizability, not eventual consistency, which is why etcd is used for Kubernetes control plane state and not a general-purpose cache.
etcd is optimized for small, infrequent writes and many reads; it is not a data store for large payloads or high-throughput writes. Keep values under 1MB and keep sustained write rates well under 1,000/sec, far below what a tuned cluster can reach in a benchmark, or you risk saturating the Raft log.
Lease-based locks expire automatically — a client that crashes or is network-partitioned loses its lease within the TTL, preventing the indefinite lock-hold that plagues 2PC.
Raft: why a single leader is the right choice for coordination
Raft's leader is the sole writer to the log. All writes go to the leader, which eliminates write conflicts but makes the leader a bottleneck: every write replicates to a quorum before returning to the client. For coordination workloads (leader election, config, locks) this is fine because write volume is low. The non-obvious benefit: reads are linearizable by default. Before serving a read, the leader confirms with a quorum that it is still the leader and waits until it has applied everything up to the commit index it recorded at that moment (Raft's ReadIndex). That guarantee is what clients of a coordination service actually need.
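The quorum rule above can be sketched in a few lines. This is a simplified illustration, not etcd's code: the function name `commit_index` and the `match_index` list are assumptions, and real Raft additionally requires that the entry being committed belongs to the leader's current term.

```python
def commit_index(match_index: list[int]) -> int:
    """Highest log index that is stored on a majority of nodes.

    match_index holds, for every node including the leader, the last log
    index known to be replicated on that node.
    """
    ranked = sorted(match_index, reverse=True)
    # With n nodes the quorum is n // 2 + 1, so the quorum-th highest
    # value sits at position n // 2 in the descending list.
    return ranked[len(ranked) // 2]


# Three nodes: the leader and one follower hold index 7, one follower lags at 5.
print(commit_index([7, 7, 5]))        # 7: two of three nodes hold index 7
# Five nodes: only two hold index 9, but three hold index 6 or higher.
print(commit_index([9, 9, 6, 4, 4]))  # 6
```

The second call shows why a slow minority does not hold back commits, while a slow majority does.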
Watches: the subscription primitive
etcd's Watch API streams change events for a key or key prefix from a given revision, enabling controllers to react to state changes without polling. Watches are held as long-lived gRPC streams on the server and use the MVCC revision as the resume point: if a client disconnects and reconnects, it resumes from the last-seen revision and misses no events, provided that revision has not been compacted in the meantime. If it has, the watch fails with a compaction error and the client must re-read current state before watching again. The Kubernetes controller pattern depends entirely on this: every controller is a watch loop that reconciles desired vs actual state.
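Resuming from a revision can be illustrated with a toy in-memory model. `ToyStore`, `Event` and `events_since` are hypothetical names for this sketch and are not etcd's client API; a real watch is a stream rather than a list, but the resume arithmetic is the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    revision: int
    key: str
    value: str


class ToyStore:
    """In-memory stand-in for a revisioned KV store (not etcd's API)."""

    def __init__(self) -> None:
        self.revision = 0
        self.history: list[Event] = []

    def put(self, key: str, value: str) -> int:
        """Record a write under the next store-wide revision."""
        self.revision += 1
        self.history.append(Event(self.revision, key, value))
        return self.revision

    def events_since(self, prefix: str, start_revision: int) -> list[Event]:
        """Events for keys under prefix with revision >= start_revision."""
        return [e for e in self.history
                if e.revision >= start_revision and e.key.startswith(prefix)]


store = ToyStore()
store.put("/pods/a", "Pending")   # revision 1
store.put("/nodes/n1", "Ready")   # revision 2, outside the watched prefix

# First connection: process events and remember the last revision seen.
last_seen = 0
for event in store.events_since("/pods/", last_seen + 1):
    last_seen = event.revision

# Writes land while the watcher is disconnected.
store.put("/pods/a", "Running")   # revision 3
store.put("/pods/b", "Pending")   # revision 4

# Reconnect: resume right after the last revision that was processed.
missed = store.events_since("/pods/", last_seen + 1)
print([(e.revision, e.key, e.value) for e in missed])
# [(3, '/pods/a', 'Running'), (4, '/pods/b', 'Pending')]
```

The client only has to persist one integer (`last_seen`) to get gap-free delivery across reconnects.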
Leases: bounded distributed locks
A lease is a time-bounded grant from the etcd cluster; keys can be attached to a lease and are automatically deleted when the lease expires. This is how etcd-based leader election and distributed locks work: acquire a lease, write your identity under a leased key, and refresh the lease with periodic keep-alives. (Kubernetes leader election follows the same acquire-and-renew idea, but through Lease objects in the API server rather than raw etcd leases.) A client that crashes or is partitioned fails to refresh, the TTL expires, and the key is deleted, with no coordinator intervention needed. The failure mode is too-short TTLs causing spurious lock loss under GC pauses or network hiccups.
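The lease lifecycle can be simulated with a manual clock. This is a toy model under stated assumptions: `ToyLeaseStore` and its methods (`grant`, `keep_alive`, `try_lock`, `advance`) are hypothetical names for illustration, not etcd's client API, and expiry is checked only when the clock is advanced.

```python
class ToyLeaseStore:
    """Toy model of leased keys driven by a manual clock (not etcd's API)."""

    def __init__(self) -> None:
        self.now = 0.0
        self.next_id = 1
        self.ttl: dict[int, float] = {}       # lease id -> granted TTL
        self.expiry: dict[int, float] = {}    # lease id -> expiry time
        self.keys: dict[str, tuple[str, int]] = {}  # key -> (owner, lease id)

    def advance(self, seconds: float) -> None:
        """Move the clock forward and delete keys whose lease has expired."""
        self.now += seconds
        dead = {lid for lid, t in self.expiry.items() if t <= self.now}
        for lid in dead:
            del self.expiry[lid], self.ttl[lid]
        self.keys = {k: v for k, v in self.keys.items() if v[1] not in dead}

    def grant(self, ttl: float) -> int:
        """Create a lease that expires ttl seconds from now unless refreshed."""
        lease_id = self.next_id
        self.next_id += 1
        self.ttl[lease_id] = ttl
        self.expiry[lease_id] = self.now + ttl
        return lease_id

    def keep_alive(self, lease_id: int) -> bool:
        """Refresh the lease; returns False if it has already expired."""
        if lease_id not in self.expiry:
            return False
        self.expiry[lease_id] = self.now + self.ttl[lease_id]
        return True

    def try_lock(self, key: str, owner: str, lease_id: int) -> bool:
        """Create the key only if it is absent and the lease is still live."""
        if key in self.keys or lease_id not in self.expiry:
            return False
        self.keys[key] = (owner, lease_id)
        return True


store = ToyLeaseStore()
a = store.grant(ttl=10)
print(store.try_lock("/lock/job", "client-a", a))   # True

store.advance(6)
store.keep_alive(a)            # client-a heartbeats once, then crashes
b = store.grant(ttl=10)
print(store.try_lock("/lock/job", "client-b", b))   # False: still held

store.advance(5)
store.keep_alive(b)            # client-b keeps its own lease alive
print(store.try_lock("/lock/job", "client-b", b))   # False: a's lease has 5s left

store.advance(5)               # a's lease expires, its key is deleted
print(store.try_lock("/lock/job", "client-b", b))   # True
```

The TTL is the trade-off knob: a shorter TTL frees the lock sooner after a crash, but a pause longer than the TTL makes a healthy client lose the lock.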
MVCC and compaction: mandatory maintenance
etcd stores every revision of every key in a bbolt (BoltDB) B+tree (the `db` file on disk). Without compaction, this file grows until it hits the storage quota or the disk fills: every `Put` adds a new revision, and old revisions pile up. Compaction deletes revisions older than a specified revision number; defragmentation reclaims the free pages in the bbolt file (compaction marks pages free but doesn't shrink the file). Defragmentation runs on a live member but blocks that member's reads and writes while the file is rebuilt, so run it on one member at a time during low-traffic windows.
Cluster sizing: 3 vs 5 nodes
A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. More nodes means more quorum latency (write must replicate to 3 of 5 vs 2 of 3) and lower write throughput. The production pattern for Kubernetes: 3 nodes for small clusters, 5 nodes for large/critical clusters, never 4 or 6 (even numbers are strictly worse — same fault tolerance as N-1 with more overhead). etcd is a CP system: under partition, the majority partition continues and the minority partition rejects all writes.
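The sizing argument is plain majority arithmetic, sketched below. The helper names `quorum` and `fault_tolerance` are chosen for this example.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def fault_tolerance(n: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return n - quorum(n)


for n in range(1, 8):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# 1 nodes: quorum 1, tolerates 0 failure(s)
# 2 nodes: quorum 2, tolerates 0 failure(s)
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 4 nodes: quorum 3, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)
# 6 nodes: quorum 4, tolerates 2 failure(s)
# 7 nodes: quorum 4, tolerates 3 failure(s)
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without raising the number of tolerated failures, which is the case against even sizes.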
What breaks at scale
Disk I/O latency is the most common etcd production problem: etcd measures Raft log fsync latency; if the underlying disk is too slow (>10ms p99) or shared with other workloads, leader elections become frequent and the cluster becomes unstable. Kubernetes event storms (many failing pods emitting events) overwhelm a shared etcd with write volume — the fix is a dedicated events etcd cluster. Large value writes (e.g., storing a large ConfigMap) cause Raft log entries that take hundreds of milliseconds to replicate, blocking all other writes behind them in the single leader queue.
In production
Kubernetes stores all cluster state (pods, services, configmaps, secrets) in etcd — every `kubectl apply` is a write to etcd via the API server, and every controller watch is an etcd watch stream. Consul (HashiCorp) provides a similar Raft-backed KV plus service mesh and health checking in one product. ZooKeeper predates etcd and uses ZAB (ZooKeeper Atomic Broadcast) instead of Raft — functionally equivalent but operationally harder and Java-based. The real challenge in production is etcd compaction and defragmentation: the MVCC store keeps old revisions forever until compacted, causing the db file to grow until disk fills; scheduled compaction + defrag is mandatory maintenance.
Common mistakes
- Using it as a high-throughput database
- Even-sized clusters (larger quorum, no extra fault tolerance)
- Locks without leases (deadlock on crash)