System Design Library

Key-Value Store (DynamoDB)

A highly-available, partitioned, replicated key-value store with tunable consistency.

Open the interactive version → diagrams, practice & more

Requirements

Functional

get/put by key
Partitioned & replicated
Tunable consistency (N/R/W)

Non-functional

Always writable (AP-leaning)
Survives node/zone loss

Scale

Internet-scale

The approach

Consistent hashing for partitioning; each key replicated to N nodes; quorum reads/writes (R+W>N for strong, smaller for speed); conflicts resolved via vector clocks/last-write-wins; gossip for membership; hinted handoff for failures.

Key components

Coordinator node → replica set (N) · gossip membership · anti-entropy (Merkle)

Numbers that matter

DynamoDB's single-digit millisecond read/write latency holds at P99 < 5ms via SSD-backed replicas and in-memory caching (DAX adds ~microsecond layer).
A typical DynamoDB table is replicated to 3 availability zones with a quorum of W=2, R=2 out of N=3 — providing strong consistency without sacrificing availability in single-AZ failures.
Consistent hashing with virtual nodes: DynamoDB-style systems use ~hundreds of virtual nodes per physical node to balance load across heterogeneous hardware.
A hot partition in DynamoDB historically could handle ~1,000 WCU/s per partition — the 2018 adaptive capacity update relaxed this, but partition key design remains critical for avoiding hotspots.

Senior deep-dive

Consistent hashing for partitioning + quorum reads/writes is the entire consistency model — the magic of DynamoDB-style systems is that you can dial read/write quorums independently to trade consistency for latency.

Vector clocks (or version vectors) are the right tool for detecting divergent replicas, but they're complex enough that most production systems replace them with last-write-wins (LWW) on a per-attribute basis — simpler, lossy for concurrent writes, but operationally tractable.

Sloppy quorum + hinted handoff is how availability beats consistency: if a node is down, a different node accepts writes on its behalf (hint) and delivers them when the node recovers — this is the Dynamo innovation that makes it AP, not CP.

Consistent hashing: why the ring and virtual nodes

Modulo hashing (key % N) is the anti-pattern: add one node and you must remap N-1/N of all keys. A consistent hash ring maps both keys and nodes to a circular space — a key is owned by the first node clockwise. Adding a node moves only the keys between it and its predecessor. Virtual nodes place each physical node at multiple ring positions, smoothing out load imbalance from random ring placement. The trade-off: more vnodes = better balance, but larger routing tables and slower rebalancing metadata propagation.

Quorum reads/writes: the consistency dial

With N replicas, W writes required, and R reads required, strong consistency requires R + W > N. Common settings: N=3, W=2, R=2 (strong); N=3, W=1, R=1 (eventual, fast). Sloppy quorum is the Dynamo extension: if a target replica is down, a different node fills in, counting toward W — but the data is now on the 'wrong' node and must be handed off when the target recovers. Sloppy quorum sacrifices strict consistency for availability and is what makes Dynamo-style systems AP rather than CP in CAP terms.

Conflict resolution: vector clocks vs. LWW

Vector clocks track causality between writes — if write A happened-before write B, B wins; if concurrent, the application resolves the conflict. DynamoDB's original design used vector clocks; the production system moved to LWW (last-write-wins) on a per-item basis using a server-side timestamp. LWW is simple and fast but silently drops concurrent writes — two clients updating different attributes of the same item can have one update lost. DynamoDB's conditional writes (optimistic locking via version number) are the escape hatch for applications that need conflict detection.

Hinted handoff and anti-entropy: eventual consistency mechanics

Hinted handoff: when a replica is unavailable, the coordinator stores the write locally with a 'hint' (intended recipient) and delivers it when the recipient recovers. This keeps writes available during transient node failures. Anti-entropy with Merkle trees: each node computes a Merkle tree over its key ranges; nodes exchange tree hashes to find divergent subtrees, then sync only the divergent keys. This is how eventual consistency is enforced proactively rather than waiting for read-repair to fix inconsistencies.

Read repair and monotonic reads

Read repair detects inconsistency during a read: if two replicas return different values for the same key, the coordinator writes the newer version back to the stale replica. This is lazy anti-entropy — it heals inconsistency opportunistically. Monotonic reads (a client never sees older data after seeing newer data) require session tokens or sticky routing to a replica — a client that reads from replica A and then replica B during A's replication lag will see time go backward. DynamoDB strongly consistent reads always go to the leader to avoid this.

What breaks at scale

Hot partition keys are the most common production failure mode in DynamoDB deployments. A table with a partition key of `user_id` and a celebrity user with 10M followers funnels all activity-feed writes to one partition. Mitigation: write sharding — append a random suffix (0–N) to the partition key, distribute writes across N partitions, and aggregate on read. The second failure: LSM compaction falling behind write throughput (in Cassandra/LevelDB-backed stores) causes read amplification to spike as SSTables accumulate — compaction must be sized as a first-class resource, not an afterthought.

In production

Amazon DynamoDB is the canonical implementation; the original Dynamo paper (2007) describes the exact architecture: consistent hashing, vector clocks, sloppy quorum, hinted handoff, and Merkle tree anti-entropy. Apache Cassandra is the open-source analog, using the same consistent hashing ring but with CQL and a richer data model. The real engineering challenge is hot partition management: a poorly chosen partition key (e.g., using a timestamp as the partition key for time-series data) funnels all writes to one node, bypassing the distribution entirely — partition key design is the single most impactful decision in any Dynamo-style deployment.

Common mistakes

Assuming strong consistency by default
Ignoring conflict resolution
No anti-entropy (replicas drift forever)

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →