System Design Library

Distributed Cache (Redis)

A horizontally-scaled in-memory cache with sharding, replication and eviction.

Open the interactive version → diagrams, practice & more

Requirements

Functional

get/set/del with TTL
Sharded across nodes
Replication
Eviction (LRU/LFU)

Non-functional

Sub-ms
Available under node loss
Even distribution

Scale

TB of hot data

The approach

Consistent hashing (with virtual nodes) maps keys → nodes so membership changes move few keys; each shard has replicas for failover; LRU/LFU eviction; client or proxy routes keys.

Key components

Client/proxy → hash ring → cache nodes (+ replicas)

Numbers that matter

Redis on a single node serves ~100k–500k ops/sec at <1ms P99; a cluster of 10 nodes scales this linearly for independent key spaces.
Virtual nodes per physical node: 150–300 is the standard Redis Cluster and Cassandra configuration — this gives a standard deviation of key distribution of ~10% across nodes.
LRU eviction with a maxmemory-policy in Redis can use ~10–20 bytes of overhead per key beyond the value, so key-count planning is as important as byte-count planning.
Consistent hashing moves ~1/N of keys when a node is added to an N-node cluster — adding a node to a 10-node cluster moves ~10% of keys; naive modulo hashing moves ~90%.

Senior deep-dive

Consistent hashing with virtual nodes is not optional at scale — without it, adding one node to a 10-node cluster moves 90% of keys; with ~150 virtual nodes per physical node, the move fraction drops to ~10%.

The cache invalidation problem is not a cache problem, it's a write path design problem: the source of truth (DB) must be the one that drives invalidation — cache-aside with TTL is the safe default, but stale reads during TTL windows are a business decision, not a technical failure.

Replication lag in async replica caches causes split-brain reads: a write goes to the primary, a read hits a replica before replication completes — this is why read-your-own-writes consistency requires sticky routing to the primary for a short window after a write.

Consistent hashing: virtual nodes are the real algorithm

The naive consistent hashing ring puts each physical node once on the ring — load is wildly uneven because a few lucky positions cover more key space. Virtual nodes (vnodes) place each physical node at 150–300 positions, giving a statistical distribution that converges to even. When a node is added, it takes over some vnodes from its neighbors; only the keys in those vnode ranges move (~1/N of total keys). Redis Cluster implements this via 16,384 fixed hash slots assigned to nodes — conceptually identical, just discretized.

Cache invalidation: the write path owns this problem

Cache-aside (read: check cache → miss → fetch DB → populate cache; write: update DB → delete cache key) is the correct default pattern. Never update the cache on write — it creates a race between the DB write and the cache write where the cache can end up with stale data. Delete, don't update: the next read will repopulate. The subtle failure mode: if the cache delete fails after a DB write, you've permanently stale-cached a key until TTL expires. A write-through design (write to cache and DB atomically) avoids this but adds write latency.

Thundering herd: the most common production incident

When a hot key expires, hundreds of concurrent cache misses all try to fetch the same DB record. The DB sees an instantaneous spike it wasn't designed for. Mutex/lock-based solutions let one request fetch and populate the cache while others wait — but this requires distributed coordination. Probabilistic early expiration (PER) recomputes a value slightly before it expires based on a probability function — it spreads the refresh load over time. Jitter on TTLs (adding random ±10% to expiry times) is the simplest mitigation: prevents synchronized mass expiration.

Eviction policies: the memory is not free

LRU (Least Recently Used) is the default intuition but Redis's approximated LRU (sample 5 random keys, evict the oldest) is cheaper than exact LRU. LFU (Least Frequently Used) is better for workloads with stable hot keys that haven't been accessed in a few seconds — LRU would wrongly evict them. allkeys-lru vs. volatile-lru: allkeys evicts any key; volatile only evicts keys with a TTL set. In practice, set TTLs on everything and use volatile-lru — it prevents permanent key accumulation and gives the eviction policy something to work with.

Replication and consistency: primary vs. replica reads

Redis replica replication is asynchronous by default — a write to the primary is acknowledged before replicas have it. Reading from a replica immediately after a write can return stale data. For read-your-own-writes consistency, route reads to the primary for a short window (1–2 seconds) after a write, then fall back to replicas. For strictly consistent reads, WAIT command blocks until N replicas acknowledge — available but rarely used in practice due to latency cost.

What breaks at scale

Hot keys — a single key receiving millions of RPS — cannot be solved by adding nodes (the key always lands on one node). Mitigations: key replication (write to K copies with a suffix, read from a random copy), or local in-process caching of the hot key for 100ms. The second failure mode: Redis Cluster resharding under live traffic — moving slots while serving traffic causes key-miss spikes as the slot moves between nodes. Redis handles this with MIGRATING/IMPORTING states on affected slots, but the migration window must be monitored — a stalled migration leaves a slot in a split state and breaks reads for those keys.

In production

Redis Cluster uses CRC16 mod 16384 hash slots assigned to nodes (a form of consistent hashing with fixed slot count) — this simplifies slot reassignment but requires a fixed slot table rather than a pure ring. Memcached uses client-side consistent hashing (no server coordination), making it simpler but unable to handle server failures automatically. The real engineering challenge is thundering herd on cache miss: when a popular key expires, thousands of concurrent requests miss the cache simultaneously and hammer the DB — mutex locking (only one request fetches, others wait) or probabilistic early expiration are the standard mitigations.

Common mistakes

key % N hashing (mass reshuffle on scale)
No virtual nodes (hot shards)
Ignoring the thundering-herd on cache miss

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →