System Design Library

Wide-Column Store (Cassandra)

A write-optimized, masterless, partitioned column store with tunable consistency.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Partitioned by key
Masterless writes
Tunable consistency
Write-heavy

Non-functional

Always writable
Linear scale
Survives node loss

Scale

PB across clusters

The approach

Consistent-hashing ring (no master); writes go to any node, replicated to N, with quorum tunables; storage is LSM-tree (memtable → SSTable, compaction) optimized for writes; gossip + read-repair + Merkle anti-entropy.

Key components

Any node (coordinator) → replicas · LSM storage · gossip

Numbers that matter

Cassandra sustains ~1 million writes/sec per node in write-optimized configurations (ONE consistency, async commit log) — writes to memtable are effectively RAM speed.
With RF=3 and LOCAL_QUORUM (R=2, W=2), the cluster tolerates one node failure in a DC while staying both readable and writable — the canonical production setting.
SSTable bloom filters are ~6 bits/element (at ~1% false-positive rate); without them, a partition key miss touches every SSTable on disk — on a busy node that can mean dozens of files.
Tombstones older than `gc_grace_seconds` (default 10 days) are eligible for purge during compaction — a table with heavy deletes that never compacts accumulates tombstones that cause read timeouts.

Senior deep-dive

The write path is a lie about durability: Cassandra acknowledges a write after memtable + commit log, before any SSTable flush — the system is durable only if the commit log survives node death.

Quorum is the consistency dial: R + W > N gives strong consistency, but ALL is a latency trap and ONE is a data-loss trap — the operationally safe default is LOCAL_QUORUM in a multi-DC setup.

Compaction is the hidden cost: LSM-tree SSTables accumulate and reads touch multiple files until compaction merges them — read amplification spikes without tuned compaction strategy.

Data modeling is the only hard part

Cassandra forces you to model around queries, not entities. The partition key determines which node holds the data; the clustering key determines sort order within a partition. Get the partition key wrong — too coarse (hot partition) or too fine (too many small partitions, expensive scatter-gather) — and no amount of hardware fixes it. Wide partitions (millions of rows per partition) cause GC pressure and slow compaction; tombstone-heavy partitions cause read timeouts. The table design must be revisited every time a new access pattern emerges.

Leaderless replication: coordinator routes, not controls

Every Cassandra node can be a coordinator for any request — there is no master. The coordinator uses the consistent hashing ring (with virtual nodes) to identify the replica set for a partition key and routes the request. This means writes go directly to all N replicas in parallel (no replication chain). The upside is linear write scalability; the downside is conflicting writes on the same key arriving in different orders at different replicas — last-write-wins (LWW) by timestamp resolves this, but wall-clock skew between nodes can produce surprises.

LSM tree: write-optimized at a read cost

Writes land in an in-memory memtable and the commit log (sequential disk write). When the memtable fills it flushes to an immutable SSTable on disk. Reads must check the memtable + every SSTable that could contain the key, filtered by bloom filter and row-level index. Read amplification grows as SSTables accumulate between compaction runs. STCS (Size-Tiered Compaction) batches similar-sized SSTables but spikes disk usage (2x space during merge); LCS (Leveled) keeps amplification low but doubles write I/O. Choose compaction strategy based on read:write ratio, not habit.

Lightweight transactions (LWT): Paxos hidden inside

Cassandra supports compare-and-swap via Paxos (`IF NOT EXISTS`, `IF col = val`) — so-called Lightweight Transactions. They provide linearizable consistency for a single partition but at ~4x the latency of a regular write (4 round-trips: prepare, promise, propose, commit). LWTs are correct but expensive; using them for high-throughput paths (e.g., seat reservation) will saturate the cluster. Use them only when conditional writes are truly required, and model the data to minimize LWT-needing operations.

Repair: the unsexy operation that keeps data correct

Because Cassandra is eventually consistent, replicas drift over time — a node that was down misses writes and comes back with stale data. Anti-entropy repair (Merkle-tree comparison between replicas) identifies and syncs divergent ranges. If you don't run regular repairs, replicas diverge permanently. The rule of thumb: repair every node at least once per gc_grace_seconds window (default 10 days) or deleted data can resurrect (a tombstone pruned before a lagging replica sees the original write). Cassandra Reaper is the standard tool for this; skipping it is how teams wake up to ghost rows.

What breaks at scale

Hot partitions are the #1 scale failure — a celebrity user or a single high-cardinality time bucket receiving 10× the write rate of other partitions overloads two or three nodes while the rest sit idle. Tombstone floods on tables with high delete rates (TTL expiry, rolling windows) cause read timeouts that look like node failures but survive node restarts — the fix is aggressive compaction scheduling and shorter TTLs. Finally, schema changes are not free: adding a column or index across a large cluster propagates via gossip and can take minutes, during which nodes may disagree on schema version and refuse coordinator requests — always gate schema changes behind maintenance windows or feature flags.

In production

Apple runs 75,000+ Cassandra nodes for iCloud; Netflix uses it for viewing history and billing. The real engineering challenge isn't writes — it's modeling for reads: Cassandra has no secondary indexes worth trusting in production (they're per-node scans), no joins, and no ad-hoc queries. Every read access pattern must be a primary key lookup or a partition scan, so you design a table per query (denormalize aggressively). Tombstone accumulation on time-series data with rolling deletes is the most common production outage cause — reads scan tombstones before finding live data, and a partition with millions of tombstones can time out even with zero live rows.

Common mistakes

Modeling like a relational DB (joins/secondary indexes)
Ignoring partition-key design (hot partitions)
Expecting strong consistency by default

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →