Storage Engine (LSM vs B-Tree)
Design the on-disk structure a database uses to store and retrieve data.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Point + range reads
- Durable writes
- Crash recovery
- Compaction/maintenance
Non-functional
- Balance read/write/space amplification
Scale
TB per node
The approach
B-tree: in-place updates, great reads, used by most SQL DBs (Postgres/MySQL). LSM-tree: buffer writes in memory → flush sorted SSTables → compact in background; write-optimized, used by Cassandra/RocksDB. A write-ahead log gives durability/recovery for both.
Key components
WAL · memtable + SSTables (LSM) or page tree (B-tree) · compaction
Numbers that matter
- B-tree write amplification is typically 2–5× (write-ahead log + page write); LSM-tree write amplification is 10–30× depending on level multiplier and compaction strategy.
- RocksDB memtable flush threshold defaults to 64MB — once full, it flushes to an immutable L0 SSTable; L0 compaction to L1 triggers at 4 files by default.
- A B-tree page is typically 4KB–16KB; updating a single byte still rewrites the full page — on a 500GB DB this means random write IOPS are the bottleneck, not bandwidth.
- LSM-tree read amplification in the worst case is O(number of levels) — RocksDB with 7 levels may check a bloom filter at each level before finding a key, adding ~7 I/Os for a point lookup on a cold cache.
Senior deep-dive
B-tree owns reads; LSM-tree owns writes — the choice is determined by your write amplification budget and whether you can absorb compaction I/O.
Write amplification is the LSM-tree's hidden tax: a single user write can cause 10–30 disk writes across memtable flush and multiple compaction levels — acceptable for SSDs, punishing for spinning disks.
B-tree's random write problem (updating an in-place page causes random I/O) disappears on NVMe but is real on HDDs — which is why RocksDB/Cassandra's LSM dominates cloud workloads where sequential I/O is cheap.
B-tree: in-place update with write-ahead log
Every write to a B-tree updates the leaf page in place (and parent pages if splits occur) after writing to the WAL. The WAL is sequential, so it's fast; the actual page write is random I/O. On modern NVMe SSDs with 500k+ random IOPS, B-tree random write cost nearly vanishes — which is why Postgres performs well on NVMe. On spinning HDDs at 100 random IOPS, a B-tree at high write throughput saturates the drive. The page size vs working set ratio determines cache hit rate: a 16KB page with 80% cache hit means only 20% of reads hit disk.
LSM-tree: sequential writes at the cost of compaction
Writes first go to an in-memory memtable (a skip list or red-black tree). When the memtable fills, it's flushed as an immutable SSTable on disk. SSTables accumulate at L0; background compaction merges and sorts them into larger L1, L2... files. All disk writes are sequential — ideal for HDDs and efficient for SSDs (avoids write amplification at the flash layer). The cost is read amplification: a point lookup may need to check multiple SSTables at multiple levels if the key is not in the bloom filter cache.
Bloom filters: the read path optimization that matters
Without bloom filters, an LSM-tree point lookup for a missing key requires reading every SSTable at every level — O(levels) disk reads. A per-SSTable bloom filter answers "definitely not here" in microseconds, reducing average-case read amplification to near 1 for point lookups on existing keys. The filter false positive rate is configurable: 0.1% FPR requires ~10 bits/key; 1% requires ~7 bits/key. For a 1B key dataset at 10 bits/key that's ~1.25GB of filter data — fits in RAM on most servers. Forgetting to load bloom filters into memory on startup is a common ops mistake that causes 10–100× read latency until they warm.
Compaction strategies: level vs tiered
Leveled compaction (LevelDB/RocksDB default) keeps each level at a fixed size multiplier; a key exists in at most one SSTable per level, giving O(1) point lookups per level. Write amplification is high (~10–30×). Tiered/size-tiered compaction (Cassandra default) accumulates SSTables of similar size and merges them — lower write amplification (~3–10×) but more SSTables per level, higher read amplification. FIFO compaction is just for time-series data where oldest SSTables are deleted on TTL. The right choice: leveled for read-heavy with occasional writes, tiered for write-heavy append workloads.
Write stalls: the LSM operational hazard
When L0 SSTable count exceeds the stall threshold (default 20 in RocksDB), RocksDB deliberately slows down incoming writes to let compaction catch up — this is a write stall. When L0 hits the stop threshold (default 36), writes halt completely until compaction drains L0. Write stalls appear as sudden P99 spikes from <5ms to >100ms with no obvious cause. The mitigations are: increase compaction threads, tune the stall thresholds, ensure background compaction has dedicated I/O bandwidth, and use rate limiters on ingestion to never outpace compaction sustainably.
What breaks at scale
Space amplification is the LSM-tree's slow failure: until compaction runs, old versions of keys coexist with new versions in SSTables across levels. In the worst case, a database storing 100GB of live data may occupy 200–300GB on disk during active compaction. Tombstone accumulation for deleted keys also bloats disk — a deleted key's tombstone must survive until all SSTables with older versions of that key are compacted away, which in a leveled scheme can take hours. At scale, misconfigured compaction that can't keep up with deletes causes monotonically growing disk usage until the disk fills and the node dies.
In production
PostgreSQL and MySQL InnoDB use B+ trees (leaf nodes linked as a doubly-linked list for range scans) with a write-ahead log for crash recovery. Cassandra, RocksDB, and LevelDB use LSM-trees because their workloads are write-heavy and compaction can run during off-peak hours. The real challenge with LSM-trees is compaction scheduling: in RocksDB, compaction is background I/O that competes with foreground writes — a poorly tuned compaction can consume all disk bandwidth and stall writes for seconds, which is called a write stall, and it shows up as a sudden latency spike in your P99 dashboards. Pebble (CockroachDB's storage engine) was written specifically to have more predictable compaction behavior than RocksDB.
Common mistakes
- Assuming one structure fits all workloads
- Ignoring write/space amplification
- No WAL (no crash recovery)