System Design Library

Probabilistic Structures (Bloom / HLL)

Answer "have I seen this?" and "how many unique?" at huge scale with tiny memory.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Set membership (maybe)
Cardinality estimate
Mergeable

Non-functional

Tiny memory
Fast
Bounded error

Scale

Billions of items

The approach

Bloom filter: a bit array + k hashes gives "definitely not in set" or "probably in set" (no false negatives) in a few bits/item. HyperLogLog estimates distinct counts from hash-bit patterns in ~KB regardless of cardinality; both merge across shards.

Key components

In-memory bit/register arrays per shard; mergeable

Numbers that matter

A Bloom filter needs ~9.6 bits/element for a 1% false-positive rate, vs. ~64 bits for a minimal hash set — 6–7x space savings.
HyperLogLog achieves ~2% relative error with 1.5 KB of memory, regardless of whether you have 1 million or 1 billion distinct values.
Count-Min Sketch over-estimates frequencies by at most ε × N with probability 1-δ; setting ε=0.1%, δ=0.01% needs a ~20 KB table.
Cassandra uses Bloom filters per SSTable — at 10 SSTables per partition, they reduce disk I/O for missing-key reads from 10 seeks to ~1.01 on average.

Senior deep-dive

You trade a bounded false-positive rate for massive space savings — a Bloom filter uses ~10 bits/element for a 1% FPR vs. a hash set at 64+ bits/element.

HyperLogLog estimates distinct counts to within 2% error using a fixed ~1.5 KB of memory regardless of dataset size — the maths exploit hash uniformity, not sampling.

The trap is treating 'probably in set' as 'definitely in set' — Bloom filters have no false negatives but do have false positives, so they gate a more expensive lookup, not replace it.

Bloom filter mechanics: k hashes, one bit array

Insert hashes the element with k independent hash functions, each setting one bit. Lookup checks all k bits — if any is 0, definitely absent (no false negatives). If all are 1, probably present (false positive if those bits were set by other elements). Optimal k = (m/n) × ln(2) where m is bits and n is elements. In practice, use two hash functions and linearly combine them — computationally cheaper, nearly equivalent.

HyperLogLog: why hash leading zeros predict cardinality

HLL hashes each element and observes the maximum number of leading zeros across all hashes. A run of k leading zeros suggests the distinct count is ~2^k. Multiple registers (HLL uses 2^b buckets, b≈14 → 16,384 registers) reduce variance by averaging. Harmonic mean across registers corrects for range bias. The insight is that hash uniformity makes leading-zero distribution a reliable cardinality estimator — no raw data stored.

Count-Min Sketch: frequency estimation with sub-linear space

A 2D array of w×d counters with d independent hash functions — increment all d counters for each item, query by taking the minimum across all d rows (over-counts cancel, minimum gives the least-biased estimate). This is the structure behind heavy hitter detection in stream processing (finding top-k frequent items). Unlike Bloom filters, it works on weighted events, making it ideal for ad-click frequency or DDoS source throttling.

Where false positives cost you: the gate pattern

The correct pattern is Bloom → authoritative lookup: if the Bloom says 'absent', skip the disk/DB access; if 'present', do the real lookup to confirm. Systems that skip the confirmation step introduce phantom reads — showing users items that don't exist. Google Spanner uses this in its key existence checks; Cassandra is explicit that the Bloom is a hint, not an answer.

Scalable and Counting variants for real workloads

Classic Bloom filters are not deletable — clearing a bit could unset a bit shared by other elements. Counting Bloom filters replace each bit with a small counter, enabling deletes at 4x space cost. Scalable Bloom filters chain multiple filters, each with tighter FPR, so the combined FPR stays bounded as the set grows — useful when you can't know the final element count at build time.

What breaks at scale

Hash function quality breaks everything: correlated hashes inflate FPR beyond the theoretical bound — MurmurHash3 and xxHash are production-safe; MD5/SHA1 are overkill and slower. Shared Bloom filters across services become a coordination problem — when does it get rebuilt, who owns the rebuild, what happens during the rebuild window where FPR spikes? Distribute the filter as a versioned artifact, not a live shared structure.

In production

Apache Cassandra ships a per-SSTable Bloom filter that eliminates most disk seeks for non-existent rows — without it, a read for a missing key would require scanning every SSTable. Google Bigtable / LevelDB uses the same pattern. Redis HyperLogLog (the `PFADD`/`PFCOUNT` commands) is used by companies like Twitter and Cloudflare to count unique visitors per URL at billions-per-day scale without storing the raw set. The real challenge is parameter tuning: a Bloom filter sized for 1M elements becomes useless (FPR → 100%) if the set grows to 10M — you need either a dynamic/scalable variant or a rebuild pipeline.

Common mistakes

Expecting exactness (they're probabilistic)
Sizing the Bloom filter too small (false-positive blowup)
Forgetting they can't delete (basic Bloom)

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →