Sorted Store (BigTable)
A sorted, sparse, distributed map from (row, column, time) → value at petabyte scale.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Sorted by row key
- Sparse wide columns
- Range scans
- Versioned cells
Non-functional
- Petabyte scale
- Fast range reads
Scale
PB; billions of rows
The approach
Rows sorted lexicographically and split into tablets (ranges) served by tablet servers; LSM storage (memtable + SSTables) on a distributed file system; a master assigns tablets; clients cache tablet locations.
Key components
Master (tablet assignment) → tablet servers · SSTables on GFS · Chubby for coordination
Numbers that matter
- Google's production Bigtable clusters (as of the original 2006 paper) served ~1 million tablets per cluster, with each tablet ranging from ~100 MB to ~200 MB before splitting.
- Row key lookups using the three-level METADATA tablet location hierarchy involve at most 3 network round-trips (root tablet → METADATA tablet → user tablet server) for a cold cache; subsequent requests hit the cached tablet server directly.
- Bigtable's SSTable format achieves compression ratios of ~10:1 on web data using block-level LZ compression (64 KB blocks), because web content (HTML, URLs) compresses extremely well and similar-row data clusters by key.
- A single Bigtable tablet server handles ~1,000 tablets and sustains ~10,000–50,000 read/write QPS per server in balanced read/write workloads; servers typically have ~4 GB RAM for memtables and block caches.
Senior deep-dive
Lexicographic row key order is the killer feature: rows are sorted, so a well-designed key (e.g., `user#timestamp`) gives you efficient range scans for time-series or prefix lookups without a secondary index.
Tablets are the unit of everything: serving, splitting, load balancing, and failure recovery all operate at the tablet (contiguous row range) level — the master assigns tablets to tablet servers, and clients cache that assignment.
The tablet server is a near-stateless worker: tablets are stored on GFS/Colossus, not locally — a failed tablet server loses nothing; the master reassigns its tablets to other servers in seconds.
Row key as the only index: design it right or redesign everything
Bigtable's single sorted index (the row key) means every query must be expressible as a point lookup or prefix/range scan on that key. The canonical Bigtable anti-pattern is using a monotonically increasing key (auto-increment, timestamp) — this sends all writes to the tablet currently covering the high end of the keyspace, creating a hotspot. The fix is a salted prefix (hash prefix + natural key) or a deliberately reversed key. Column families are your second modeling tool: group columns accessed together into the same family (collocated in one SSTable), separate rarely-accessed large BLOBs into a separate family to avoid reading them on every row fetch.
Tablet lifecycle: split, merge, reassign
A tablet grows until it hits the configured size (~200 MB), then the tablet server splits it into two tablets and notifies the master. The master then updates the METADATA table (where tablet locations are stored) — this is one of the few master-mediated operations. Merge (combining two undersized tablets) is master-driven. A tablet server failure does not lose data because tablets are stored on GFS: the master detects the failure via Chubby session expiry (~tens of seconds), then reassigns the dead server's tablets to live servers, which simply load the tablet's SSTable from GFS. Recovery time is proportional to the number of memtable entries needing a log replay.
Memtable + SSTable: the read/write path in detail
Writes go to a write-ahead log (on GFS) and a sorted in-memory memtable. When the memtable fills (~few MB), it's frozen, written to GFS as an immutable SSTable, and a new memtable starts. Reads merge views across the memtable and all on-disk SSTables for a row — bloom filters per SSTable avoid reading SSTables that don't contain the key. Periodic minor compactions merge small SSTables; major compactions eliminate all tombstones and produce one clean SSTable per tablet. Without regular compaction, read amplification grows and deleted data continues consuming disk.
Chubby dependency: the hidden single point of failure
Bigtable uses Chubby (Google's distributed lock service) for: master election, tablet server liveness detection, bootstrapping the root tablet location, and ACL storage. This means Bigtable is unavailable if Chubby is unavailable — not just degraded, but completely unable to handle metadata operations. Chubby itself is a Paxos-replicated state machine designed for high availability, but this hard dependency is the architectural Achilles heel. HBase replaced this with ZooKeeper; Cassandra eliminated the dependency entirely via gossip and leaderless design, accepting weaker consistency in exchange.
Column families and garbage collection: the time dimension
Each cell in Bigtable is versioned — every (row, column) pair can hold multiple timestamped versions. Bigtable's garbage collection policy per column family specifies: keep the last N versions, or keep versions newer than T seconds. This is enforced lazily during compaction, not at write time — old versions accumulate until a compaction passes through. This makes Bigtable a natural time-series store for append-heavy workloads (sensor readings, log entries), but means a table with many versions and infrequent compaction will have high read amplification and surprising storage usage.
What breaks at scale
Tablet hotspots from bad key design are the #1 failure — one tablet server at 100% CPU while the rest sit idle, and no amount of horizontal scaling fixes it without a key redesign. Master overload during large-scale tablet reassignment (e.g., after a multi-server failure) can stall the cluster — the master is single-threaded for metadata operations in the original design. Compaction debt is the second failure mode: if write rate exceeds compaction throughput, SSTables accumulate, read latency spikes, and disk fills — production clusters need explicit compaction scheduling and SSTable count alerting. Finally, Chubby unavailability (even brief) causes tablet servers to lose their locks and self-terminate, cascading into a full cluster restart.
In production
Bigtable directly inspired HBase (the Hadoop ecosystem equivalent) and is the backing store for Google Search's web index, Google Analytics, Gmail, and Google Earth. The real engineering challenge is row key design: because Bigtable has no secondary indexes (HBase adds them with limited scope), every access pattern must be derivable from a primary key prefix scan or point lookup. A row key like `user:timestamp` supports per-user time-range scans; reversing the timestamp (`user:MAX_LONG-timestamp`) gives you newest-first without sorting. Hotspotting — all traffic landing on a single tablet because keys are sequential (e.g., monotonically increasing IDs) — is the most common production failure, and key salting or MD5 prefixing is the standard mitigation.
Common mistakes
- Random/monotonic row keys (hot tablets)
- Treating it as relational
- Ignoring tablet-location caching