Distributed File System (GFS/HDFS)
Store huge files across commodity machines, optimized for large sequential reads.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Large files split into chunks
- Replicated chunks
- Append-heavy
- High throughput
Non-functional
- Fault-tolerant on commodity HW
- Throughput > latency
Scale
PB; huge files
The approach
A master holds metadata (file→chunk mapping); chunkservers store fixed-size chunks (e.g. 64MB) replicated 3×; clients talk to the master once for location, then stream directly from chunkservers; master is kept lean.
Key components
Master (metadata) + chunkservers (data) · client streams from chunkservers
Numbers that matter
- GFS chunk size is 64 MB — chosen to reduce master metadata overhead and keep clients connected to the same chunkserver for many operations, amortizing TCP connection cost over large sequential reads.
- Each chunk is replicated to 3 chunkservers by default (cross-rack for failure independence); the master stores roughly 64 bytes of metadata per chunk — a 1 PB filesystem with 64 MB chunks needs ~1 GB of master RAM.
- In the original Google paper (2003), the cluster sustained ~750 MB/s read aggregate across ~100 clients on a single cluster — the bottleneck was network bisection bandwidth, not disk.
- Master reelection after failure (with a shadow master taking over) takes on the order of tens of seconds — during which metadata operations stall, though reads from cached chunk locations continue.
Senior deep-dive
The master is a deliberate single point of control, not a bottleneck: metadata fits in RAM (~100 bytes per chunk × billions of chunks = tens of GBs), and clients cache chunk locations, so the master is rarely on the hot path.
Large sequential reads (not random access) are the design center: 64 MB chunks amortize metadata overhead and enable streaming throughput — GFS was designed for MapReduce batch jobs, not OLTP.
Consistency is relaxed by design: GFS allows undefined (inconsistent) regions after concurrent writes and relies on applications (like MapReduce) to be tolerant — atomic record appends are the one consistency primitive GFS guarantees.
Metadata in RAM: why the master doesn't bottleneck reads
The GFS master holds all file-to-chunk mappings and chunk-to-chunkserver locations in memory. Because chunk size is 64 MB, even a 1 PB cluster has only ~16 million chunks — fitting comfortably in RAM. Clients cache chunk locations after the first master lookup, then communicate directly with chunkservers for all I/O. The master is only consulted on a cache miss or after a lease expires. This design means the master's bottleneck is metadata mutation rate (file creates, renames, chunk allocations), not read throughput — which aligns perfectly with a batch-read-heavy MapReduce workload.
Chunk leases: controlling write authority without coordination
For each chunk being written, the master grants a lease to one chunkserver (the primary). The primary decides the mutation order for concurrent writes and forwards to secondaries. Leases expire after 60 seconds and must be renewed — if the primary dies, the master simply waits for the lease to expire before granting a new one. This avoids split-brain writes without requiring a distributed consensus protocol on every write. The downside: a primary that is slow (not dead) can hold a lease and serialize writes through a degraded node for up to 60 seconds.
Atomic record append: the only strong consistency primitive
GFS's record append (append-at-least-once to a chunk, atomically, at GFS-chosen offset) is the primitive MapReduce relies on for multiple producers writing to the same output file. GFS guarantees the appended bytes appear contiguously in the file, though the same record may appear multiple times if a primary acknowledges after secondaries have appended but before the client receives the ack. Applications must either deduplicate by record checksum or tolerate duplicates — GFS does not handle this. This is a deliberate design choice: simpler system, more complexity pushed to the application.
Replication pipeline: network-efficient large writes
When a client writes a chunk, it pushes data pipelined across a chain of chunkservers (client → CS1 → CS2 → CS3) rather than fanout from the client. This uses each machine's full outbound bandwidth — the client saturates the first link, CS1 saturates the second, etc. The result is near-linear network utilization for large writes. The master's role is only to designate the primary; the client drives the pipeline from the chunk location it was given. Failure mid-pipeline causes a retry, which may result in a partially written chunk — the consistency model accounts for this with the defined/undefined distinction.
Shadow masters and recovery: availability despite single master
GFS has only one active master, but shadow masters replay the operation log and serve read-only metadata queries (with slight staleness) during master failure. The operation log (write-ahead log of all metadata mutations) is replicated to multiple machines — the master does not acknowledge a metadata change until the log entry is durable on several machines. After failure, a shadow promotes itself by replaying the log. Chunk-to-chunkserver mapping is not persisted by the master — it's rebuilt from chunkserver heartbeats on startup (each CS reports what chunks it holds). This simplifies recovery at the cost of a startup delay proportional to cluster size.
What breaks at scale
Billions of small files destroy the metadata model — each file needs at least one chunk, and a billion 1 KB files use as much master RAM as a billion 64 MB files while providing a tiny fraction of the storage. This is why Hadoop clusters with many small files run out of NameNode heap long before disk. Single master throughput (GFS's acknowledged limitation) caps metadata mutation rate — at Google scale, they moved to Colossus (GFS2) with a distributed metadata layer. Finally, chunkserver failure during a write leaves chunks with inconsistent replica counts — the master's background re-replication must finish before another failure, or a chunk can become under-replicated to zero (data loss). In a large cluster, this background work competes with foreground I/O for disk bandwidth.
In production
GFS was the inspiration for HDFS (Apache Hadoop), which powers data lakes at Facebook (Hive), Yahoo, and Cloudera/Databricks. The real engineering challenge is the master's single-threaded namespace lock: GFS serializes all metadata mutations through the master, which becomes a bottleneck at millions of files (not chunks — files with many small chunks are expensive). HDFS addressed this with HDFS Federation (multiple namenodes partitioning the namespace) and HDFS HA (standby namenodes with shared edit logs in ZooKeeper/QJM). The lesson from GFS → HDFS evolution: a RAM-sized metadata tier works until you have billions of small files, at which point the metadata model itself must be redesigned.
Common mistakes
- Routing data through the master
- Small chunks for a sequential workload
- Single master with no shadow/failover