Batch Processing (MapReduce/Spark)
Process petabytes across a cluster with automatic parallelism and fault tolerance.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Map/reduce or DAG of stages
- Data-parallel across nodes
- Fault tolerance
- Shuffle
Non-functional
- Scales to PB
- Survives worker failure
Scale
Thousands of nodes
The approach
Split input across mappers (move compute to data); shuffle/partition intermediate results by key to reducers; failed tasks re-run from lineage (Spark) or re-execute (MapReduce); a scheduler manages the DAG.
Key components
Scheduler → mappers (on data) → shuffle → reducers · lineage for recovery
Numbers that matter
- Hadoop HDFS block size is 128 MB by default; a 1 TB dataset maps to ~8,000 map tasks at that granularity.
- Spark can be 10–100x faster than MapReduce for iterative workloads (ML, graph) because RDDs keep hot data in memory across stages.
- Shuffle data can be 3–10x the input size for wide transformations like groupBy on skewed keys — this is the top cause of job OOM failures.
- Speculative execution re-launches the slowest 5% of tasks on idle nodes; stragglers routinely add 50–200% to job wall-time without it.
Senior deep-dive
Data locality is the core insight — moving compute to data is orders of magnitude cheaper than moving petabytes of data to compute nodes.
The shuffle is the only hard part: partitioning, sorting, and transferring intermediate data across the network is where most failures and slowdowns happen; Spark's in-memory shuffle cuts I/O but raises memory pressure.
Fault tolerance is free if you design for it — tasks are pure functions on immutable splits, so re-running a failed task from lineage costs almost nothing.
Splits and data locality: why you never move data
Each input split (typically one HDFS block) is scheduled on the node that holds a replica, so the mapper reads from local disk at ~500 MB/s instead of over the network at ~100 MB/s. YARN's resource scheduler ranks candidate nodes by rack proximity when the local node is busy. If you ignore locality — e.g., by using S3 as the input source — you trade away this advantage entirely and the network becomes the bottleneck.
The shuffle: where jobs go to die
After map, each output record is hash-partitioned by key and written to a local spill file; reducers pull their partition from all mapper nodes simultaneously. Shuffle bandwidth saturates the cluster switch during the merge phase — this is why Hadoop clusters over-provision 10GbE. In Spark, wide operations (shuffleMapStage → ResultStage) are the only place a stage boundary is forced; minimize them by co-locating joins on pre-partitioned tables.
Fault tolerance through lineage, not replication
MapReduce checkpoints to HDFS after every stage — expensive but trivially recoverable. Spark's RDD lineage graph lets a lost partition be recomputed from its parents without checkpointing, but long lineage chains (deep pipelines) mean a single node failure can trigger cascading recomputation. Checkpoint RDDs after expensive shuffles (iterative ML loops) to cap recovery cost.
Combiner / mini-reducer: the free win nobody uses enough
A combiner runs the reducer logic locally after the map phase, dramatically cutting shuffle bytes for aggregation jobs (word count, sum). The catch: combiners are only safe for commutative, associative operations — using them on an average (sum/count must stay separate) or a join is a correctness bug. In Spark, `reduceByKey` applies the same optimization automatically; `groupByKey` does not.
Skew: the 80/20 killer
When one key has 10M records and the average key has 100, the reducer handling it runs 100x slower and the whole job waits. Production fixes: salting (append a random suffix, aggregate in two passes), broadcast joins (ship the small table to every mapper so no shuffle happens), or adaptive query execution (Spark 3.x repartitions skewed partitions at runtime). Skew is always a data problem, rarely a framework problem.
What breaks at scale
Driver OOM is the most common Spark production failure: `collect()` or `count()` pulling results to the driver node exhausts heap on billion-row datasets — use distributed sinks. Shuffle service reliability breaks under 10,000+ concurrent tasks writing spill files; external shuffle service decouples this from executor lifetime. Speculative execution re-runs can double the load on an already-stressed cluster during peak hours, causing cascading slowdowns — tune the threshold carefully.
In production
Google's original MapReduce paper described jobs on 20,000+ commodity machines reprocessing petabytes of web crawl data per day. Apache Spark (used by LinkedIn, Alibaba, Netflix) replaced many Hadoop MR pipelines with DAG-based execution and in-memory caching, cutting iteration time from hours to minutes. The real engineering challenge is data skew: a single reducer inheriting all keys for a popular partition (e.g., a viral user's events) serializes the entire job — production shops add salting, pre-aggregation, or custom partitioners to break up hot keys.
Common mistakes
- Shipping data to compute (network bound)
- Skewed keys (one reducer overloaded)
- Ignoring shuffle as the bottleneck