System Design Library

Distributed Log (Kafka)

A durable, partitioned, replicated append-only log delivering high-throughput streams.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Append to topic partitions
Ordered per partition
Consumer groups & offsets
Replication

Non-functional

Very high throughput
Durable
Replayable

Scale

Trillions of messages/day

The approach

Topics split into partitions (the unit of parallelism & ordering); each partition is an append-only log replicated to followers with a leader; consumers track offsets and form groups; sequential disk IO + zero-copy for throughput.

Key components

Producers → partitioned leaders (+ replicas) → consumer groups · controller (Raft/ZK)

Numbers that matter

A single Kafka broker sustains ~700 MB/s write throughput on spinning disks via sequential I/O and zero-copy sendfile — orders of magnitude above random-write workloads.
Replication lag under normal operation is <10 ms end-to-end; an ISR (in-sync replica) falling more than replica.lag.time.max.ms (default 30 s) behind is evicted from the ISR set.
A partition can hold trillions of messages — retention is purely time- or size-based, not record-count-based; message log segments are typically 1 GB before rolling.
End-to-end latency (producer → consumer) with `acks=1` and no batching is ~2–5 ms; with `acks=all` and ISR size 3, add ~5–15 ms for two round-trips to followers.

Senior deep-dive

The partition is the unit of everything — ordering, parallelism, throughput, and replication all live at the partition level, not the topic.

Sequential disk writes are the performance secret: Kafka never updates in place — it appends to a log segment and relies on the OS page cache, getting disk throughput close to RAM throughput on modern hardware.

Consumer offsets being consumer-owned (not broker-owned) is what makes replay, reprocessing, and independent consumer groups essentially free.

Partition = ordering guarantee, not topic

Kafka guarantees total order within a partition, nothing across partitions. This is the first thing interviewers gloss over. If you need global ordering (e.g., all events for one user ID in sequence), key-based partitioning (hash(key) % N) is mandatory — and you must never change partition count after the fact without a migration plan. Hot partitions (one user generating 80% of traffic) are a real production failure mode: custom partition strategies or salted keys are the fix.

Leader election and ISR: the durability contract

Every partition has one leader and N-1 followers in the ISR (in-sync replica) set. Producers writing `acks=all` wait for all ISR members to acknowledge — this is the actual durability guarantee. If a follower lags, it's removed from the ISR; the producer then only waits for the remaining ISR. Unclean leader election (allowing an out-of-ISR replica to become leader) trades durability for availability — a footgun you must explicitly opt into via `unclean.leader.election.enable=true`.

Zero-copy and page cache: why Kafka is fast

Kafka doesn't copy data into application memory on read. The OS `sendfile` syscall transfers bytes directly from page cache to the network socket without a kernel↔user-space round-trip. This means the JVM heap barely matters for throughput — the real memory that matters is OS page cache. In practice: size broker RAM so the hot (recent) partitions fit in page cache and consumers reading tail-of-log never touch disk.

Consumer groups: the pull model's hidden power

Consumers pull at their own pace and commit offsets themselves — to `__consumer_offsets` or externally (e.g., in a DB alongside the processed result for exactly-once semantics). This decouples producer throughput from consumer processing speed and makes replay free: reset an offset to reprocess from any point. The catch is rebalance storms — adding/removing consumers triggers a group-wide rebalance (stop-the-world for the group) which stalls processing for seconds; incremental cooperative rebalancing (introduced in 2.4) assigns only the moved partitions, avoiding full stops.

Exactly-once semantics: harder than it looks

Kafka's idempotent producer (enable.idempotence=true) deduplicates retries within a session using a sequence number per partition-producer pair. Transactional producers extend this to atomic multi-partition writes — useful for Kafka Streams exactly-once. But EOS only covers Kafka-to-Kafka; if you're writing to an external DB, you need your own idempotency (e.g., write the offset alongside the DB row in one transaction). At-least-once is far simpler and sufficient for most analytics workloads — pick EOS deliberately, not by default.

What breaks at scale

Partition count explosion is the silent killer — thousands of topics × many partitions × 3 replicas = millions of file handles and massive rebalance overhead. KRaft (ZooKeeper replacement) improves metadata throughput but doesn't eliminate the partition count problem. Consumer lag compounding is the second failure mode: a slow consumer falls behind, the data it needs ages out of page cache, reads hit disk, the consumer slows further — a death spiral requiring partition reassignment or lag alerting with auto-scaling. Finally, log compaction races with consumers: if a compaction run deletes an offset a consumer hasn't read yet, that consumer gets an `OffsetOutOfRange` error and must reset — a production surprise on key-compacted topics with slow consumers.

In production

Kafka was built at LinkedIn to replace point-to-point pipelines and now underpins streaming at Uber, Netflix, Confluent, and most large-scale event-driven architectures. The real engineering challenge is partition count tuning: too few partitions caps throughput and parallelism; too many inflates ZooKeeper/KRaft metadata, increases rebalance time, and multiplies file handles across the cluster. Compacted topics (Kafka's lesser-known feature) let it replace a changelog store — only the latest value per key is retained — which is how Kafka Connect sink connectors and Kafka Streams state stores do their change-data-capture and local state without a separate database.

Common mistakes

Expecting global ordering across partitions
Too few partitions (limits parallelism)
Auto-committing offsets before processing (loss)

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →