System Design Library

Log Aggregation (ELK)

Collect, index, and search logs from thousands of services in real time.

Requirements

Functional

  • Ship logs
  • Parse/enrich
  • Full-text + structured search
  • Retention/tiering

Non-functional

  • High ingest
  • Fast search
  • Cost-controlled retention

Scale

TBs/day

The approach

Agents ship logs → a buffer (Kafka) → processors parse/enrich → an inverted-index store (Elasticsearch) for search; hot/warm/cold tiers move old data to cheap storage; sampling/quotas control cost.
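
To make "inverted-index store" concrete, here is a toy sketch of the data structure behind full-text search: a map from each term to the set of documents containing it. The class name `InvertedIndex` and its methods are hypothetical and only illustrate the idea; Elasticsearch's real implementation (Lucene segments, analyzers, scoring) is far more involved.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy term -> document-id index (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.docs = {}                    # id -> original log line

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every term in the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A lookup touches only the posting sets for the query terms, which is why search stays fast as the corpus grows, and also why every indexed field adds storage.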

Key components

Agents → Kafka buffer → processors → search index (hot/warm/cold)

Senior deep-dive

Kafka as the buffer is the most important architectural decision: without it, backpressure from a slow or unavailable Elasticsearch cluster propagates straight to the log agents, whose queues fill until they drop events or run out of memory, and you lose data mid-incident.

Indexing everything is the fast path to bankruptcy: fully analyzed indices with replicas can cost up to ~10× the raw storage, so hot/warm/cold tiering with selective field indexing is mandatory at scale.

Search latency is dominated by segment count, not data volume: too many small segments (a common result of a misconfigured Logstash output, such as tiny bulk batches) make queries scan thousands of files.

Agent-side buffering: lose nothing under backpressure

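Filebeat tracks its read offset for every file and can be configured with a disk-backed queue, so if Kafka is slow it buffers locally rather than dropping lines. Backpressure propagation is the failure mode: if the agent blocks on a slow or unavailable Kafka cluster, it stops reading its input; when that input is a pipe or socket rather than a file, the app's log buffer fills and the app either blocks or drops logs. The fix is to decouple the two: the app logs asynchronously without waiting on the shipper, and the agent absorbs the backlog in a bounded in-memory ring buffer that spills to local disk instead of growing until it OOMs.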
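
The buffering behaviour described above can be sketched as a bounded in-memory queue that spills overflow to a local file. This is a minimal illustration, not how Filebeat is implemented: `SpoolingBuffer`, `spool_path`, and the `send` callback are all hypothetical names, and the sketch assumes log lines contain no embedded newlines and that a single thread uses the buffer.

```python
import collections


class SpoolingBuffer:
    """Bounded in-memory queue that spills overflow to a local spool file."""

    def __init__(self, spool_path, max_in_memory=1000):
        self.memory = collections.deque()
        self.max_in_memory = max_in_memory
        self.spool_path = spool_path
        self.spooled = 0  # number of lines currently parked on disk

    def append(self, line):
        # Never block the producer. Once anything is on disk, keep writing
        # to disk so that lines are delivered in their original order.
        if self.spooled == 0 and len(self.memory) < self.max_in_memory:
            self.memory.append(line)
            return
        with open(self.spool_path, "a", encoding="utf-8") as f:
            f.write(line.rstrip("\n") + "\n")
        self.spooled += 1

    def drain(self, send):
        """Send lines in order; stop at the first send() that returns False."""
        sent = 0
        while True:
            if not self.memory and self.spooled:
                self._reload_from_disk()
            if not self.memory:
                return sent
            if not send(self.memory[0]):
                return sent  # downstream is pushing back; keep the line
            self.memory.popleft()
            sent += 1

    def _reload_from_disk(self):
        # Pull the oldest spooled lines back into memory, keep the rest on disk.
        with open(self.spool_path, encoding="utf-8") as f:
            lines = [raw.rstrip("\n") for raw in f]
        head = lines[: self.max_in_memory]
        tail = lines[self.max_in_memory:]
        self.memory.extend(head)
        with open(self.spool_path, "w", encoding="utf-8") as f:
            f.writelines(item + "\n" for item in tail)
        self.spooled = len(tail)
```

Memory use is capped by `max_in_memory` regardless of how long the downstream outage lasts; the trade-off is that disk usage grows instead, so a real agent would also cap the spool size.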

Kafka partitioning: the throughput multiplier

Each Kafka partition is a sequential append-only log, and within a consumer group it is read by at most one Logstash/processor worker at a time, so more partitions means more parallelism into Elasticsearch. Partition by service or log stream, not by host, so a single chatty service is confined to its own partitions and doesn't starve others. Retention on the Kafka topic (e.g. 24–48h) is your disaster recovery window: if Elasticsearch falls over, replay from Kafka.
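
As a rough illustration of keying by service and stream rather than by host, the sketch below maps a key to a stable partition number. The function `partition_for` is hypothetical, and CRC32 stands in for the hash a real Kafka client would apply to the message key.

```python
import zlib


def partition_for(service: str, stream: str, num_partitions: int) -> int:
    """Map a (service, stream) pair to a stable partition number."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    key = f"{service}/{stream}".encode("utf-8")
    # CRC32 is deterministic across processes, unlike Python's built-in
    # hash(), which is randomized per interpreter run for strings.
    return zlib.crc32(key) % num_partitions
```

Because the host name is not part of the key, every replica of a service lands in the same partition, and ordering within one stream is preserved.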

Parsing and enrichment: do it once at ingest

Grok patterns in Logstash parse raw text into structured fields; doing this at query time (Loki's approach) shifts cost to reads, which is painful for incident response. Enrichment (adding geo-IP, Kubernetes pod metadata, service-owner tags) should happen in the pipeline, not the UI, because enrichment data changes over time: a lookup done at query time reflects today's state, not the state when the line was written, and re-enriching already indexed documents requires an expensive reindex.
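
A grok pattern is essentially a named-group regular expression. The sketch below shows the parse-then-enrich step in plain Python; the log line format, the `SERVICE_OWNERS` lookup table, and the `parse_failure` tag are assumptions made for illustration, not Logstash's actual behaviour.

```python
import re

# Assumed line format: "<timestamp> <LEVEL> <service> <free-text message>"
LINE_RE = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+) (?P<message>.*)"
)

# Hypothetical enrichment source; in practice this might come from a
# service catalog or Kubernetes metadata.
SERVICE_OWNERS = {"checkout": "payments-team"}


def parse_and_enrich(raw, owners=SERVICE_OWNERS):
    """Turn one raw line into a structured event with an owner tag."""
    match = LINE_RE.fullmatch(raw)
    if match is None:
        # Keep unparseable lines searchable instead of dropping them.
        return {"message": raw, "tags": ["parse_failure"]}
    event = match.groupdict()
    # Enrich at ingest so the event records the owner as of write time.
    event["owner"] = owners.get(event["service"], "unknown")
    return event
```

Tagging failures rather than discarding them makes pattern drift visible: a spike in `parse_failure` events usually means a service changed its log format.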

Indexing strategy: full-text vs structured-only

With default dynamic mapping, Elasticsearch indexes every string field as analyzed `text` plus a `keyword` sub-field, which builds massive term dictionaries for high-cardinality fields like user IDs or UUIDs; map these explicitly as `keyword` (exact match, not analyzed, with doc_values for sorting and aggregations). A common optimization is dynamic mapping templates that route numeric and date fields to typed mappings and free-text fields to `text` with aggressive `index_options: docs`, which skips term frequencies and positions if you only need presence, not ranking or phrase matching.
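
One way to express this mapping strategy is an index template body, built here as a Python dict so it can be serialized to JSON. The field names (`message`, `user_id`, `@timestamp`), the template name `strings_as_keyword`, and the `ignore_above` limit are illustrative assumptions; the mapping parameters themselves (`dynamic_templates`, `match_mapping_type`, `index_options`, `norms`, `ignore_above`) are standard Elasticsearch mapping options.

```python
import json


def build_log_mappings() -> dict:
    """Return a mappings body that indexes selectively (a sketch)."""
    return {
        "mappings": {
            "dynamic_templates": [
                {
                    # Any string field not listed below becomes an exact-match
                    # keyword instead of analyzed text.
                    "strings_as_keyword": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword", "ignore_above": 1024},
                    }
                }
            ],
            "properties": {
                "@timestamp": {"type": "date"},
                "user_id": {"type": "keyword"},
                # Free text: record only which documents contain a term.
                "message": {
                    "type": "text",
                    "index_options": "docs",
                    "norms": False,
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_log_mappings(), indent=2))
```

With `index_options` set to `docs`, phrase queries on `message` stop working because positions are not stored, so this only suits fields where term presence is enough.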

Tiering: hot/warm/cold/frozen

Hot nodes (NVMe SSD, high RAM) hold the last 7 days for <200ms search. Warm nodes (SATA SSD, less RAM) hold 7–30 days; ILM policies move indices between tiers automatically using shard allocation filtering. Cold and frozen nodes rely on searchable snapshots stored in S3; the frozen tier fetches and caches only the parts of an index that a query touches, so queries take seconds but storage cost drops by roughly 10×. The non-obvious trap: JVM heap pressure on nodes that host too many shards (Elastic's long-standing rule of thumb is to stay at or below 20 shards per GB of heap).
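
The tier boundaries and the shard budget above reduce to simple arithmetic. The helper names below are hypothetical, and the thresholds are the ones quoted in this section (7 days, 30 days, 20 shards per GB of heap), not values read from a cluster.

```python
def tier_for_age(age_days: float) -> str:
    """Pick a storage tier from the age of an index in days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 7:
        return "hot"
    if age_days < 30:
        return "warm"
    return "cold"


def shard_budget(heap_gb: float, shards_per_gb: int = 20) -> int:
    """Maximum shard count for a node under the shards-per-GB rule of thumb."""
    return int(heap_gb * shards_per_gb)


def over_budget(shard_count: int, heap_gb: float) -> bool:
    """True when a node hosts more shards than its heap budget allows."""
    return shard_count > shard_budget(heap_gb)
```

For example, a node with a 30 GB heap has a budget of 600 shards, so daily indices with many small shards can exhaust it long before the disks fill up.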

What breaks at scale

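Split-brain in multi-node Elasticsearch (two masters elected during a network partition) corrupts the cluster state. On Elasticsearch 6.x and earlier, setting `discovery.zen.minimum_master_nodes` to `(N/2)+1` (integer division, where N is the number of master-eligible nodes) prevents it; 7.0 and later manage the voting quorum automatically. Either way, use an odd number of master-eligible nodes (at least three), because an even count adds no extra fault tolerance. Index bloat from runaway cardinality (a developer adding a `user_query` field with raw search strings) causes shard sizes to explode overnight. GC pauses on the JVM at high heap utilization (>75%) cause nodes to drop out of the cluster transiently, triggering shard rebalancing storms that make things worse.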
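
The quorum formula `(N/2)+1` uses integer division. A short sketch (function names are hypothetical) shows why an even-sized master-eligible set buys nothing over the next smaller odd size:

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while keeping a quorum."""
    return master_eligible - quorum(master_eligible)
```

Three nodes tolerate one failure and so do four; the fourth node only raises the quorum from 2 to 3. With two nodes the quorum is 2, so losing either one halts master election.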

In production

The ELK stack (Elasticsearch + Logstash + Kibana) is the canonical pattern, but Logstash's JVM overhead led many large deployments to replace it with Fluentd or Filebeat for agent-side shipping. Kafka sits between agents and Elasticsearch to absorb spikes; this is a battle-tested pattern and a sensible default beyond roughly 10GB/day. Loki (Grafana) inverts the model: store raw compressed log chunks, index only labels, and push the query cost to read time; this is roughly an order of magnitude cheaper to store but slow for ad-hoc field search. The real challenge at scale is index lifecycle management: automatically rolling over hot indices to warm (fewer replicas, cheaper disks) and then to cold (S3) without dropping in-flight queries.
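
The hot → warm → cold progression is typically declared as an ILM policy. The sketch below builds one as a Python dict; the function name, the repository name passed in, and every threshold (`1d`, `50gb`, `7d`, `30d`, `90d`) are assumptions chosen for illustration, while the action names (`rollover`, `forcemerge`, `searchable_snapshot`, `delete`) are standard ILM actions.

```python
import json


def build_ilm_policy(snapshot_repo: str) -> dict:
    """Return an ILM policy body with hot, warm, cold, and delete phases."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index daily or when a primary shard
                        # reaches 50 GB, whichever comes first.
                        "rollover": {
                            "max_age": "1d",
                            "max_primary_shard_size": "50gb",
                        }
                    }
                },
                "warm": {
                    "min_age": "7d",
                    # Fewer segments means fewer files for each query to scan.
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "cold": {
                    "min_age": "30d",
                    "actions": {
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repo
                        }
                    },
                },
                "delete": {"min_age": "90d", "actions": {"delete": {}}},
            }
        }
    }


if __name__ == "__main__":
    # "logs-s3-repo" is a hypothetical snapshot repository name.
    print(json.dumps(build_ilm_policy("logs-s3-repo"), indent=2))
```

Each `min_age` is measured from rollover, not from index creation, so an index that stays hot longer than expected shifts all later phases with it.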

Part of System Design Library on SystemLore, a system design interview prep site with 148 deep topics, interactive diagrams, and a practice game.