System Design Library

Log Aggregation (ELK)

Collect, index and search logs from thousands of services in realtime.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Ship logs
Parse/enrich
Full-text + structured search
Retention/tiering

Non-functional

High ingest
Fast search
Cost-controlled retention

Scale

TBs/day

The approach

Agents ship logs → a buffer (Kafka) → processors parse/enrich → an inverted-index store (Elasticsearch) for search; hot/warm/cold tiers move old data to cheap storage; sampling/quotas control cost.

Key components

Agents → Kafka buffer → processors → search index (hot/warm/cold)

Numbers that matter

Elasticsearch's inverted index inflates raw log volume by 5–10×: a 1TB/day log stream needs 5–10TB of index storage before replicas.
Filebeat can ship ~50,000 log lines/sec per agent before hitting CPU limits; at that rate a 1000-service fleet generates ~50M events/sec needing aggressive Kafka partitioning.
Segment merging in Elasticsearch is tuned so a shard stays under 50GB — above that, merge amplification and JVM GC pauses make query latency unpredictable.
Cold-tier storage on S3-backed Elasticsearch (frozen indices) reduces storage cost by 80–90% vs hot SSD nodes, at the cost of 10–30s query latency instead of <1s.

Senior deep-dive

Kafka as the buffer is the most important architectural decision — without it, a downstream Elasticsearch backpressure event causes log agents to OOM and you lose data mid-incident.

Indexing everything is the fast path to bankruptcy: full-text indexing costs ~10× the raw storage, so hot/warm/cold tiering with selective field indexing is mandatory at scale.

Search latency is dominated by segment count, not data volume — too many small segments (a common Logstash misconfiguration) makes queries scan thousands of files.

Agent-side buffering: lose nothing under backpressure

Filebeat uses a disk-backed spool so if Kafka is slow it buffers to local disk rather than dropping lines. Backpressure propagation is the failure mode: if the agent blocks on a full Kafka topic, it stops reading from the log file, and the app's log buffer fills, causing the app to either block or drop logs. The fix is async, fire-and-forget shipping with local disk spooling and an OOM-safe ring buffer.

Kafka partitioning: the throughput multiplier

Each Kafka partition is a sequential append log consumed by one Logstash/processor thread — more partitions means more parallelism into Elasticsearch. Partition by service or log stream, not by host, so a single chatty service doesn't starve others. Retention on the Kafka topic (e.g. 24–48h) is your disaster recovery window: if Elasticsearch falls over, replay from Kafka.

Parsing and enrichment: do it once at ingest

Grok patterns in Logstash parse raw text into structured fields; doing this at query time (Loki's approach) shifts cost to reads, which is painful for incident response. Enrichment — adding geo-IP, Kubernetes pod metadata, service-owner tags — should happen in the pipeline, not the UI, because enrichment data mutates over time and you can't retroactively re-enrich indexed documents.

Indexing strategy: full-text vs structured-only

Indexing every field as full-text (the Elasticsearch default) creates massive term dictionaries for high-cardinality fields like user IDs or UUIDs — these should be `keyword` type with doc_values only, not analyzed. A common optimization is dynamic mapping templates that route numeric and date fields to typed mappings and free-text fields to `text` with aggressive `index_options: docs` (skip term frequencies and positions if you only need presence, not ranking).

Tiering: hot/warm/cold/frozen

Hot nodes (NVMe SSD, high RAM) hold the last 7 days for <200ms search. Warm nodes (SATA SSD, less RAM) hold 7–30 days; shard allocation filtering moves indices automatically via ILM policies. Cold/frozen nodes serve indices directly from S3 via searchable snapshots, loading only queried segments — queries take seconds but storage cost drops 10×. The non-obvious trap: JVM heap pressure on nodes that host too many shards (>20 shards per GB of heap is the Elastic rule of thumb).

What breaks at scale

Split-brain in multi-node Elasticsearch (two masters elected due to network partition) corrupts the cluster state — `minimum_master_nodes = (N/2)+1` prevents it but requires an odd-sized master-eligible set. Index bloat from runaway cardinality (a developer adding a `user_query` field with raw search strings) causes shard sizes to explode overnight. GC pauses on the JVM at high heap utilization (>75%) cause nodes to drop out of the cluster transiently, triggering shard rebalancing storms that make things worse.

In production

The ELK stack (Elasticsearch + Logstash + Kibana) is the canonical pattern, but Logstash's JVM overhead led most large deployments to replace it with Fluentd or Filebeat for agent-side shipping. Kafka sits between agents and Elasticsearch to absorb spikes — this is the battle-tested Elastic recommendation for any >10GB/day deployment. Loki (Grafana) inverts the model: store raw compressed log chunks, index only labels, and push the query cost to read time; this is 10× cheaper to store but slow for ad-hoc field search. The real challenge at scale is index lifecycle management: automatically rolling over hot indices to warm (fewer replicas, HDDs) then to cold (S3) without dropping in-flight queries.

Common mistakes

No ingest buffer (spikes drop logs)
Indexing everything at full retention (cost)
Grep over raw files instead of an index

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →