Log Aggregation (ELK)
Collect, index and search logs from thousands of services in realtime.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Ship logs
- Parse/enrich
- Full-text + structured search
- Retention/tiering
Non-functional
- High ingest
- Fast search
- Cost-controlled retention
Scale
TBs/day
The approach
Agents ship logs → a buffer (Kafka) → processors parse/enrich → an inverted-index store (Elasticsearch) for search; hot/warm/cold tiers move old data to cheap storage; sampling/quotas control cost.
Key components
Agents → Kafka buffer → processors → search index (hot/warm/cold)
Numbers that matter
- Elasticsearch's inverted index inflates raw log volume by 5–10×: a 1TB/day log stream needs 5–10TB of index storage before replicas.
- Filebeat can ship ~50,000 log lines/sec per agent before hitting CPU limits; at that rate a 1000-service fleet generates ~50M events/sec needing aggressive Kafka partitioning.
- Segment merging in Elasticsearch is tuned so a shard stays under 50GB — above that, merge amplification and JVM GC pauses make query latency unpredictable.
- Cold-tier storage on S3-backed Elasticsearch (frozen indices) reduces storage cost by 80–90% vs hot SSD nodes, at the cost of 10–30s query latency instead of <1s.
Senior deep-dive
Kafka as the buffer is the most important architectural decision — without it, a downstream Elasticsearch backpressure event causes log agents to OOM and you lose data mid-incident.
Indexing everything is the fast path to bankruptcy: full-text indexing costs ~10× the raw storage, so hot/warm/cold tiering with selective field indexing is mandatory at scale.
Search latency is dominated by segment count, not data volume — too many small segments (a common Logstash misconfiguration) makes queries scan thousands of files.
Agent-side buffering: lose nothing under backpressure
Filebeat uses a disk-backed spool so if Kafka is slow it buffers to local disk rather than dropping lines. Backpressure propagation is the failure mode: if the agent blocks on a full Kafka topic, it stops reading from the log file, and the app's log buffer fills, causing the app to either block or drop logs. The fix is async, fire-and-forget shipping with local disk spooling and an OOM-safe ring buffer.
Kafka partitioning: the throughput multiplier
Each Kafka partition is a sequential append log consumed by one Logstash/processor thread — more partitions means more parallelism into Elasticsearch. Partition by service or log stream, not by host, so a single chatty service doesn't starve others. Retention on the Kafka topic (e.g. 24–48h) is your disaster recovery window: if Elasticsearch falls over, replay from Kafka.
Parsing and enrichment: do it once at ingest
Grok patterns in Logstash parse raw text into structured fields; doing this at query time (Loki's approach) shifts cost to reads, which is painful for incident response. Enrichment — adding geo-IP, Kubernetes pod metadata, service-owner tags — should happen in the pipeline, not the UI, because enrichment data mutates over time and you can't retroactively re-enrich indexed documents.
Indexing strategy: full-text vs structured-only
Indexing every field as full-text (the Elasticsearch default) creates massive term dictionaries for high-cardinality fields like user IDs or UUIDs — these should be `keyword` type with doc_values only, not analyzed. A common optimization is dynamic mapping templates that route numeric and date fields to typed mappings and free-text fields to `text` with aggressive `index_options: docs` (skip term frequencies and positions if you only need presence, not ranking).
Tiering: hot/warm/cold/frozen
Hot nodes (NVMe SSD, high RAM) hold the last 7 days for <200ms search. Warm nodes (SATA SSD, less RAM) hold 7–30 days; shard allocation filtering moves indices automatically via ILM policies. Cold/frozen nodes serve indices directly from S3 via searchable snapshots, loading only queried segments — queries take seconds but storage cost drops 10×. The non-obvious trap: JVM heap pressure on nodes that host too many shards (>20 shards per GB of heap is the Elastic rule of thumb).
What breaks at scale
Split-brain in multi-node Elasticsearch (two masters elected due to network partition) corrupts the cluster state — `minimum_master_nodes = (N/2)+1` prevents it but requires an odd-sized master-eligible set. Index bloat from runaway cardinality (a developer adding a `user_query` field with raw search strings) causes shard sizes to explode overnight. GC pauses on the JVM at high heap utilization (>75%) cause nodes to drop out of the cluster transiently, triggering shard rebalancing storms that make things worse.
In production
The ELK stack (Elasticsearch + Logstash + Kibana) is the canonical pattern, but Logstash's JVM overhead led most large deployments to replace it with Fluentd or Filebeat for agent-side shipping. Kafka sits between agents and Elasticsearch to absorb spikes — this is the battle-tested Elastic recommendation for any >10GB/day deployment. Loki (Grafana) inverts the model: store raw compressed log chunks, index only labels, and push the query cost to read time; this is 10× cheaper to store but slow for ad-hoc field search. The real challenge at scale is index lifecycle management: automatically rolling over hot indices to warm (fewer replicas, HDDs) then to cold (S3) without dropping in-flight queries.
Common mistakes
- No ingest buffer (spikes drop logs)
- Indexing everything at full retention (cost)
- Grep over raw files instead of an index