System Design Library

Metrics & Monitoring (Datadog)

Ingest billions of metric points, store time-series cheaply, alert in realtime.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Ingest metrics
  • Time-series storage
  • Queries/dashboards
  • Alerting rules

Non-functional

  • High write throughput
  • Cheap long retention
  • Fast range queries

Scale

Billions of points/min

The approach

Agents push metrics → ingestion → a time-series DB (downsampling + compression + rollups); a query layer for dashboards; an alerting engine evaluates rules against streams/recent data.

Key components

Agents → ingestion → TSDB (rollups) → query + alerting engine

Numbers that matter

Senior deep-dive

The write path must never block — every metric point must land in under 1ms or you lose data under load spikes.

Cardinality is the real enemy: a label with 10k unique values explodes a single metric into 10k series, each needing its own compressed block and index entry.

Alerting staleness kills trust — if an alert fires 5 minutes after the outage, on-call engineers stop trusting it; keep the evaluation loop under 30 seconds.

Ingestion: pull vs push is an ops philosophy

Pull (Prometheus) makes the monitoring system authoritative over what is scraped — easy to discover drift, but requires service discovery and fails for short-lived jobs. Push (Datadog, StatsD) works everywhere but lets the source control the flood. At scale, pre-aggregation at the agent (DogStatsD flush every 10s) is non-negotiable: raw per-request histograms would saturate any network.

Storage: compressed time-series blocks are the core

Gorilla-style delta-of-delta + XOR encoding exploits that consecutive timestamps differ by the same interval and values change slowly, hitting ~1.4 bytes/sample. Data is written into an in-memory WAL + active block, then sealed into immutable 2-hour blocks that compact over time. Compaction merges overlapping blocks and runs tombstone removal — skipping it causes read amplification, which is the most common Prometheus performance complaint.

Cardinality: the silent killer

Each unique label combination creates a separate time series with its own compressed stream and index entry. A metric like `http_requests{path=<user-specific>}` with 1M paths creates 1M series and blows the head block (always in-memory) into gigabytes. The fix is label value cardinality budgets enforced at ingest: reject or drop series that exceed per-metric limits before they land.

Alerting: evaluation freshness vs fan-out cost

Alert rules re-evaluate on a configurable interval (Prometheus default: 1m); for P0 SLOs you want 15–30s. But every rule runs a full PromQL query against the TSDB on every tick — a complex aggregation over 100k series can take seconds, creating alert evaluation lag that means you detect an outage minutes after it started. The fix: recording rules pre-compute expensive aggregations as new metric series so alert queries stay cheap.

Long-term storage: object store as the archive tier

Thanos Sidecar uploads sealed 2-hour TSDB blocks to S3/GCS in the background; a Thanos Query layer fans out reads across live Prometheus nodes and object store. Cortex/Mimir add a write path through a Dynamo-style ring for fully managed multi-tenant ingest. The key insight: object storage is the cheap durable tier — querying it is slow (several seconds for a week-range query) but acceptable for dashboards; real-time alerts still hit the local Prometheus.

What breaks at scale

Head block memory exhaustion is the most common production failure: high cardinality forces the in-memory index to grow until OOM kill, and Prometheus just dies. Thundering-herd scrapes at t=0 for thousands of targets spike the network and CPU — jitter the scrape_interval by hash(target). Alert flapping under brief spikes generates pager spam; use `for: 5m` pending windows and track burn rate over a long window rather than threshold crossings.

In production

Datadog runs a custom C++ aggregation agent that pre-aggregates at the host before shipping, slashing ingest volume. Prometheus uses a pull model (scrape targets every 15s) which is operationally simple but breaks for ephemeral jobs — hence Pushgateway. Thanos and Cortex bolt on object-store-backed long-term storage by uploading 2-hour TSDB blocks to S3, solving Prometheus's single-node retention limit. The real challenge is cardinality budgeting across teams: a badly labeled Kubernetes pod metric added by one service can bring down the entire TSDB.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →