Metrics & Monitoring (Datadog)
Ingest billions of metric points, store time-series cheaply, alert in realtime.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Ingest metrics
- Time-series storage
- Queries/dashboards
- Alerting rules
Non-functional
- High write throughput
- Cheap long retention
- Fast range queries
Scale
Billions of points/min
The approach
Agents push metrics → ingestion → a time-series DB (downsampling + compression + rollups); a query layer for dashboards; an alerting engine evaluates rules against streams/recent data.
Key components
Agents → ingestion → TSDB (rollups) → query + alerting engine
Numbers that matter
- Prometheus with Gorilla compression achieves ~1.37 bytes per sample (down from 16 bytes raw), enabling months of retention on a single node for moderate cardinality.
- A single Datadog agent can ship ~10,000 metric series per host; at 10-second resolution that is 1k writes/sec into the aggregation tier before any rollup.
- High-cardinality time-series stores like Thanos or Cortex horizontally shard and can handle hundreds of millions of active series by splitting the TSDB into per-tenant blocks.
- Downsampling (5s → 1m → 5m → 1h) cuts long-term storage by 50–100× for a 90-day window at the cost of sub-minute query granularity.
Senior deep-dive
The write path must never block — every metric point must land in under 1ms or you lose data under load spikes.
Cardinality is the real enemy: a label with 10k unique values explodes a single metric into 10k series, each needing its own compressed block and index entry.
Alerting staleness kills trust — if an alert fires 5 minutes after the outage, on-call engineers stop trusting it; keep the evaluation loop under 30 seconds.
Ingestion: pull vs push is an ops philosophy
Pull (Prometheus) makes the monitoring system authoritative over what is scraped — easy to discover drift, but requires service discovery and fails for short-lived jobs. Push (Datadog, StatsD) works everywhere but lets the source control the flood. At scale, pre-aggregation at the agent (DogStatsD flush every 10s) is non-negotiable: raw per-request histograms would saturate any network.
Storage: compressed time-series blocks are the core
Gorilla-style delta-of-delta + XOR encoding exploits that consecutive timestamps differ by the same interval and values change slowly, hitting ~1.4 bytes/sample. Data is written into an in-memory WAL + active block, then sealed into immutable 2-hour blocks that compact over time. Compaction merges overlapping blocks and runs tombstone removal — skipping it causes read amplification, which is the most common Prometheus performance complaint.
Cardinality: the silent killer
Each unique label combination creates a separate time series with its own compressed stream and index entry. A metric like `http_requests{path=<user-specific>}` with 1M paths creates 1M series and blows the head block (always in-memory) into gigabytes. The fix is label value cardinality budgets enforced at ingest: reject or drop series that exceed per-metric limits before they land.
Alerting: evaluation freshness vs fan-out cost
Alert rules re-evaluate on a configurable interval (Prometheus default: 1m); for P0 SLOs you want 15–30s. But every rule runs a full PromQL query against the TSDB on every tick — a complex aggregation over 100k series can take seconds, creating alert evaluation lag that means you detect an outage minutes after it started. The fix: recording rules pre-compute expensive aggregations as new metric series so alert queries stay cheap.
Long-term storage: object store as the archive tier
Thanos Sidecar uploads sealed 2-hour TSDB blocks to S3/GCS in the background; a Thanos Query layer fans out reads across live Prometheus nodes and object store. Cortex/Mimir add a write path through a Dynamo-style ring for fully managed multi-tenant ingest. The key insight: object storage is the cheap durable tier — querying it is slow (several seconds for a week-range query) but acceptable for dashboards; real-time alerts still hit the local Prometheus.
What breaks at scale
Head block memory exhaustion is the most common production failure: high cardinality forces the in-memory index to grow until OOM kill, and Prometheus just dies. Thundering-herd scrapes at t=0 for thousands of targets spike the network and CPU — jitter the scrape_interval by hash(target). Alert flapping under brief spikes generates pager spam; use `for: 5m` pending windows and track burn rate over a long window rather than threshold crossings.
In production
Datadog runs a custom C++ aggregation agent that pre-aggregates at the host before shipping, slashing ingest volume. Prometheus uses a pull model (scrape targets every 15s) which is operationally simple but breaks for ephemeral jobs — hence Pushgateway. Thanos and Cortex bolt on object-store-backed long-term storage by uploading 2-hour TSDB blocks to S3, solving Prometheus's single-node retention limit. The real challenge is cardinality budgeting across teams: a badly labeled Kubernetes pod metric added by one service can bring down the entire TSDB.
Common mistakes
- Storing metrics in a relational DB
- No downsampling/rollups (storage blowup)
- Alert evaluation that scans raw history