System Design Library

Time-Series DB (Prometheus)

Store and query high-cardinality metrics efficiently.


Requirements

Functional

Ingest series
Range queries
Downsampling
Labels/cardinality

Non-functional

Append-heavy efficiency
Cheap storage
Fast range scans

Scale

Millions of active series

The approach

Each series (metric+labels) is an append-only stream; samples compressed with delta-of-delta + XOR (Gorilla) for ~1-2 bytes/point; data in time-windowed blocks; a pull model scrapes targets; downsampling for retention.

Key components

Scraper (pull) → TSDB blocks (compressed) → query engine

Numbers that matter

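The Gorilla paper reports an average of ~1.37 bytes/sample vs 16 bytes raw (8-byte timestamp + 8-byte float64), and Prometheus's Gorilla-derived chunk encoding typically lands in the same 1-2 bytes/sample range, which works out to roughly 700M samples per GB of storage.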
A single Prometheus server can comfortably handle ~1–2 million active time series in its head block before memory pressure causes degraded query performance.
Default scrape interval is 15s; at 1M series and a 15s interval that is ~67,000 samples/sec ingested. Each scrape is a single synchronous HTTP GET, even for a target exposing 10,000 series, and it must complete within `scrape_timeout` (10s by default) or the whole scrape fails.
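The Thanos compactor, running against blocks in object storage such as S3, creates 5m-resolution data for blocks older than 40 hours and 1h-resolution data for blocks older than 10 days. Going from 5s to 5m resolution leaves ~60× fewer points per series for long-range queries, although each downsampled point stores several aggregates, and storage only shrinks once the raw blocks are dropped by a retention policy.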
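
The ingest and storage figures above follow from simple arithmetic. A minimal sketch, assuming every series is scraped exactly once per interval and using the 1.37 and 16 bytes/sample figures quoted above (the function names are illustrative only):

```python
def ingest_rate(active_series: int, scrape_interval_s: float) -> float:
    """Samples per second when every series is scraped once per interval."""
    return active_series / scrape_interval_s


def bytes_per_day(samples_per_sec: float, bytes_per_sample: float) -> float:
    """Bytes written per day at a steady ingest rate."""
    return samples_per_sec * 86_400 * bytes_per_sample


rate = ingest_rate(1_000_000, 15)
compressed = bytes_per_day(rate, 1.37)  # Gorilla-style average
raw = bytes_per_day(rate, 16)           # 8-byte timestamp + 8-byte float64

print(f"{rate:,.0f} samples/s")  # 66,667 samples/s
print(f"{compressed / 1e9:.1f} GB/day compressed vs {raw / 1e9:.1f} GB/day raw")
# 7.9 GB/day compressed vs 92.2 GB/day raw
```

Real disk usage also includes the index and WAL, so treat the result as a lower bound.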

Senior deep-dive

The series identity problem is more important than compression: Prometheus stores each unique `{metric, label-set}` combination as a separate time series, and high-cardinality labels (user ID, request ID, pod IP) create millions of series that exhaust the in-memory head block long before storage fills.

Gorilla-style compression assumes smooth, predictable data: delta-of-delta for timestamps and XOR for values achieve ~1.37 bytes/sample only when the data is well-behaved. Noisy values, counter resets, and irregular scrape intervals all inflate the encoding significantly.

The pull model makes the monitoring system authoritative: Prometheus knows what should exist (via service discovery) and which targets are down (a failed scrape is recorded explicitly as `up == 0`, not merely as missing data). This is operationally invaluable, but it breaks for ephemeral or serverless jobs that live for less than one scrape interval (15s by default).

Head block: the in-memory write path

Prometheus keeps the most recent data (roughly the last 2 to 3 hours) in the head block, held in memory and backed by a WAL on disk for crash recovery. The head block maintains an inverted index (label name/value pair → series IDs) and an open chunk of compressed samples for each active series. Every new unique label combination creates a new series entry in both the inverted index and the chunk store, so cardinality is a memory multiplier, not just a storage concern. About every 2 hours, the oldest 2 hours of head data are written to disk as an immutable block.
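
The relationship between label sets, the inverted index, and per-series chunks can be sketched as a toy in-memory structure. This is an illustration under simplifying assumptions (no compression, no WAL, equality matchers only), not Prometheus's actual implementation; `HeadIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class HeadIndex:
    """Toy head block: one entry per unique label set, plus postings lists."""

    def __init__(self):
        self.series_ids = {}              # frozen label set -> series ID
        self.postings = defaultdict(set)  # (label name, value) -> series IDs
        self.chunks = {}                  # series ID -> list of (ts_ms, value)

    def append(self, labels: dict, ts_ms: int, value: float) -> int:
        key = frozenset(labels.items())
        sid = self.series_ids.get(key)
        if sid is None:
            # A new label combination costs memory in every structure.
            sid = len(self.series_ids)
            self.series_ids[key] = sid
            self.chunks[sid] = []
            for pair in labels.items():
                self.postings[pair].add(sid)
        self.chunks[sid].append((ts_ms, value))
        return sid

    def select(self, matchers: dict) -> set:
        """Series IDs matching all equality matchers (postings intersection)."""
        sets = [self.postings.get(pair, set()) for pair in matchers.items()]
        return set.intersection(*sets) if sets else set(self.chunks)


idx = HeadIndex()
for user in range(1000):
    idx.append({"__name__": "http_requests_total", "method": "GET",
                "user_id": str(user)}, 0, 1.0)
# A second sample for an existing label set reuses its series.
idx.append({"__name__": "http_requests_total", "method": "GET",
            "user_id": "0"}, 15_000, 2.0)

print(len(idx.series_ids))  # 1000
```

One metric with a `user_id` label already produces 1000 series here, each with its own index entries and chunk, which is the memory multiplier described above.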

Gorilla encoding: temporal locality exploitation

Delta-of-delta encoding for timestamps: if samples arrive every 15s, the delta is always 15000ms, so the delta-of-delta is 0 and encodes in 1 bit. XOR encoding for values: consecutive float64 values in a metric often differ only in a few bits (e.g., a gauge that moves slowly), so XOR with the previous value has many leading/trailing zeros that compress to near-zero bits. A counter reset (when a process restarts) produces a large jump in value, so the XOR shares few bits with the previous sample and that sample encodes at close to full cost; irregular scrape timing does the same to the timestamp stream. Prometheus does not store resets separately: `rate()` and `increase()` detect them at query time by treating any decrease in a counter as a reset.
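
To make the two encodings concrete, the following sketch computes the quantities that Gorilla-style compression actually writes: the delta-of-delta of timestamps, and the width of the non-zero window in the XOR of consecutive float64 values. It stops short of bit-packing, and the helper names are illustrative.

```python
import struct


def delta_of_deltas(timestamps):
    """Second-order differences of a timestamp sequence (ms)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x: float) -> int:
    """IEEE 754 bit pattern of a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_meaningful_bits(prev: float, cur: float) -> int:
    """Bits between the leading and trailing zero runs of prev XOR cur."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 0  # identical value: the encoder writes a single control bit
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


# Regular 15s scrapes with 2ms of jitter on the last one.
print(delta_of_deltas([0, 15_000, 30_000, 45_000, 60_002]))  # [0, 0, 2]

print(xor_meaningful_bits(12.0, 12.0))  # 0
print(xor_meaningful_bits(12.0, 24.0))  # 1 (only one exponent bit differs)
print(xor_meaningful_bits(12.0, 12.1))  # many mantissa bits differ
```

Steady timestamps and slowly changing values produce zeros and very short XOR windows, and that is where the low bytes-per-sample average comes from.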

Pull model: the operational advantages (and limits)

Prometheus scrapes targets (HTTP GET `/metrics`) on a configurable interval via service discovery (Kubernetes SD, Consul SD, etc.). This means Prometheus knows every target that should exist and can detect missing targets (scrape failures) as first-class events — not just missing data. The fundamental limit: scrape targets must be reachable from the Prometheus server, which breaks in firewall-heavy environments and for short-lived jobs (Lambda, batch jobs) that complete before the scrape fires. The fix for short-lived jobs is Pushgateway, but it introduces a stateful component with its own failure modes.
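
A single scrape cycle can be sketched as follows. This is a simplified illustration, not Prometheus's scraper: the parser ignores escaping and optional timestamps in the exposition format, and `scrape_once`, `parse_exposition`, and the injectable `fetch` parameter are hypothetical names. `up` and `scrape_duration_seconds` mirror the synthetic series Prometheus records for each scrape.

```python
import time
import urllib.request


def http_get(url: str, timeout_s: float) -> str:
    """Fetch a /metrics page; raises OSError subclasses on failure or timeout."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def parse_exposition(text: str) -> dict:
    """Parse 'series value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape_once(url: str, timeout_s: float = 10.0, fetch=http_get) -> dict:
    """One scrape: a failure yields up=0 instead of silently missing data."""
    start = time.monotonic()
    try:
        samples = parse_exposition(fetch(url, timeout_s))
        up = 1
    except (OSError, ValueError):
        samples, up = {}, 0
    samples["up"] = up
    samples["scrape_duration_seconds"] = time.monotonic() - start
    return samples


# Stand-in fetchers so the example runs without a network.
def healthy(url, timeout_s):
    return '# TYPE http_requests_total counter\nhttp_requests_total{method="GET"} 1027\n'


def unreachable(url, timeout_s):
    raise TimeoutError("scrape timed out")


print(scrape_once("http://target:9100/metrics", fetch=healthy)["up"])      # 1
print(scrape_once("http://target:9100/metrics", fetch=unreachable)["up"])  # 0
```

Because the failure path still emits a sample, an alert on `up == 0` can distinguish a dead target from one that was never configured.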

TSDB blocks: immutable compaction lifecycle

After the head block is sealed, it becomes a 2-hour block on disk. The compactor periodically merges contiguous blocks into larger ones, growing by a factor of three per level (2-hour → 6-hour → 18-hour and so on) up to 10% of the retention time or 31 days, whichever is smaller. This reduces file count and improves query performance (fewer blocks to open per range query). Each block is a self-contained directory with chunks, an index, and metadata, which makes block-level upload to object storage (Thanos/Cortex) natural. Prometheus itself does not downsample; the Thanos compactor adds that step, creating 5m-resolution and 1h-resolution variants of old blocks for faster long-range queries.
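
The block sizes the compactor can produce depend on the retention setting. A rough sketch, assuming block ranges grow by a factor of three from the 2-hour base and are capped at 10% of retention or 31 days, whichever is smaller (the function names are illustrative, not Prometheus APIs):

```python
HOUR_MS = 3_600_000


def block_ranges(min_range_ms: int = 2 * HOUR_MS, steps: int = 10, factor: int = 3):
    """Candidate block durations: 2h, 6h, 18h, 54h, ..."""
    return [min_range_ms * factor**i for i in range(steps)]


def max_block_duration_ms(retention_ms: int) -> int:
    """Largest allowed block: 10% of retention, never more than 31 days."""
    return min(retention_ms // 10, 31 * 24 * HOUR_MS)


def usable_ranges(retention_ms: int):
    """Block durations the compactor may produce for a given retention."""
    cap = max_block_duration_ms(retention_ms)
    return [r for r in block_ranges() if r <= cap]


retention_15d = 15 * 24 * HOUR_MS
print([r // HOUR_MS for r in usable_ranges(retention_15d)])  # [2, 6, 18]
```

With 15 days of retention the cap is 36 hours, so the 54-hour level is never reached and a 15-day range query opens on the order of 20 blocks rather than 180 two-hour ones.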

Alerting and recording rules: PromQL at evaluation time

Recording rules pre-compute expensive PromQL expressions as new time series, e.g., a per-job request rate aggregated across all instances and stored as a single series named `job:http_requests:rate5m` (following the `level:metric:operations` convention). This transforms a fan-out query across 1000 series into a single series lookup at alert evaluation time. Alert rules evaluate at a configurable interval; the `for` clause is a duration, and the condition must hold at every evaluation throughout it before the alert fires. Until then the alert sits in the pending state, which prevents flapping on transient spikes. Setting `for: 0s` (or omitting `for`) with a short evaluation interval causes alert spam on every momentary spike.
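
The pending/firing behaviour of the `for` clause is a small state machine. The sketch below is a simplified model of it, assuming evaluation times are passed in explicitly; `AlertState` is a hypothetical name and not part of any Prometheus client library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_s: float                          # the rule's `for` duration, in seconds
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, condition: bool) -> str:
        if not condition:
            self.active_since = None      # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        held_for = now - self.active_since
        return "firing" if held_for >= self.for_s else "pending"


# A single-evaluation spike, evaluated every 60s.
spike = [False, True, False, False]

guarded = AlertState(for_s=300)
print([guarded.evaluate(t * 60, c) for t, c in enumerate(spike)])
# ['inactive', 'pending', 'inactive', 'inactive']

unguarded = AlertState(for_s=0)
print([unguarded.evaluate(t * 60, c) for t, c in enumerate(spike)])
# ['inactive', 'firing', 'inactive', 'inactive']
```

The same spike never leaves the pending state with a 5-minute `for`, but fires immediately when `for` is zero.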

What breaks at scale

Cardinality explosion from a single mislabeled metric is the most common production failure: a developer adds a `request_url` label to a counter, creating one series per URL (potentially millions), and the Prometheus head block grows from 2GB to 20GB overnight, triggering an OOM kill. Slow scrapes are the next: a target that takes 14s to respond exceeds the default 10s `scrape_timeout`, so the scrape fails and its samples are lost, creating holes in the metrics timeline exactly when the system is under stress and you most need the data. Finally, PromQL range queries over long windows (e.g., `rate(...[7d])`) on a single Prometheus server must read 7 days of samples for every matching series, causing query-induced OOM on under-resourced servers.
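
The effect of one extra label is multiplicative, which a quick estimate makes visible. A minimal sketch, assuming the worst case where every combination of label values occurs; the label counts and the series budget below are made-up illustrative numbers.

```python
from math import prod


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


def exceeds_budget(label_cardinalities: dict, series_budget: int) -> bool:
    """True if the metric alone could blow through the per-metric budget."""
    return worst_case_series(label_cardinalities) > series_budget


labels = {"method": 5, "status": 8, "instance": 50}
print(worst_case_series(labels))  # 2000

labels_with_url = {**labels, "request_url": 10_000}
print(worst_case_series(labels_with_url))        # 20000000
print(exceeds_budget(labels_with_url, 100_000))  # True
```

A check like this could run in code review or CI against a hypothetical per-metric budget, catching the label before it reaches a shared server.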

In production

Prometheus is the de facto standard for Kubernetes metrics: the kube-state-metrics and node-exporter exporters are installed in virtually every cluster. Thanos and Cortex/Mimir work around Prometheus's single-node storage limit by keeping TSDB blocks in object storage: the Thanos sidecar uploads the 2-hour blocks Prometheus writes, while Cortex and Mimir ingest samples via remote write and build and upload the blocks themselves. InfluxDB uses its own TSM (Time-Structured Merge Tree) storage engine, which applies Gorilla-style XOR compression to floats inside an LSM-like file layout optimized for write throughput, at the cost of more complex tuning. VictoriaMetrics is a Prometheus-compatible TSDB that reports significantly higher ingestion throughput than Prometheus (~5-10×) using an LSM-style merge design. The real production challenge is cardinality budgeting: one engineer adding a high-cardinality label to a shared Prometheus instance can make the entire server OOM within hours.

Common mistakes

Unbounded label cardinality (index blowup)
Using a general-purpose database for time-series data
No downsampling for long retention

Part of System Design Library on SystemLore.