System Design Library

Distributed Tracing (Jaeger)

Follow one request across dozens of services to find the slow/broken hop.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Trace context propagation
Spans per service
Assemble traces
Search/visualize

Non-functional

Low overhead
Useful sampling

Scale

Billions of spans/day

The approach

Each service creates spans with a shared trace ID propagated via headers; spans shipped async to a collector that assembles full traces; sampling (head or tail) controls volume; stored for search/flame-graph views.

Key components

Instrumented services → collector → trace store → UI

Numbers that matter

At 1,000 req/s, 100% trace capture generates ~50–200MB/s of span data — 1–5% head sampling is the standard default to stay under budget.
A typical microservice span contains ~30–50 fields (service name, operation, start time, duration, tags, logs); a 20-hop trace is ~1–2KB of structured data.
Jaeger's Cassandra backend handles ~100k spans/second per node; production deployments at Uber-scale require a tiered architecture with Kafka buffering.
Tail-based sampling buffers spans in memory for the full request duration — at P99 latency of 5s and 50k req/s, that's 250k in-flight traces needing ~250MB–1GB buffer per collector.

Senior deep-dive

Trace context propagation is the hardest part — the infrastructure can be perfect, but if one service in a 10-hop call doesn't forward the trace header, the trace is broken and you see two disconnected fragments.

Sampling is mandatory at scale, but head-based sampling loses the interesting traces (errors and slow outliers are rare) while tail-based sampling buffers entire traces in memory until completion, adding infrastructure cost.

The latency attribution problem is the core value: without tracing, you know a request took 2s; with tracing, you know 1.8s was a synchronous call to an upstream service waiting on a cold DB query.

Trace context propagation: the single point of failure

The entire distributed tracing contract rests on every service in the call graph propagating the `traceparent` / B3 header on every outbound call. One service that fails to forward (e.g., an async job that doesn't copy headers when it fans out) creates a trace split — two disconnected fragments that look like separate requests. Auto-instrumentation via agents (Java agent, OpenTelemetry auto-instrumentation) handles common frameworks, but custom in-house RPC frameworks require manual SDK integration. Testing for propagation gaps requires an integration test that verifies the full trace assembles — unit tests can't catch this.

Sampling strategies: head vs tail, and why it matters

Head-based sampling decides at the trace root (1% of requests get traced) — simple, low memory, but statistically unlikely to capture the rare 10s outlier or the 0.01% error. Tail-based sampling collects all spans and decides after the trace completes — captures every error and slow trace, but requires buffering the entire in-flight trace window in memory. Jaeger's adaptive sampling adjusts rates per operation to hit a target RPS of sampled traces. The operational reality: most teams start with head-based for cost control and layer tail-based sampling only for error traces via a priority sampling flag set by services on error.

Span design: what to tag and what not to

High-cardinality tags (user ID, session ID, request body content) create an index explosion in backends like Elasticsearch — each unique value is a new posting list entry. Jaeger and Zipkin warn about this but don't enforce limits. Low-cardinality operation names (`POST /api/checkout`, not `POST /api/checkout/user:12345`) are the rule. The useful tags are: status code, error boolean, DB statement (sanitized), external service name, and queue topic. Logs attached to spans (structured events within a span's timeline) are the right home for per-request detail, not span tags.

Storage and retention: the cost cliff

Traces are write-heavy, read-rarely — you write every sampled trace but only read during an incident or a performance investigation. This means cheap columnar or object storage with an index on trace ID and service name is the right backend. Jaeger supports Cassandra, Elasticsearch, and object storage (via Tempo). Hot-warm-cold tiering: keep 24h in Elasticsearch for fast incident response, 7–30 days in object storage (Parquet on S3) for post-incident analysis. Retaining full traces beyond 30 days is rarely justified — aggregate metrics (error rate, latency histograms) serve the long-term analytical need at 100× less cost.

Clock skew: the subtle correctness problem

Distributed spans report start/end times from their host clocks. NTP clock drift of 1–10ms is common and larger than many span durations, meaning a child span can appear to start before its parent — making waterfall views confusing. Google Dapper handles this by computing adjusted timestamps based on the network propagation delay measured from parent-to-child timestamps at collection time. Most open-source tools just display the raw times and add a "clock skew" warning. The practical fix is ensuring all hosts run PTP (Precision Time Protocol) or GPS-disciplined NTP, which limits drift to <1ms.

What breaks at scale

Collector overload during incident spikes is the primary failure: an error storm at 10× normal rate floods the trace collector with rejected spans because the backend can't ingest fast enough. Kafka as a buffer between SDK exporters and the collector backend is the standard mitigation — it decouples ingestion rate from storage rate. The second failure is trace incompleteness under load: when a service sheds load (circuit breaker, rate limit), it may drop spans rather than queue them, producing traces with missing hops. Span buffering with local disk spill in the SDK is the defense, though it adds complexity to the client library.

In production

Uber built Jaeger to handle the scale that Zipkin couldn't sustain for their polyglot microservices environment. Google's Dapper (the paper that started it all) used a 0.01% sampling rate for production traffic — aggressive because their scale was so enormous. Netflix uses Zipkin with adaptive sampling that increases sampling rate when error rates spike, catching the exact traces you need during an incident. The real challenge is heterogeneous infrastructure: when a trace spans a Go service, a Python Lambda, a Kafka consumer, and a Redis call, each leg needs its own instrumentation library and they all must agree on the header format (B3 vs W3C TraceContext). The OpenTelemetry project exists precisely to solve this fragmentation.

Common mistakes

Not propagating context (broken traces)
100% sampling (cost) or naive head sampling (misses rare errors)
Synchronous span export (latency)

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →