Academy · Reliability & Observability

Observability: metrics, logs, traces

The problem

At scale you can't SSH into a box and look around: one request may touch many services on many hosts. When something is slow or broken, how do you even find it?

The idea

Instrument everything across three pillars: metrics, logs, and distributed traces.

How it works

Metrics are numbers over time (latency, error rate, saturation) and power dashboards and alerts. Logs capture discrete events with their context. Traces follow one request across every service it touches, which exposes the slow hop. Define SLOs and an error budget (the unreliability the SLO permits, i.e. 1 minus the target) to decide when to ship and when to stabilize.
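
To make the error budget concrete, here is a minimal sketch of the arithmetic. The function names, the request-based SLO, and the "ship while any budget remains" rule are assumptions chosen for illustration, not a standard API or policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    Assumes a request-based SLO, e.g. slo_target=0.999 means 99.9% of
    requests in the window must succeed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_ship(budget_remaining: float, min_remaining: float = 0.0) -> bool:
    """Hypothetical policy: keep shipping only while budget is left."""
    return budget_remaining > min_remaining


remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"budget remaining: {remaining:.0%}, ship: {should_ship(remaining)}")
```

With a 99.9% SLO over 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. Once the remaining fraction goes negative, the same rule says to stop shipping and stabilize.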

The tradeoff

Instrumentation costs overhead and storage; too many noisy alerts cause fatigue. Signal over noise is the art.
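
One common way to trade a little detection delay for less noise is to fire only when a condition persists. The sketch below is an assumption-laden illustration: the class name, threshold, and check count are hypothetical, and real alerting systems offer their own equivalents.

```python
from collections import deque


class SustainedAlert:
    """Fire only when the error rate exceeds a threshold for N checks in a row."""

    def __init__(self, threshold: float, required_checks: int):
        self.threshold = threshold
        # Keep only the most recent `required_checks` breach results.
        self.recent = deque(maxlen=required_checks)

    def check(self, error_rate: float) -> bool:
        """Record one evaluation and report whether the alert should fire."""
        self.recent.append(error_rate > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


alert = SustainedAlert(threshold=0.05, required_checks=3)
for rate in [0.10, 0.01, 0.10, 0.10, 0.10]:
    print(rate, alert.check(rate))
```

A single spike does not page anyone here; only the third consecutive breach does. The cost is that a real incident is detected a few evaluation intervals later.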

In the wild

Prometheus + Grafana (metrics), the ELK stack of Elasticsearch, Logstash, and Kibana (logs), Jaeger (traces), and OpenTelemetry for vendor-neutral instrumentation across all three.

Interview deep dive

Flow

Define the user-facing SLI before picking tools.
Emit RED (rate, errors, duration) or USE (utilization, saturation, errors) metrics from every critical service.
Attach a trace ID at ingress and propagate it through calls.
Alert on symptoms users feel, then inspect logs/traces for cause.
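
The middle two steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the `X-Trace-Id` header name, the in-memory `red` store, and the helper names are all hypothetical, and a production service would normally rely on a tracing and metrics library instead.

```python
import contextvars
import time
import uuid
from collections import defaultdict

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch
_trace_id = contextvars.ContextVar("trace_id", default=None)

# RED metrics per service: request count (rate), errors, duration samples.
red = defaultdict(lambda: {"requests": 0, "errors": 0, "durations": []})


def ingress(headers: dict) -> str:
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    """Headers to attach to every downstream call so the trace stays connected."""
    trace_id = _trace_id.get()
    return {TRACE_HEADER: trace_id} if trace_id else {}


def observe(service: str, handler, *args):
    """Run `handler`, recording RED metrics for `service`."""
    start = time.perf_counter()
    red[service]["requests"] += 1
    try:
        return handler(*args)
    except Exception:
        red[service]["errors"] += 1
        raise
    finally:
        # Duration is recorded for successes and failures alike.
        red[service]["durations"].append(time.perf_counter() - start)


ingress({})  # the edge service starts a new trace
observe("checkout", lambda: "ok")
print(outgoing_headers(), red["checkout"]["requests"])
```

If any hop forgets to forward `outgoing_headers()`, the downstream service starts a fresh trace and the request story splits in two.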

Watch for

Dashboard count is not observability.
A missing trace context breaks the request story.
Alert fatigue hides real incidents.

Interviewer trap

Tie each signal to a decision: rollback, scale, page someone, or keep shipping.
