Observability: metrics, logs, traces
The problem
At scale you can't SSH in and look around. When something's slow or broken, how do you even find it?
The idea
Instrument everything across three pillars: metrics, logs, and distributed traces.
How it works
Metrics (numbers over time: latency, error rate, saturation) power dashboards and alerts. Logs capture discrete events. Traces follow one request across all services to find the slow hop. Define SLOs and an error budget (the amount of failure the SLO still permits) to decide when to ship versus stabilize.
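The error-budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `ErrorBudget` class, its field names, and the "ship while any budget remains" policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: tracks how much failure an SLO still permits."""
    slo: float            # target success ratio, e.g. 0.999
    total_requests: int   # requests served in the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched, 0.0 means spent, negative means overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_ship(self) -> bool:
        # Assumed policy: keep shipping while any budget remains.
        return self.remaining_fraction > 0.0


# A 99.9% SLO over 1,000,000 requests permits about 1,000 failures.
budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.allowed_failures), round(budget.remaining_fraction, 2), budget.can_ship())
```

With 400 failures against roughly 1,000 allowed, about 60% of the budget is left, so this policy says keep shipping. Once failures pass the allowance, the remaining fraction goes negative and the same check says stabilize.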
The tradeoff
Instrumentation adds runtime overhead and storage cost, and too many noisy alerts cause alert fatigue. Signal over noise is the art.
In the wild
Prometheus + Grafana (metrics), the ELK stack (logs), Jaeger/OpenTelemetry (traces).
Interview deep dive
Flow
- Define the user-facing SLI before picking tools.
- Emit RED (rate, errors, duration) or USE (utilization, saturation, errors) metrics from every critical service.
- Attach a trace ID at ingress and propagate it through calls.
- Alert on symptoms users feel, then inspect logs/traces for cause.
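As a rough sketch of the RED step in the flow above, the recorder below keeps request rate, error ratio, and a p95 duration for one service. It is an in-memory illustration only: the `RedMetrics` class and its field names are hypothetical, and a real service would export these values to a metrics system rather than hold raw samples in a list.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedMetrics:
    """Illustrative in-memory RED recorder for one service and one time window."""
    window_seconds: float
    durations_ms: list = field(default_factory=list)
    errors: int = 0

    def observe(self, duration_ms: float, ok: bool) -> None:
        # Record one finished request.
        self.durations_ms.append(duration_ms)
        if not ok:
            self.errors += 1

    def snapshot(self) -> dict:
        count = len(self.durations_ms)
        if count == 0:
            return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
        ordered = sorted(self.durations_ms)
        # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
        rank = math.ceil(95 * count / 100)
        return {
            "rate_per_s": count / self.window_seconds,  # Rate
            "error_ratio": self.errors / count,         # Errors
            "p95_ms": ordered[rank - 1],                # Duration
        }


red = RedMetrics(window_seconds=10.0)
for i in range(1, 21):
    # 20 requests taking 10 ms to 200 ms; only the slowest one fails.
    red.observe(duration_ms=10.0 * i, ok=(i != 20))
print(red.snapshot())  # {'rate_per_s': 2.0, 'error_ratio': 0.05, 'p95_ms': 190.0}
```

A symptom alert would fire on `error_ratio` or `p95_ms` crossing the SLI threshold. The logs and traces are then where the cause is found.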
Watch for
- Dashboard count is not observability.
- A missing trace context breaks the request story.
- Alert fatigue hides real incidents.
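To make the missing-trace-context point concrete, here is a small simulation of two services using only the Python standard library. The service names, the `X-Trace-Id` header, and the `LOG_LINES` list are assumptions for illustration; real systems usually rely on a tracing library and a standard header such as W3C `traceparent`.

```python
import contextvars
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name chosen for this sketch
_trace_id = contextvars.ContextVar("trace_id", default=None)
LOG_LINES = []  # stands in for a log pipeline


def log(service, message):
    # Every log line carries the current trace ID so lines can be joined later.
    LOG_LINES.append({"service": service, "trace_id": _trace_id.get(), "msg": message})


def start_request(service, headers):
    # Reuse the caller's trace ID if one arrived; otherwise mint one at ingress.
    _trace_id.set(headers.get(TRACE_HEADER) or uuid.uuid4().hex)
    log(service, "request received")


def outbound_headers():
    # Every outgoing call copies the current trace ID into its headers.
    return {TRACE_HEADER: _trace_id.get()}


def inventory_service(headers):
    start_request("inventory", headers)
    return _trace_id.get()


def checkout_service(headers, propagate=True):
    start_request("checkout", headers)
    downstream_headers = outbound_headers() if propagate else {}
    # Run the callee in an empty context to mimic a separate process.
    downstream_id = contextvars.Context().run(inventory_service, downstream_headers)
    return _trace_id.get(), downstream_id


same = checkout_service({})
broken = checkout_service({}, propagate=False)
print(same[0] == same[1])      # True: both hops share one trace ID
print(broken[0] == broken[1])  # False: the downstream hop started a new trace
```

In the second call the downstream service mints its own ID, so its log lines can no longer be joined to the request that caused them.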
Interviewer trap
Tie each signal to a decision: rollback, scale, page someone, or keep shipping.
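One way to practice this is to write the mapping down explicitly. The sketch below is purely illustrative: the `Signals` fields, the 0.9 saturation threshold, and the 0.001 error-ratio target are made-up assumptions, and a real policy would come from the team's own SLOs and runbooks.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    error_ratio: float       # current share of failing requests
    saturation: float        # utilization of the busiest resource, 0.0 to 1.0
    deployed_recently: bool  # did a release go out shortly before the change?


def decide(s: Signals, slo_error_ratio: float = 0.001) -> str:
    """Map signals to one of the four decisions (thresholds are hypothetical)."""
    breaching = s.error_ratio > slo_error_ratio
    if breaching and s.deployed_recently:
        return "rollback"       # errors right after a release point at the release
    if s.saturation > 0.9:
        return "scale"          # a resource is close to exhausted
    if breaching:
        return "page someone"   # users are hurting and the cause is unknown
    return "keep shipping"      # within SLO with headroom


print(decide(Signals(error_ratio=0.05, saturation=0.4, deployed_recently=True)))  # rollback
```

The order of the checks encodes priority: a breach right after a deploy is handled before capacity, and an unexplained breach goes to a human.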