Latency vs throughput
The problem
People say "make it fast" but mean two different things that you optimize in opposite ways.
The idea
Latency = how long one request takes. Throughput = how many requests per second you can handle. They are not the same.
How it works
Latency is the time a single request takes, reported as percentiles (p50/p99/p999) because the tail is what users feel. Throughput is the number of requests per second the system can sustain. Little's Law ties them together: average concurrency = throughput × average latency. Queueing theory adds the sting: as utilization climbs past ~70–80%, latency rises non-linearly and, in idealized models, grows without bound as utilization approaches 100%, so the last slice of capacity is exactly where tails explode.
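To make the two relationships concrete, here is a minimal sketch. It assumes the textbook M/M/1 queue (one server, random arrivals, exponentially distributed service times), which is an idealization rather than a model of any particular system. The function names and the 10 ms service time are illustrative.

```python
def littles_law_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = throughput x average latency."""
    return throughput_rps * latency_s


def mm1_mean_latency(service_time_s: float, utilization: float) -> float:
    """Mean time in an M/M/1 system: W = S / (1 - rho). Idealized model."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


# 2,000 rps at 50 ms average latency means roughly 100 requests in flight.
print(f"in flight: {littles_law_concurrency(2000, 0.050):.0f}")

# Assumed 10 ms service time: watch mean latency bend upward with utilization.
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    latency_ms = mm1_mean_latency(0.010, rho) * 1000
    print(f"utilization={rho:.2f}  mean latency={latency_ms:.0f} ms")
```

In this model, going from 50% to 80% utilization raises mean latency from 20 ms to 50 ms, and going from 90% to 99% raises it from 100 ms to 1 s. Real systems differ in detail, but the shape of the curve is the point.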
The tradeoff
Batching and queuing raise throughput by amortizing fixed costs, but each request now waits for its batch to fill, so latency goes up. Chasing ultra-low latency means running at low utilization (idle headroom), which wastes capacity. Pick per workload: a payment path optimizes p99 latency; a nightly ETL optimizes throughput.
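The batching tradeoff can be sketched with a toy cost model. Everything here is an assumption for illustration: a fixed per-batch cost of 5 ms, a per-item cost of 0.1 ms, a steady arrival rate of 100 requests per second, and no queueing behind earlier batches.

```python
def batch_stats(batch_size: int, arrival_rps: float,
                fixed_cost_s: float, per_item_cost_s: float) -> tuple[float, float]:
    """Return (max throughput in rps, rough mean latency in seconds) for a toy model."""
    # One batch pays the fixed cost once, plus a per-item cost for each request.
    service_s = fixed_cost_s + batch_size * per_item_cost_s
    max_throughput = batch_size / service_s
    # With evenly spaced arrivals, the i-th request in a batch waits for the
    # remaining (batch_size - i) arrivals; the average wait is (B - 1) / (2 * rate).
    mean_fill_wait_s = (batch_size - 1) / (2 * arrival_rps)
    return max_throughput, mean_fill_wait_s + service_s


for size in (1, 10, 100):
    throughput, latency = batch_stats(size, arrival_rps=100.0,
                                      fixed_cost_s=0.005, per_item_cost_s=0.0001)
    print(f"batch={size:3d}  max throughput={throughput:7.0f} rps  "
          f"mean latency={latency * 1000:6.1f} ms")
```

Under these assumptions, a batch size of 1 tops out near 196 rps at about 5 ms per request, while a batch size of 100 reaches roughly 6,667 rps but pushes mean latency to about 510 ms, most of it spent waiting for the batch to fill.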
In the wild
A bank transfer cares about latency; a nightly analytics job cares about throughput.
Interview deep dive
Flow
- Define the SLI: which percentile at which load matters.
- Measure under realistic concurrency, not single-threaded.
- Plot latency vs utilization and find the knee near 70–80%.
- Add capacity or shed load before the knee, not after.
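One rough way to locate the knee from the flow above is a threshold heuristic: take p99 measurements at increasing utilization and report the first point where p99 exceeds some multiple of the lightly loaded baseline. This is a sketch, not a standard algorithm; the `find_knee` name, the factor of 3, and the sample data are all hypothetical.

```python
def find_knee(samples, factor=3.0):
    """Return the first utilization whose p99 exceeds factor x the lowest-load p99.

    samples: iterable of (utilization, p99_latency_ms) pairs.
    Returns None if no measurement crosses the threshold.
    """
    ordered = sorted(samples)      # sort by utilization
    baseline = ordered[0][1]       # p99 at the lightest measured load
    for utilization, p99_ms in ordered:
        if p99_ms > factor * baseline:
            return utilization
    return None


# Hypothetical load-test results: (utilization, p99 in ms).
measurements = [(0.1, 11), (0.3, 14), (0.5, 20), (0.7, 33), (0.8, 50), (0.9, 100)]
print(find_knee(measurements))
```

With this made-up data the threshold is first crossed at 0.8, so capacity would need to be added, or load shed, before utilization reaches that point.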
Watch for
- Averages hide the tail; always report p99/p999.
- Past ~80% utilization, latency rises non-linearly.
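- Coordinated omission makes naive load tests undercount the tail: a load generator that waits for each response before sending the next one stops sending during a stall, so the slowest periods are undersampled.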
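The last two points can be illustrated together. The sketch below uses hypothetical data from a closed-loop tester that intends to send one request every 10 ms but stalls whenever a response is slow. The back-fill step follows a common correction idea: for each stalled request, add the latencies that the requests never sent would have seen. The helper names are illustrative.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def backfill_omitted(latencies_ms, expected_interval_ms):
    """Add the samples a stalled closed-loop tester never sent."""
    corrected = []
    for latency in latencies_ms:
        corrected.append(latency)
        # A request due one interval into the stall would have waited this long.
        missed = latency - expected_interval_ms
        while missed >= expected_interval_ms:
            corrected.append(missed)
            missed -= expected_interval_ms
    return corrected


# Hypothetical run: 995 requests at 5 ms and 5 one-second stalls.
observed = [5.0] * 995 + [1000.0] * 5
corrected = backfill_omitted(observed, expected_interval_ms=10.0)

print("mean (observed):", statistics.mean(observed))
print("p99 (observed): ", percentile(observed, 99))
print("p99 (corrected):", percentile(corrected, 99))
```

On this data the observed mean is about 10 ms and the observed p99 is 5 ms, so both look healthy. After back-filling, p99 is 980 ms, which is much closer to what users arriving during the stalls would have experienced.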
Interviewer trap
State the percentile and the load it holds at — "p99 < 100ms at 5k rps" beats "it's fast".