Latency vs throughput

People say "make it fast" but mean two different things that you optimize in opposite ways.

The problem

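"Make it fast" can mean lower latency or higher throughput. The two are tuned differently, and pushing hard on one usually costs you the other, so the first step is to pin down which one is being asked for.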

The idea

Latency = how long one request takes. Throughput = how many requests per second you can handle. They are not the same, and improving one does not automatically improve the other.

How it works

Latency is per-request time, read in percentiles (p50/p99/p999) because the tail is what users feel. Throughput is the number of requests per second the system sustains. Little's Law ties them together: average concurrency (requests in flight) = throughput × average latency. Queueing theory adds the sting: as utilization climbs past ~70–80%, queueing delay grows non-linearly and becomes unbounded as utilization approaches 100%, so the last slice of capacity is exactly where tails explode.
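
As a rough illustration of the two relationships above, the Python sketch below computes Little's Law directly and uses the textbook M/M/1 queue (one server, random arrivals and service times) as a stand-in for the queueing behavior. The M/M/1 model, the function names, and the 10 ms service time are assumptions chosen for illustration; a real system will trace a different curve, though usually one with the same shape.

```python
def in_flight_requests(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average concurrency = throughput x average latency."""
    return throughput_rps * latency_s


def mm1_mean_latency(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho).

    Assumption: a single server with Poisson arrivals and exponentially
    distributed service times. Used here only to show the shape of the curve.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # 5,000 requests/s at 20 ms average latency.
    print(f"in flight: {in_flight_requests(5000, 0.020):.0f}")

    # Hypothetical 10 ms service time at increasing utilization.
    for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
        latency_ms = mm1_mean_latency(0.010, rho) * 1000
        print(f"utilization {rho:.2f} -> mean latency {latency_ms:.0f} ms")
```

Under this model, 5,000 rps at 20 ms means about 100 requests in flight, and a 10 ms service time turns into roughly 20 ms at 50% utilization, 50 ms at 80%, 100 ms at 90%, and 1 s at 99%.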

The tradeoff

Batching and queueing raise throughput by amortizing fixed costs, but each request now waits for its batch, which means higher latency. Chasing ultra-low latency means running at low utilization (idle headroom), which leaves capacity you are paying for unused. Pick per workload: a payment path optimizes p99 latency; a nightly ETL optimizes throughput.
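
A minimal sketch of the batching side of that tradeoff, assuming a single saturated worker where every batch pays a fixed overhead (for example one round trip or one commit) plus a per-item cost. The cost model, the function names, and the 5 ms / 1 ms figures are hypothetical. The latency shown is only the time to process the batch; it ignores the time an item spends waiting for the batch to fill, which would make it worse.

```python
def batch_service_time(batch_size: int, fixed_cost_s: float, per_item_s: float) -> float:
    """Time to process one batch: fixed overhead plus per-item work."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_cost_s + batch_size * per_item_s


def batch_throughput(batch_size: int, fixed_cost_s: float, per_item_s: float) -> float:
    """Items per second for a worker that is always busy processing batches."""
    return batch_size / batch_service_time(batch_size, fixed_cost_s, per_item_s)


if __name__ == "__main__":
    FIXED_COST_S = 0.005  # hypothetical per-batch overhead
    PER_ITEM_S = 0.001    # hypothetical per-item cost

    for size in (1, 10, 100):
        latency_ms = batch_service_time(size, FIXED_COST_S, PER_ITEM_S) * 1000
        rate = batch_throughput(size, FIXED_COST_S, PER_ITEM_S)
        print(f"batch={size:>3}  throughput={rate:6.1f}/s  batch latency={latency_ms:5.1f} ms")
```

With these made-up costs, growing the batch from 1 to 100 items raises throughput from roughly 167 to roughly 952 items/s, while the time each item spends in its batch grows from 6 ms to 105 ms.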

In the wild

A bank transfer cares about latency; a nightly analytics job cares about throughput.

Interview deep dive

Flow

Define the SLI: which percentile at which load matters.
Measure under realistic concurrency, not single-threaded.
Plot latency vs utilization and find the knee near 70–80%.
Add capacity or shed load before the knee, not after.
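
The last two steps can be sketched in Python as follows. This is one possible heuristic, not a standard definition: it treats the knee as the first measured utilization where p99 exceeds a multiple of the lowest-load p99, and it flags a system for scaling once it gets within a margin of that knee. The function names, the factor of 2, the 10-point margin, and the sample curve are all invented for illustration.

```python
from typing import Optional, Sequence, Tuple

# (utilization as a fraction, measured p99 latency in ms)
Point = Tuple[float, float]


def find_knee(curve: Sequence[Point], factor: float = 2.0) -> Optional[float]:
    """Return the first utilization whose p99 exceeds factor x the baseline p99.

    The baseline is the p99 at the lowest measured utilization.
    Returns None if the curve never crosses that threshold.
    """
    points = sorted(curve)
    if not points:
        raise ValueError("curve must contain at least one point")
    baseline_p99 = points[0][1]
    for utilization, p99_ms in points:
        if p99_ms > factor * baseline_p99:
            return utilization
    return None


def should_act(current_utilization: float, knee: float, margin: float = 0.10) -> bool:
    """True when utilization is within `margin` of the knee (or past it)."""
    return current_utilization >= knee - margin


if __name__ == "__main__":
    # Made-up measurements from a hypothetical load test.
    curve = [(0.3, 12.0), (0.5, 15.0), (0.6, 18.0), (0.7, 24.0), (0.8, 41.0), (0.9, 95.0)]
    knee = find_knee(curve)
    print("knee at utilization:", knee)
    if knee is not None:
        print("act at 72% utilization?", should_act(0.72, knee))
        print("act at 50% utilization?", should_act(0.50, knee))
```

The margin is the point of the exercise: the decision to add capacity or shed load is taken while latency still looks healthy, because by the time the knee is reached the tail has already degraded.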

Watch for

Averages hide the tail; always report p99/p999.
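Past ~70–80% utilization, latency rises steeply and non-linearly.
Coordinated omission makes naive load tests understate the tail: a load generator that waits for each response before sending the next one sends fewer requests during a stall, so the slow period is undersampled.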
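
To make the first and last points concrete, here is a small Python sketch. It uses a nearest-rank percentile and a back-fill correction in the style popularized by HdrHistogram, where a stalled request is assumed to have delayed every request that should have been sent during the stall. The function names, the 10 ms send interval, and the sample data are assumptions for illustration only.

```python
import math
from statistics import mean
from typing import List, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def correct_coordinated_omission(samples: Sequence[float], expected_interval: float) -> List[float]:
    """Back-fill the samples a stalled load generator never sent.

    For each latency longer than the intended send interval, add synthetic
    samples (latency - interval, latency - 2 * interval, ...) standing in for
    the requests that would have been issued while the generator was blocked.
    """
    corrected = list(samples)
    for latency in samples:
        missing = latency - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


if __name__ == "__main__":
    # Averages hide the tail: 98 fast requests and 2 slow ones (values in ms).
    latencies_ms = [10] * 98 + [1000] * 2
    print("mean:", mean(latencies_ms), "p50:", percentile(latencies_ms, 50),
          "p99:", percentile(latencies_ms, 99))

    # Coordinated omission: one request every 10 ms, with a single 1 s stall.
    naive = [10] * 99 + [1000]
    corrected = correct_coordinated_omission(naive, expected_interval=10)
    print("naive p99:", percentile(naive, 99))
    print("corrected p99:", percentile(corrected, 99))
```

In the first data set the mean is 29.8 ms while p99 is 1000 ms. In the second, the naive p99 is 10 ms because the one-second stall shows up as a single sample; after back-filling the 99 requests that were never sent, p99 becomes 990 ms.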

Interviewer trap

Saying "it's fast" without numbers. State the percentile and the load it holds at: "p99 < 100ms at 5k rps" beats "it's fast".

Part of Academy on SystemLore: system design interview prep with 148 deep topics, interactive diagrams, and a practice game.