LLM Serving / Inference
Serve LLM completions to many users with high throughput, low latency, and bounded cost.
Requirements
Functional
- Stream token output
- Batch concurrent requests
- Prompt/KV caching
- Model routing
Non-functional
- High GPU utilization
- Low time-to-first-token
- Predictable tail latency
Scale
Thousands of concurrent streams
The approach
A gateway queues requests; an inference server uses continuous batching to pack many sequences onto the GPU, with a KV-cache so prior tokens aren't recomputed. Stream tokens to clients to cut perceived latency. Cache static prompt prefixes; route easy requests to smaller/cheaper models. Autoscale GPU workers on queue depth.
Key components
Gateway/queue · continuous-batching inference server · KV cache · prompt cache · model router · GPU autoscaler
Numbers that matter
- Continuous batching keeps the GPU full vs static batching's idle gaps — the single biggest throughput win.
- KV-cache memory (batch × sequence length), not FLOPs, is often the capacity ceiling; paged attention (vLLM) cuts the waste.
- Decode is serial (~tens of ms per output token) — time-to-first-token + streaming dominate perceived latency.
- Cache static prompt prefixes by hash to skip recompute; route easy requests to smaller, cheaper models.
Senior deep-dive
Output tokens are generated serially and dominate latency — stream them to hide it and optimize time-to-first-token.
Continuous batching, not static batching, is the throughput unlock: it admits and evicts sequences mid-flight to keep the GPU full.
KV-cache memory, not compute, is usually the capacity limit. Right-size the model per task instead of defaulting to the largest.
Decode is serial — that is where latency lives
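Prefill (reading the prompt) processes all prompt tokens in parallel, so it is fast per token, though a long prompt still adds to time-to-first-token. Decode emits one token at a time (~tens of ms each), so for long outputs it dominates wall-clock time. Stream tokens so the user sees output as soon as the first token is ready, and optimize time-to-first-token: perceived speed is set by TTFT far more than by total completion time.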
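The arithmetic behind this can be sketched in a few lines. This is a minimal model, assuming a fixed per-token decode time; the function names and the figures (300 ms TTFT, 30 ms per token, 500 output tokens) are illustrative assumptions, not measurements.

```python
def completion_latency_ms(ttft_ms: float, output_tokens: int, ms_per_token: float) -> float:
    """Total wall-clock time: the first token arrives at TTFT,
    then each later token costs one serial decode step."""
    if output_tokens < 1:
        raise ValueError("need at least one output token")
    return ttft_ms + (output_tokens - 1) * ms_per_token


def first_visible_output_ms(ttft_ms: float, output_tokens: int,
                            ms_per_token: float, streaming: bool) -> float:
    """When the user first sees anything: at TTFT if streaming,
    otherwise only once the whole completion is done."""
    if streaming:
        return ttft_ms
    return completion_latency_ms(ttft_ms, output_tokens, ms_per_token)


# Hypothetical figures, for illustration only.
total = completion_latency_ms(ttft_ms=300.0, output_tokens=500, ms_per_token=30.0)
streamed = first_visible_output_ms(300.0, 500, 30.0, streaming=True)
buffered = first_visible_output_ms(300.0, 500, 30.0, streaming=False)
print(total, streamed, buffered)
```

Under these assumed numbers the completion takes about 15 seconds either way, but a streaming client sees output after 0.3 seconds.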
Continuous batching keeps the GPU full
Static batching forms a batch up front and runs it until its longest sequence finishes: slots freed by short sequences sit idle, and new requests wait behind the whole batch (head-of-line blocking). Continuous batching admits and evicts sequences mid-flight, refilling finished slots at the next decode step: the single biggest throughput win. It is a core feature of vLLM, TGI, and TensorRT-LLM, so don't hand-roll static batching.
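A toy simulation makes the difference concrete. It is a sketch under simplifying assumptions: every decode step costs the same regardless of how many slots are occupied, prefill is ignored, and the request lengths are made up. It is not how any particular serving stack schedules work.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One decode step emits one token for every active sequence.
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical output lengths (tokens) for eight requests, four GPU slots.
lengths = [40, 10, 10, 10, 40, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With this workload the same 140 tokens need 80 decode steps under static batching and 50 under continuous batching, because short sequences no longer hold slots hostage until the long one finishes.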
The KV cache is the real capacity limit
Every active sequence holds a KV cache that grows with its length, so total KV memory scales with batch × sequence length. That memory, not FLOPs, usually caps how many requests fit. Paged attention (vLLM) stores the cache in fixed-size pages to cut fragmentation so you pack in more sequences. When you "run out of room," it is almost always KV memory.
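A back-of-the-envelope sizing helps here. The sketch below uses the standard per-token estimate (keys plus values, for every layer and KV head). The model dimensions, the 40 GiB budget, and the 16-token page size are hypothetical values chosen for illustration.

```python
import math


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes for one token: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the memory left after weights."""
    return kv_budget_bytes // bytes_per_token


def paged_tokens_reserved(seq_len: int, block_size: int = 16) -> int:
    """Token slots held by a sequence when memory is allocated in fixed pages."""
    return math.ceil(seq_len / block_size) * block_size


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 40 * 1024**3                      # assume 40 GiB is free for KV cache
tokens = max_cached_tokens(budget, per_token)
seqs_at_4k = tokens // 4096                # concurrent 4096-token sequences

# A 500-token sequence: reserving a full 4096-token window vs paging.
contiguous_waste = 4096 - 500
paged_waste = paged_tokens_reserved(500) - 500
print(per_token, tokens, seqs_at_4k, contiguous_waste, paged_waste)
```

Under these assumptions each token costs 128 KiB of cache, so the budget holds about 80 full-length sequences. Paging wastes 12 token slots on the 500-token sequence where a full-window reservation would waste 3596.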
Cache prefixes and route by difficulty
Static prompt prefixes (system prompts, few-shot) are identical across requests — cache them by hash and skip recompute. Route easy requests to smaller, cheaper models and reserve the big model for hard ones. Both cut cost without touching quality on the requests that don't need it.
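Both ideas can be sketched together. Everything here is an assumption made for illustration: `PrefixCache`, `fake_prefill`, the model names, and the length-based routing rule are hypothetical. Production stacks typically key the cache on blocks of token IDs rather than raw text, and route with a classifier or heuristics richer than prompt length.

```python
import hashlib


class PrefixCache:
    """Maps a hash of the static prefix to its precomputed state."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def get_or_compute(self, prefix: str, compute):
        k = self.key(prefix)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = compute(prefix)  # pay for prefill only once
        return self._store[k]


def route(prompt: str, max_small_chars: int = 200) -> str:
    """Hypothetical difficulty heuristic: short prompts go to the small model."""
    return "small-model" if len(prompt) <= max_small_chars else "large-model"


prefill_calls = []


def fake_prefill(prefix: str) -> str:
    """Stand-in for the expensive prefill pass over the prefix."""
    prefill_calls.append(prefix)
    return f"kv-state-for-{len(prefix)}-chars"


cache = PrefixCache()
system_prompt = "You are a helpful assistant."
first = cache.get_or_compute(system_prompt, fake_prefill)
second = cache.get_or_compute(system_prompt, fake_prefill)
print(len(prefill_calls), cache.hits, cache.misses, route("hi"))
```

The second request with the same system prompt reuses the stored state, so the stand-in prefill runs once.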
Speculative decoding buys latency
A small draft model proposes several tokens; the big model verifies them in one pass — accepted tokens come nearly free, cutting latency when the draft is often right. It trades extra compute for speed; worth it for latency-sensitive, predictable workloads.
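The propose-then-verify loop can be sketched with toy models. This is the greedy, exact-match variant, not the probabilistic acceptance rule used with sampling. `draft_next` and `target_next` are hypothetical stand-ins that map a token list to the next token, and the verification loop below stands in for what would be a single batched forward pass of the big model.

```python
def speculative_step(context, draft_next, target_next, k):
    """One round: draft proposes k tokens, target keeps the matching prefix."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(context)
    for t in proposal:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))  # all k accepted: one bonus token
    return accepted


def generate(context, draft_next, target_next, k, n_tokens):
    """Returns the first n_tokens generated and the number of target passes."""
    out, passes = list(context), 0
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        passes += 1
    start = len(context)
    return out[start:start + n_tokens], passes


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = generate([0], draft_next, target_next, k=4, n_tokens=12)
print(tokens, passes)
```

In this toy run the 12 tokens match what the target alone would produce, but take 3 verification rounds instead of 12 serial decode steps. The saving depends entirely on how often the draft agrees with the target.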
What breaks at scale
Thousands of concurrent streams make autoscaling on queue depth and predictable tail latency the hard problems: a few long generations can starve everyone. GPU memory (KV cache) is the scaling unit, so plan capacity on total concurrent tokens (sequences × length), not request count, and isolate long jobs from interactive traffic.
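These three controls can be sketched as small policy functions. All names, thresholds, and limits below are hypothetical, and the admission check uses each request's worst-case footprint (prompt plus maximum new tokens), which is a conservative assumption rather than how any specific scheduler accounts for memory.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 64) -> int:
    """Scale GPU workers so each one faces roughly target_per_replica queued requests."""
    want = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, want))


def classify_pool(max_new_tokens: int, long_threshold: int = 2048) -> str:
    """Send long generations to a separate pool so they cannot starve chat traffic."""
    return "batch" if max_new_tokens > long_threshold else "interactive"


def admit(requests, kv_token_budget: int):
    """Admit (prompt_tokens, max_new_tokens) pairs in FIFO order while the
    worst-case KV footprint still fits the token budget."""
    admitted, used = [], 0
    for prompt_tokens, max_new_tokens in requests:
        need = prompt_tokens + max_new_tokens
        if used + need > kv_token_budget:
            break  # keep FIFO order: stop instead of skipping ahead
        admitted.append((prompt_tokens, max_new_tokens))
        used += need
    return admitted, used


replicas = desired_replicas(queue_depth=250, target_per_replica=40)
queue = [(1000, 500), (2000, 1000), (3000, 4000)]
admitted, used = admit(queue, kv_token_budget=6000)
print(replicas, classify_pool(8000), len(admitted), used)
```

Counting tokens rather than requests is the point: the third request alone needs more cache than the first two combined, so a request-count limit would have admitted it and run out of memory.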
In production
vLLM (paged attention + continuous batching), TGI, TensorRT-LLM, and SGLang are the standard serving stacks. The techniques — continuous batching, paged KV cache, prefix caching, speculative decoding — are now table stakes for cost-efficient inference.
Common mistakes
- Static batching → idle GPU and head-of-line blocking
- No streaming → bad perceived latency
- Recomputing keys and values for prior tokens on every decode step instead of reusing the KV cache
- One huge model for every request