ChatGPT / LLM Serving
Serve LLM completions with streaming tokens, at scale, on scarce GPUs.
Requirements
Functional
- Chat/completions API
- Token streaming
- Conversation context
- Rate limits/quotas
Non-functional
- Low time-to-first-token
- GPU efficiency
- Graceful overload
Scale
GPU-bound; bursty
The approach
API gateway → request queue → GPU inference workers doing continuous/dynamic batching; stream tokens back over SSE/WS; KV-cache for context; autoscale GPU pool; quota & rate limits up front.
Key components
Gateway (auth/limit) → queue → batched GPU workers → token stream · context store
Numbers that matter
- Decode is serial — ~tens of ms per output token — so time-to-first-token + streaming drive perceived speed far more than total completion time.
- KV-cache memory (batch × context length), not FLOPs, usually caps concurrency — paged attention (vLLM) cuts fragmentation so more sequences fit per GPU.
- Continuous batching can multiply throughput several-fold over static batching by never letting a finished slot sit idle.
- Overload is inevitable on scarce GPUs — queue with admission control and shed/deprioritize with backpressure; blocking synchronously just collapses under burst.
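The first bullet's latency claim can be checked with simple arithmetic. The sketch below assumes a fixed time-to-first-token and a constant per-token decode time; the function name and all numbers are illustrative assumptions, not measurements.

```python
def completion_latency_s(ttft_s: float, per_token_s: float, output_tokens: int) -> float:
    """Total wall-clock time: the first token costs ttft_s, each later token per_token_s."""
    if output_tokens <= 0:
        return 0.0
    return ttft_s + (output_tokens - 1) * per_token_s


# Illustrative numbers only (assumed): 0.4 s TTFT, 30 ms per token, 500 output tokens.
total = completion_latency_s(ttft_s=0.4, per_token_s=0.03, output_tokens=500)
print(f"first token after 0.40 s, full answer after {total:.2f} s")
```

With these assumed numbers the user waits about 15 seconds for the full answer but well under a second for the first token, which is why streaming changes perceived speed so much.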
Senior deep-dive
GPUs are the scarce resource, so batching is everything — the entire design optimizes GPU utilization.
Continuous batching packs many requests into each forward pass, admitting and evicting sequences mid-flight; streaming tokens hides the serial decode latency.
Under overload you queue and shed, never block — and a KV-cache keeps you from recomputing the context every token.
Continuous batching is the throughput unlock
Static batching assembles a batch and runs it until the longest sequence finishes, so slots whose sequences ended early sit idle; continuous batching admits and evicts sequences mid-flight, refilling finished slots at the next decode step. On scarce GPUs this is the single biggest win, and it is a core feature of serving engines such as vLLM and TGI. Don't hand-roll static batching.
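A toy simulation makes the difference concrete. It is a minimal sketch, not how any real engine schedules: it counts decode steps only, ignores prefill cost and memory limits, and assumes every sequence needs at least one output token. The function names and the workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a finished slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining output tokens for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One forward pass produces one token for every active sequence.
        steps += 1
        # Drop sequences that just emitted their last token.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: two long answers mixed with short ones, 4 slots.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, slots=4))      # 200
print(continuous_batching_steps(lengths, slots=4))  # 105
```

In this toy workload the second long request starts after five steps instead of waiting for the first batch to drain, which is where the saving comes from.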
Stream tokens — decode is serial
Output tokens are generated one at a time, so total latency grows with response length. Stream each token over SSE/WebSocket so the user sees output immediately, and optimize time-to-first-token: perceived speed is set mostly by TTFT and a steady token rate, not by total completion time.
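As a rough illustration of the SSE side, the sketch below wraps each generated token in a Server-Sent Events frame (a `data:` line followed by a blank line). The `fake_decode` generator, the JSON payload shape, and the `[DONE]` end marker are assumptions made for this example; a real service would plug in its own decode loop and wire format.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE frame: a 'data:' line followed by a blank line."""
    for token in tokens:
        # JSON-encode so a newline inside a token cannot break the framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # End-of-stream marker; the payload is this sketch's own convention.
    yield "data: [DONE]\n\n"


def fake_decode() -> Iterator[str]:
    """Stand-in for a model's decode loop (hypothetical)."""
    yield from ["Hello", ",", " world"]


frames = list(sse_events(fake_decode()))
for frame in frames:
    print(frame, end="")
```

Because `sse_events` is a generator, each frame can be flushed to the client as soon as its token exists rather than after the whole completion.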
The KV-cache is the capacity limit
Each active sequence holds a KV-cache that grows with context length — and it, not compute, usually caps how many requests fit on a GPU. Paged attention stores it in pages to cut fragmentation. Reuse the cache across a conversation's turns instead of recomputing context each call.
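To see why memory rather than compute sets the ceiling, it helps to size the cache. The sketch below uses the standard per-token estimate for a transformer with full attention (keys plus values, for every layer and KV head). The model shape and the 40 GiB budget are hypothetical values picked for round numbers.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Keys and values (factor 2) for every layer and KV head, for one token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(free_bytes: int, context_tokens: int, per_token: int) -> int:
    """How many sequences of a given context length fit in the memory left for KV-cache."""
    return free_bytes // (context_tokens * per_token)


# Hypothetical model shape and memory budget, chosen only for illustration.
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2)
free = 40 * 1024**3  # assume 40 GiB remains after loading the weights

print(per_token)                                         # 524288 bytes = 0.5 MiB per token
print(max_concurrent_sequences(free, 4096, per_token))   # 20
```

Under these assumptions a 4096-token context costs 2 GiB of cache, so only 20 such sequences fit at once. This estimate assumes every sequence reserves its full context; paging allocates blocks on demand, so shorter sequences leave room for more.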
Overload control: queue and shed
GPUs can't autoscale in milliseconds, so bursts must queue with admission control; beyond a limit, shed or deprioritize and signal backpressure. Blocking synchronously under load collapses the whole tier — graceful degradation (slower, or "try again later") beats timeouts for everyone.
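One way to sketch "queue and shed" is a bounded queue whose submit path never blocks. The class name, the status codes, and the retry hint below are assumptions for illustration; the point is only that a full queue produces an immediate backpressure signal instead of a hung request.

```python
import queue
from typing import Optional


class AdmissionController:
    """Bounded queue: accept while there is room, otherwise shed immediately."""

    def __init__(self, max_depth: int):
        self._q = queue.Queue(maxsize=max_depth)

    def submit(self, request_id: str) -> dict:
        try:
            self._q.put_nowait(request_id)  # never block the caller
        except queue.Full:
            # Backpressure signal; 429 plus a retry hint is one common choice.
            return {"status": 429, "retry_after_s": 5}
        return {"status": 202, "position": self._q.qsize()}

    def next_request(self) -> Optional[str]:
        """Called by a GPU worker when it has a free slot."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


ctl = AdmissionController(max_depth=2)
print(ctl.submit("a"))  # accepted
print(ctl.submit("b"))  # accepted
print(ctl.submit("c"))  # shed: queue is full
```

A production version would likely add priorities and per-tenant limits; this sketch only shows the non-blocking accept-or-shed decision.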
Quotas, routing, and cost
Rate-limit and quota at the gateway — GPU time is the budget — and route by difficulty: small/cheap models for easy requests, the big model for hard ones. Cache common prompt prefixes (system prompts) to skip recompute. Cost per token is a first-class design constraint.
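Treating GPU time as a budget can be sketched with a token bucket that charges each request its estimated token cost. This is a minimal single-process illustration; the class name, capacity, and refill rate are assumptions, and a real gateway would keep this state per user in a shared store.

```python
import time


class TokenBucket:
    """Per-user budget measured in model tokens, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_s: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock  # injectable so the sketch can be tested without sleeping
        self._level = float(capacity)
        self._last = clock()

    def try_spend(self, cost: float) -> bool:
        """Charge `cost` tokens if the budget allows it; otherwise reject."""
        now = self._clock()
        elapsed = now - self._last
        self._level = min(self.capacity, self._level + elapsed * self.refill_per_s)
        self._last = now
        if cost > self._level:
            return False
        self._level -= cost
        return True


# Hypothetical quota: 1000-token burst, refilled at 100 tokens per second.
bucket = TokenBucket(capacity=1000, refill_per_s=100)
print(bucket.try_spend(800))  # True: within the burst allowance
print(bucket.try_spend(300))  # False: only about 200 tokens remain
```

Charging by estimated tokens rather than by request count keeps one long prompt from costing the same as one short one.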
What breaks at scale
The limits are GPU memory (KV-cache), queue depth under burst, and tail latency from long generations that hog a slot. Autoscale the GPU pool on queue depth (slowly), isolate long jobs from interactive traffic, and consider speculative decoding to cut latency. The scarce GPU dominates every decision.
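Scaling "on queue depth (slowly)" can be sketched as a target-tracking rule with a cooldown between changes. The class name, the target of queued requests per worker, and the cooldown length are all hypothetical; the sketch only decides a pool size and does not start or stop any machines.

```python
import math


class QueueDepthAutoscaler:
    """Pick a GPU pool size from queue depth, changing it at most once per cooldown."""

    def __init__(self, target_per_worker: int, min_workers: int,
                 max_workers: int, cooldown_s: float):
        self.target_per_worker = target_per_worker
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.cooldown_s = cooldown_s
        self._last_change = -math.inf

    def decide(self, now_s: float, current_workers: int, queue_depth: int) -> int:
        # Workers needed so each has roughly target_per_worker queued requests.
        wanted = math.ceil(queue_depth / self.target_per_worker)
        wanted = max(self.min_workers, min(self.max_workers, wanted))
        # Hold steady inside the cooldown window so bursts do not cause flapping.
        if wanted == current_workers or now_s - self._last_change < self.cooldown_s:
            return current_workers
        self._last_change = now_s
        return wanted


scaler = QueueDepthAutoscaler(target_per_worker=8, min_workers=2,
                              max_workers=10, cooldown_s=300)
print(scaler.decide(now_s=0, current_workers=2, queue_depth=40))   # 5
print(scaler.decide(now_s=60, current_workers=5, queue_depth=200)) # 5 (cooling down)
```

The cooldown is the "slowly" part: new GPU workers take a long time to become useful, so the admission queue has to absorb the burst while the pool catches up.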
In production
Open serving stacks such as vLLM, TGI, and TensorRT-LLM implement this shape: GPU workers doing continuous batching with a paged KV-cache, fronted by a gateway with auth + quotas and a request queue. Hosted APIs from OpenAI and Anthropic expose the same edge behavior to clients (rate limits, quotas, SSE token streaming). The scarce, expensive GPU is the dominant constraint, and every choice serves utilization.
Common mistakes
- One request per GPU pass (no batching)
- Blocking instead of queueing under load
- Recomputing context every token (use KV-cache)