System Design Library

ChatGPT / LLM Serving

Serve LLM completions with streaming tokens, at scale, on scarce GPUs.

Requirements

Functional

  • Chat/completions API
  • Token streaming
  • Conversation context
  • Rate limits/quotas

Non-functional

  • Low time-to-first-token
  • GPU efficiency
  • Graceful overload

Scale

GPU-bound; bursty

The approach

API gateway → request queue → GPU inference workers doing continuous/dynamic batching; stream tokens back over SSE/WS; KV-cache for context; autoscale GPU pool; quota & rate limits up front.

Key components

Gateway (auth/limit) → queue → batched GPU workers → token stream · context store

Senior deep-dive

GPUs are the scarce resource, so batching is everything — the entire design optimizes GPU utilization.

Continuous batching packs many requests into each forward pass, admitting and evicting sequences mid-flight; streaming tokens hides the serial decode latency.

Under overload you queue and shed, never block — and a KV-cache keeps you from recomputing the context every token.

Continuous batching is the throughput unlock

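Static batching runs a fixed batch until its longest sequence finishes, so slots freed by short responses sit idle and new requests wait for the next batch. Continuous batching admits and evicts sequences mid-flight, refilling a finished slot at the next decode step. On scarce GPUs this is usually the single biggest throughput win, and it is a core feature of serving engines such as vLLM and TGI. Don't hand-roll static batching.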
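The difference can be illustrated with a toy simulation. This is a minimal sketch, not how any particular engine schedules work: the function names are hypothetical, each "step" stands for one decode forward pass, and prefill cost and KV-cache memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # short sequences idle until the longest one ends
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting sequences into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass emits one token for every active sequence.
        steps += 1
        # Sequences that just produced their last token leave the batch.
        running = [left - 1 for left in running if left > 1]
    return steps


# Output lengths in tokens: a few long generations mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))      # prints 200
print(continuous_batching_steps(lengths, batch_size=4))  # prints 105
```

In this made-up workload the same eight requests finish in roughly half the forward passes, because the slots freed by short responses are reused while the long ones are still decoding.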

Stream tokens — decode is serial

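Output tokens are generated one at a time, so total latency grows roughly linearly with response length. Stream each token over SSE or WebSocket so the user sees output as soon as the first token is ready, and optimize time-to-first-token (TTFT). Streaming does not shorten total generation time, but perceived speed is set mostly by TTFT and a steady inter-token pace, not by total time.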
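As a rough sketch of the streaming side, the generator below frames each token as a Server-Sent Events message (a `data:` line followed by a blank line). The payload fields, the `fake_decode` stand-in for the model, and the closing `[DONE]` sentinel (a convention borrowed from OpenAI-style streaming APIs) are assumptions for illustration, not something the design prescribes.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE message: 'data: <json>' plus a blank line."""
    for index, token in enumerate(tokens):
        payload = json.dumps({"index": index, "token": token})
        yield f"data: {payload}\n\n"
    # Assumed end-of-stream marker so the client knows to stop reading.
    yield "data: [DONE]\n\n"


def fake_decode(text: str) -> Iterator[str]:
    """Hypothetical stand-in for the model: yields one 'token' per word."""
    for word in text.split(" "):
        yield word


events = list(sse_events(fake_decode("Hello from the model")))
print(events[0], end="")  # first message, sent as soon as the first token exists
print(len(events))        # prints 5 (four tokens plus the end marker)
```

Because `sse_events` is a generator, a web framework that accepts an iterator as a response body can flush each message as it is produced instead of waiting for the full completion.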

The KV-cache is the capacity limit

Each active sequence holds a KV-cache that grows linearly with context length, and it, not compute, usually caps how many requests fit on a GPU. PagedAttention stores the cache in fixed-size blocks (pages) allocated on demand, which cuts fragmentation. Reuse the cache across a conversation's turns instead of recomputing the context on each call.
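A back-of-envelope calculation shows why memory is the cap. The sketch below uses the standard per-token estimate (one key and one value vector per layer per KV head); the model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 40 GiB cache budget, and the 16-token block size are illustrative assumptions, not figures from this design.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache bytes one token occupies across all layers."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(cache_bytes, context_tokens, bytes_per_token,
                             block_tokens=16):
    """How many sequences of a given length fit when memory is paged."""
    # Paged allocation rounds each sequence up to whole blocks.
    blocks_per_seq = -(-context_tokens // block_tokens)  # ceiling division
    bytes_per_seq = blocks_per_seq * block_tokens * bytes_per_token
    return cache_bytes // bytes_per_seq


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # prints 524288 (512 KiB per token)

budget = 40 * 1024**3  # assumed 40 GiB left for the cache after weights
print(max_concurrent_sequences(budget, 4096, per_token))  # prints 20
```

Under these assumptions a 4,096-token context costs 2 GiB per sequence, so only about 20 such conversations fit at once, regardless of how much compute is idle.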

Overload control: queue and shed

GPUs can't autoscale in milliseconds, so bursts must queue with admission control; beyond a limit, shed or deprioritize and signal backpressure. Blocking synchronously under load collapses the whole tier — graceful degradation (slower, or "try again later") beats timeouts for everyone.
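One way to picture "queue and shed" is a bounded queue that answers immediately instead of blocking. This is a minimal sketch under assumptions: the class and field names are hypothetical, the rejection status (429 here; 503 is also common) is a choice, and the retry hint is a crude estimate from queue depth times an assumed per-request service time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Decision:
    admitted: bool
    status: int                 # HTTP-style status to return to the caller
    retry_after_s: float = 0.0  # backpressure hint when the request is shed


class AdmissionQueue:
    """Bounded FIFO: admit while there is room, otherwise shed right away."""

    def __init__(self, max_depth: int, est_service_s: float):
        self.max_depth = max_depth
        self.est_service_s = est_service_s  # assumed average time per request
        self._queue = deque()

    def offer(self, request_id: str) -> Decision:
        if len(self._queue) >= self.max_depth:
            # Never block the caller: reject and say when to come back.
            wait = len(self._queue) * self.est_service_s
            return Decision(False, 429, retry_after_s=wait)
        self._queue.append(request_id)
        return Decision(True, 202)

    def take(self):
        """Called by a GPU worker when it has a free batch slot."""
        return self._queue.popleft() if self._queue else None


q = AdmissionQueue(max_depth=2, est_service_s=0.5)
print(q.offer("a").admitted, q.offer("b").admitted)  # prints True True
shed = q.offer("c")
print(shed.status, shed.retry_after_s)               # prints 429 1.0
```

A production version would likely add per-tenant priorities and deadlines, but the key property is the same: the gateway's answer time stays constant no matter how deep the backlog is.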

Quotas, routing, and cost

Rate-limit and quota at the gateway — GPU time is the budget — and route by difficulty: small/cheap models for easy requests, the big model for hard ones. Cache common prompt prefixes (system prompts) to skip recompute. Cost per token is a first-class design constraint.
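Since GPU time is the budget, a quota can be metered in model tokens rather than in requests. The token bucket below is a sketch under assumptions: the class name and the capacity and refill numbers are made up, and the clock is passed in explicitly to keep the example deterministic (a real gateway would more likely read `time.monotonic()`).

```python
class TokenBucket:
    """Per-tenant budget measured in model tokens, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.level = capacity  # start with a full budget
        self.updated = now

    def try_spend(self, cost: float, now: float) -> bool:
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_s)
        self.updated = now
        if cost > self.level:
            return False  # over quota: the gateway rejects before any GPU work
        self.level -= cost
        return True


bucket = TokenBucket(capacity=1000, refill_per_s=100)
print(bucket.try_spend(800, now=0.0))  # prints True  (200 tokens left)
print(bucket.try_spend(300, now=0.0))  # prints False (only 200 available)
print(bucket.try_spend(300, now=1.0))  # prints True  (refilled to 300)
```

The cost charged per request could be the prompt length plus the requested maximum output length, which ties the limit to the GPU work a request can actually cause.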

What breaks at scale

The limits are GPU memory (KV-cache), queue depth under burst, and tail latency from long generations that hog a slot. Autoscale the GPU pool on queue depth (slowly), isolate long jobs from interactive traffic, and consider speculative decoding to cut latency. The scarce GPU dominates every decision.
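The "autoscale on queue depth (slowly)" advice can be sketched as a small decision function. Everything here is an assumption for illustration: the function name, the target of queued requests per replica, the cooldown, and the choice to scale down one replica at a time so that a brief lull does not release GPUs that are slow to get back.

```python
import math


def desired_replicas(queue_depth, current, target_per_replica,
                     min_replicas, max_replicas,
                     seconds_since_last_scale, cooldown_s):
    """Pick a GPU replica count from queue depth, changing it only slowly."""
    wanted = math.ceil(queue_depth / target_per_replica)
    wanted = max(min_replicas, min(max_replicas, wanted))
    if wanted == current:
        return current
    # Cooldown: GPU nodes take minutes to start, so avoid flapping.
    if seconds_since_last_scale < cooldown_s:
        return current
    if wanted < current:
        return current - 1  # step down gradually
    return wanted           # step up straight to the target


# Burst: 120 queued requests, 4 replicas, cooldown already elapsed.
print(desired_replicas(120, 4, 10, 2, 8, 600, 300))  # prints 8
# Same burst, but the pool was resized a minute ago.
print(desired_replicas(120, 4, 10, 2, 8, 60, 300))   # prints 4
# Queue drained: release one replica at a time.
print(desired_replicas(0, 4, 10, 2, 8, 600, 300))    # prints 3
```

Because scaling lags the burst, this only complements the queueing and shedding described earlier; it does not replace them.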

In production

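Self-hosted stacks built on vLLM, TGI, or TensorRT-LLM follow this shape: a gateway with auth and quotas, a request queue, GPU workers doing continuous batching with a paged KV-cache, and SSE/WebSocket token streaming. Hosted APIs such as OpenAI's and Anthropic's show the same outward behavior (API keys, rate limits, streamed tokens), although their internals are not public. The scarce, expensive GPU is the dominant constraint, and nearly every choice serves utilization.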

Part of System Design Library on SystemLore.