Agentic AI Systems

LLM Serving / Inference

Serve LLM completions to many users with high throughput, low latency, and bounded cost.

Requirements

Functional

  • Stream token output
  • Batch concurrent requests
  • Prompt/KV caching
  • Model routing

Non-functional

  • High GPU utilization
  • Low time-to-first-token
  • Predictable tail latency

Scale

Thousands of concurrent streams

The approach

A gateway queues requests; an inference server uses continuous batching to pack many sequences onto the GPU, with a KV-cache so prior tokens aren't recomputed. Stream tokens to clients to cut perceived latency. Cache static prompt prefixes; route easy requests to smaller/cheaper models. Autoscale GPU workers on queue depth.

Key components

Gateway/queue · continuous-batching inference server · KV cache · prompt cache · model router · GPU autoscaler

Numbers that matter

Senior deep-dive

Output tokens are generated serially and dominate latency — stream them to hide it and optimize time-to-first-token.

Continuous batching, not static batching, is the throughput unlock: it admits and evicts sequences mid-flight to keep the GPU full.

KV-cache memory, not compute, is usually the capacity limit. Separately, right-size the model per task instead of defaulting to the largest.

Decode is serial — that is where latency lives

Prefill (reading the prompt) processes every prompt token in parallel, so it is cheap per token; decode emits one token at a time (~tens of ms each), so for any non-trivial output it dominates wall-clock time. Stream tokens so the user sees output as soon as the first token is ready, and optimize time-to-first-token (queue wait plus prefill): perceived speed is set by TTFT far more than by total completion time.
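
The arithmetic behind this can be sketched with a toy latency model. It is a simplification, not a benchmark: it assumes the prefill pass yields the first token and that every later token costs one fixed decode step. The function name and all the numbers are hypothetical, chosen only for illustration.

```python
def latency_ms(queue_ms, prefill_ms, per_token_ms, output_tokens):
    """Return (ttft_ms, total_ms) under a simple serial-decode model.

    Assumption: the prefill forward pass produces the first output token,
    and each remaining token costs one decode step of per_token_ms.
    """
    ttft = queue_ms + prefill_ms
    total = ttft + per_token_ms * (output_tokens - 1)
    return ttft, total


# Hypothetical numbers: 20 ms queue wait, 150 ms prefill, 30 ms per decode step.
ttft, total = latency_ms(queue_ms=20, prefill_ms=150, per_token_ms=30, output_tokens=500)
print(ttft, total)  # 170 15140
```

With these made-up inputs the user waits 170 ms for the first token when streaming, versus roughly 15 seconds if the server held the response until completion.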

Continuous batching keeps the GPU full

Static batching waits for a whole batch to fill and then runs it until the longest sequence finishes, so finished slots sit idle and new requests are head-of-line-blocked behind the slowest one. Continuous batching admits and evicts sequences mid-flight, refilling finished slots at the next decode step: the single biggest throughput win. It is a large part of why vLLM / TGI / TensorRT-LLM exist, so don't hand-roll static batching.
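
A minimal simulation can make the difference concrete. This is a sketch under strong assumptions, not how any real server schedules: each request is reduced to its number of output tokens, every decode step advances every active sequence by one token, and prefill cost is ignored. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when free slots are refilled before every step."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Advance every sequence by one token and evict the finished ones.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Two long generations mixed with six short ones, four slots on the GPU.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, slots=4))      # 200
print(continuous_batching_steps(lengths, slots=4))  # 105
```

In this toy workload the static scheduler spends 200 steps because each batch is held hostage by its 100-token member, while the continuous scheduler overlaps the two long sequences and finishes in 105.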

The KV cache is the real capacity limit

Every active sequence holds a KV cache that grows with its length, so total KV memory scales with batch × sequence length, and it, not FLOPs, usually caps how many requests fit. Paged attention (vLLM) stores the cache in fixed-size pages allocated on demand, cutting fragmentation so you pack in more sequences. When you "run out of room," it is almost always KV memory.
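
A back-of-the-envelope calculation shows why memory runs out first. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 32 KV heads, head dimension 128, fp16), the 40 GiB budget, and the page size of 16 tokens are illustrative assumptions, not figures from any particular deployment.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_bytes, bytes_per_token, tokens_per_seq):
    """How many sequences of a given length fit in the KV memory budget."""
    return kv_budget_bytes // (bytes_per_token * tokens_per_seq)


def paged_tokens_reserved(seq_len, block_size=16):
    """Tokens of capacity a paged allocator reserves: whole pages, rounded up."""
    blocks = -(-seq_len // block_size)  # ceiling division
    return blocks * block_size


# Assumed 7B-class shape in fp16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)                                               # 524288 bytes = 0.5 MiB
print(max_concurrent_sequences(40 * 2**30, per_token, 4096))   # 20
print(paged_tokens_reserved(100))                              # 112
```

Under these assumptions a 100-token sequence reserves 112 tokens of pages, whereas pre-allocating a contiguous 4096-token region for it would reserve all 4096; that gap is the fragmentation paging removes.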

Cache prefixes and route by difficulty

Static prompt prefixes (system prompts, few-shot examples) are identical across requests: cache their computed KV state keyed by a hash of the prefix and skip the recompute. Route easy requests to smaller, cheaper models and reserve the big model for hard ones. Both cut cost without hurting quality on the requests that don't need the big model.
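
Both ideas can be sketched in a few lines. Everything here is hypothetical and simplified: `PrefixCache` stores whatever the caller's `compute_kv` function returns instead of real GPU tensors, there is no eviction, and `route` uses prompt length as a stand-in for difficulty, where a production router would more likely use a classifier or task metadata.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache: maps a hash of the prefix tokens to its computed state."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(prefix_tokens):
        # Join with a separator so [1, 23] and [12, 3] hash differently.
        data = ",".join(map(str, prefix_tokens)).encode()
        return hashlib.sha256(data).hexdigest()

    def get_or_compute(self, prefix_tokens, compute_kv):
        k = self.key(prefix_tokens)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = compute_kv(prefix_tokens)  # the expensive prefill
        return self._store[k]


def route(prompt, max_small_words=50):
    """Hypothetical heuristic: short prompts go to the small model."""
    return "small-model" if len(prompt.split()) <= max_small_words else "large-model"


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]  # made-up token ids
cache.get_or_compute(system_prompt, lambda toks: ("kv-state", len(toks)))
cache.get_or_compute(system_prompt, lambda toks: ("kv-state", len(toks)))
print(cache.hits, cache.misses)  # 1 1
```

The second request with the same system prompt is a cache hit, so its prefill for those tokens is skipped.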

Speculative decoding buys latency

A small draft model proposes several tokens; the big model verifies them in one pass. Accepted tokens come nearly free, cutting latency when the draft is often right, and rejected ones are discarded, so the big model still has the final say on every token. It trades extra compute for speed, which makes it worth it for latency-sensitive, predictable workloads.
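
The propose-then-verify loop can be sketched for the greedy case. This is a simplified variant under stated assumptions: both models are plain Python callables that map a context to one next token, acceptance is an exact match against the big model's greedy choice (sampling-based systems use a probabilistic accept/reject rule instead), and the verification loop stands in for what would be a single batched forward pass. The names `speculative_step`, `draft_next`, and `target_next` are hypothetical.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """Return the tokens produced by one draft-and-verify round (greedy variant)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The big model checks each position (one batched pass in a real system).
    accepted = []
    ctx = list(context)
    for token in proposed:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched: the verify pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target counts context length mod 5; the draft errs once.
target = lambda ctx: len(ctx) % 5
draft = lambda ctx: 9 if len(ctx) == 3 else len(ctx) % 5

print(speculative_step([0], draft, target))   # [1, 2, 3]
print(speculative_step([0], target, target))  # [1, 2, 3, 4, 0]
```

In both runs the output equals what the target alone would have generated; the draft only changes how many tokens each big-model pass yields.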

What breaks at scale

Thousands of concurrent streams make autoscaling on queue depth and predictable tail latency the hard problems: a few long generations can starve everyone else. GPU memory (KV cache) is the scaling unit, so capacity-plan on total tokens in flight (concurrent sequences × their lengths), not request count, and isolate long jobs from interactive traffic.
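
These three controls can be sketched as small policy functions. All names, thresholds, and limits below are hypothetical placeholders: a real deployment would tune them against measured traffic and would add smoothing so the autoscaler does not flap.

```python
import math


def desired_workers(queue_depth, target_queue_per_worker, min_workers=1, max_workers=64):
    """Scale GPU workers so each one has about target_queue_per_worker queued requests."""
    want = math.ceil(queue_depth / target_queue_per_worker)
    return max(min_workers, min(max_workers, want))


def pick_lane(max_new_tokens, long_threshold=2048):
    """Send long generations to a separate pool so they cannot starve chat traffic."""
    return "batch" if max_new_tokens > long_threshold else "interactive"


def can_admit(active_token_reservations, prompt_tokens, max_new_tokens, kv_token_budget):
    """Admit on tokens in flight (the KV footprint), not on request count."""
    in_flight = sum(active_token_reservations)
    return in_flight + prompt_tokens + max_new_tokens <= kv_token_budget


print(desired_workers(queue_depth=250, target_queue_per_worker=20))  # 13
print(pick_lane(max_new_tokens=4096))                                # batch
print(can_admit([4000, 3000], 500, 500, kv_token_budget=8192))       # True
print(can_admit([4000, 3000], 1000, 500, kv_token_budget=8192))      # False
```

The last two calls involve the same number of requests, yet only one fits, which is the point of planning on tokens instead of request count.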

In production

vLLM (paged attention + continuous batching), TGI, TensorRT-LLM, and SGLang are the standard serving stacks. The techniques — continuous batching, paged KV cache, prefix caching, speculative decoding — are now table stakes for cost-efficient inference.

Common mistakes

Part of Agentic AI Systems on SystemLore: system design interview prep with 148 deep topics, interactive diagrams, and a practice game.