System Design Library

Live Streaming (Twitch)

Ingest a streamer's feed and broadcast to millions with low latency.

Requirements

Functional

Ingest (RTMP)
Transcode to bitrates
Deliver (HLS/LL-HLS)
Live chat

Non-functional

Low glass-to-glass latency
Massive concurrent viewers

Scale

1 stream → millions of viewers

The approach

The streamer pushes via RTMP to an ingest server; the feed is transcoded to multiple bitrates in real time, segmented, and pushed to CDN edges; viewers pull adaptive segments, and Low-Latency HLS shrinks the delay.

Key components

RTMP ingest → realtime transcoder → packager → CDN → viewers

Numbers that matter

A popular Twitch stream at 1080p60 is encoded at roughly 6 Mbps; a concurrent viewership of 1 million requires the CDN to deliver ~6 Tbps for that single stream — only feasible via edge caching, not origin serving.
HLS segment duration is typically 2-6 seconds; with 4-second segments and a 3-segment buffer, standard HLS latency is ~12-18 seconds from stream capture to viewer display.
Twitch transcodes each stream to typically 4-6 quality tiers (1080p60, 720p60, 480p, 360p, 160p); at 100k concurrent streams, the transcoding fleet is one of the largest compute workloads in live video.
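Low-Latency HLS uses partial segments (~200ms) to reduce glass-to-glass latency to 2-5 seconds, but requires the origin and CDN to support blocking playlist reloads and `EXT-X-PRELOAD-HINT` requests (early drafts relied on HTTP/2 server push, which the final specification dropped).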
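
The first two figures above are simple arithmetic, and a quick check shows where they come from. The sketch below uses only the numbers quoted in the text; the function names are illustrative. The buffer term gives the 12-second floor, and the rest of the 12-18 second range would come from encode, packaging, and CDN propagation, which are not modeled here.

```python
# Back-of-envelope checks for the bandwidth and latency figures above.

BITRATE_MBPS = 6        # 1080p60 rendition, from the text
VIEWERS = 1_000_000     # concurrent viewers, from the text


def egress_tbps(bitrate_mbps: float, viewers: int) -> float:
    """Aggregate CDN egress in Tbps (1 Tbps = 1,000,000 Mbps)."""
    return bitrate_mbps * viewers / 1_000_000


def buffer_latency_s(segment_s: float, buffered_segments: int) -> float:
    """Latency contributed by the player's segment buffer alone."""
    return segment_s * buffered_segments


print(egress_tbps(BITRATE_MBPS, VIEWERS))  # 6.0
print(buffer_latency_s(4, 3))              # 12
```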

Senior deep-dive

Ingest and delivery are separate pipeline stages — the streamer pushes RTMP to a small number of ingest points; transcoding happens there; delivery to millions of viewers is a separate CDN pull problem where the CDN does the fan-out.

Adaptive bitrate (ABR) via HLS segments is the delivery mechanism — the stream is segmented into 2-6 second chunks at multiple bitrates; viewers' players choose the appropriate quality tier based on available bandwidth, making CDN caching straightforward (segments are immutable files).
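
A minimal sketch of the player-side choice described above, assuming a simple throughput-based rule. The tier names come from the text; every bitrate except the 6 Mbps top tier, the `safety` margin, and the `pick_rendition` helper are illustrative assumptions, and real players also weigh buffer level and switch history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int


# Tier names from the text; bitrates below 6000 kbps are assumed for illustration.
LADDER = [
    Rendition("1080p60", 6000),
    Rendition("720p60", 3500),
    Rendition("480p", 1500),
    Rendition("360p", 700),
    Rendition("160p", 250),
]


def pick_rendition(throughput_kbps: float, safety: float = 0.8) -> Rendition:
    """Pick the highest tier whose bitrate fits within a safety margin
    of the measured throughput; fall back to the lowest tier."""
    budget = throughput_kbps * safety
    for rendition in sorted(LADDER, key=lambda r: r.bitrate_kbps, reverse=True):
        if rendition.bitrate_kbps <= budget:
            return rendition
    return min(LADDER, key=lambda r: r.bitrate_kbps)


print(pick_rendition(10_000).name)  # 1080p60
print(pick_rendition(4_000).name)   # 480p (budget is 3200 kbps)
```

Because every viewer who lands on a tier requests the same segment files, this per-client decision does not reduce cache hit rates at the edge.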

Latency is a product decision, not a technical limit: standard HLS typically runs 10-30 seconds behind live, and Low-Latency HLS (LL-HLS) cuts this to 2-5 seconds by publishing partial segments before the full segment is complete. The tradeoff is increased CDN complexity and less cache-friendly delivery.

Ingest: from RTMP push to segment pipeline

Streamers push RTMP (or the newer SRT/WebRTC for lower latency) to the nearest ingest PoP. The ingest server remuxes (or transcodes) the stream into HLS segments at the configured output qualities. Each segment is written to a shared object store (or a fast local filesystem replicated to origin), and the CDN origin then serves these segments on demand. The ingest server is a single point of failure for that stream: if it crashes, the streamer must reconnect, causing a visible disruption. Twitch uses redundant ingest paths for partnered streamers.

Transcoding fleet: the dominant compute cost

Transcoding 100k concurrent streams to 5 quality tiers each requires an enormous compute fleet. GPU-accelerated encoding (NVENC) is commonly cited as 5-10× cheaper than CPU x264 at comparable quality. Each transcoding job is stateful: you can't move or split a single stream across transcoders mid-stream, because frames depend on one another (P-frames reference earlier frames, and B-frames also reference frames that come later in display order) and the encoder carries rate-control state. Fault tolerance therefore means detecting a transcoder failure and restarting the job on a new instance, which leaves a ~5-10 second segment gap that viewers see as buffering.
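
To get a feel for the fleet size, here is a rough sizing sketch. The stream and tier counts come from the text; `encodes_per_gpu` and `headroom_pct` are hypothetical planning inputs, not measured NVENC or Twitch figures, and `gpus_needed` is an illustrative helper.

```python
import math


def gpus_needed(streams: int, tiers: int, encodes_per_gpu: int,
                headroom_pct: int = 20) -> int:
    """GPUs required to run streams * tiers concurrent encodes,
    with spare capacity reserved for failover restarts and spikes."""
    total_encodes = streams * tiers
    return math.ceil(total_encodes * (100 + headroom_pct) / (100 * encodes_per_gpu))


# 100k streams x 5 tiers from the text; 20 encodes per GPU is an assumption.
print(gpus_needed(100_000, 5, encodes_per_gpu=20))  # 30000
```

The headroom term matters because a restarted job needs a free slot immediately; without spare capacity, the segment gap after a failure grows beyond the restart time.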

HLS delivery: immutable segments and CDN caching

Each HLS segment is an immutable file (e.g. `stream_1080p_00042.ts`): once written, it never changes. This makes CDN caching trivial: cache-forever on the segment, short TTL on the playlist (`.m3u8`). The playlist lists the last N segments (a sliding window), and clients refetch it about once per segment duration to discover new segments. Cache efficiency for a hot stream is near 100%, because the CDN edge serves each segment from memory without hitting origin once the first viewer on that edge has fetched it.
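
The sliding window and the two caching policies can be sketched as follows. This is a simplified illustration, not a production packager: the `LivePlaylist` class, the 3-segment window, and the exact `Cache-Control` values are assumptions, while the playlist tags are standard HLS.

```python
from collections import deque


class LivePlaylist:
    """Keeps the most recent `window` segments and renders a live media playlist."""

    def __init__(self, target_duration: int, window: int = 3):
        self.target_duration = target_duration
        self.segments = deque(maxlen=window)  # entries: (sequence, uri, duration)

    def add_segment(self, seq: int, duration: float) -> None:
        # Segment names follow the example in the text.
        self.segments.append((seq, f"stream_1080p_{seq:05d}.ts", duration))

    def render(self) -> str:
        first_seq = self.segments[0][0] if self.segments else 0
        lines = [
            "#EXTM3U",
            "#EXT-X-VERSION:3",
            f"#EXT-X-TARGETDURATION:{self.target_duration}",
            f"#EXT-X-MEDIA-SEQUENCE:{first_seq}",
        ]
        for _, uri, duration in self.segments:
            lines.append(f"#EXTINF:{duration:.3f},")
            lines.append(uri)
        return "\n".join(lines) + "\n"


def cache_control(path: str) -> str:
    """Short TTL for the mutable playlist, long-lived for immutable segments."""
    if path.endswith(".m3u8"):
        return "public, max-age=1"
    return "public, max-age=31536000, immutable"


playlist = LivePlaylist(target_duration=4)
for seq in range(40, 44):
    playlist.add_segment(seq, 4.0)
print(playlist.render())  # lists segments 41-43; segment 40 has slid out
```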

Low-latency HLS: reducing the segment window

Standard HLS buffers 3+ complete segments for smooth playback; at 4s segments, that's 12+ seconds of latency. LL-HLS introduces partial segments (~200ms chunks) that are published before the full segment is complete, and playlist delta updates so clients only download what changed. The player buffers fewer and smaller chunks, reducing latency to 2-5 seconds. The CDN complexity increases significantly: partial segments are small and short-lived, so they cache less effectively than full segments, and the origin and CDN must hold open blocking playlist reloads and preload-hint requests (a form of long-polling) until the next part exists. Early LL-HLS drafts used HTTP/2 server push for this, but the final specification dropped it.
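
As a rough illustration of what the player sees, the sketch below builds the tail of an LL-HLS playlist for a segment that is still being produced, plus the blocking reload URL a client would request next. The tags and the `_HLS_msn` / `_HLS_part` query parameters are part of LL-HLS; the file naming scheme and both helper functions are assumptions made for this example.

```python
def ll_hls_tail(seq: int, parts_ready: int, part_s: float = 0.2) -> list[str]:
    """Playlist lines for an in-progress segment: one EXT-X-PART per finished
    part, then a preload hint for the part the packager is writing now."""
    lines = [
        f'#EXT-X-PART:DURATION={part_s:.3f},URI="seg{seq}.part{i}.mp4"'
        for i in range(parts_ready)
    ]
    lines.append(
        f'#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg{seq}.part{parts_ready}.mp4"'
    )
    return lines


def blocking_reload_url(playlist_url: str, msn: int, part: int) -> str:
    """URL for a blocking playlist reload: the server holds the response
    until the playlist contains the given media sequence number and part."""
    return f"{playlist_url}?_HLS_msn={msn}&_HLS_part={part}"


for line in ll_hls_tail(seq=42, parts_ready=2):
    print(line)
print(blocking_reload_url("live.m3u8", msn=42, part=2))
```

The held-open requests are what make edge behavior harder: the CDN has to park many identical requests for a resource that does not exist yet, then release them together.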

Chat: intentionally decoupled from video

Twitch Chat uses WebSocket connections to a chat cluster separate from the video delivery infrastructure. Messages are broadcast to all subscribers in a channel via a pub/sub bus. At peak for a major event, a single channel's chat can exceed 100,000 messages per minute — Twitch rate-limits and samples the chat display (not every message is shown) to prevent the chat from becoming an unreadable scroll. The 10-30 second video delay means chat reactions to in-game events are inherently desynchronized from what viewers are watching.
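
The fan-out and display sampling can be sketched with an in-memory stand-in. This is not Twitch's implementation: `ChatBus` and `DisplaySampler` are hypothetical names, the bus replaces a real pub/sub system and WebSocket layer, and the fixed-window cap is just one simple way to sample.

```python
from collections import defaultdict


class ChatBus:
    """In-memory pub/sub: every subscriber of a channel receives every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers[channel]:
            callback(message)


class DisplaySampler:
    """Shows at most `max_per_window` messages per time window, drops the rest."""

    def __init__(self, max_per_window: int, window_s: float = 1.0):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.shown = []
        self._window_start = 0.0
        self._count = 0

    def offer(self, message, now: float) -> bool:
        if now - self._window_start >= self.window_s:
            self._window_start = now  # start a fresh window
            self._count = 0
        if self._count < self.max_per_window:
            self._count += 1
            self.shown.append(message)
            return True
        return False


bus = ChatBus()
viewer = DisplaySampler(max_per_window=2)
bus.subscribe("finals", lambda m: viewer.offer(m, now=0.0))
for i in range(5):
    bus.publish("finals", f"msg{i}")
print(viewer.shown)  # ['msg0', 'msg1']
```

Passing `now` explicitly keeps the sampler deterministic for testing; a real client would read a monotonic clock.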

What breaks at scale

Ingest failover for high-profile streams (esports finals, major streamers) is the highest-visibility failure: a 30-second stream interruption during a peak moment generates massive social media backlash.
CDN thundering herd at segment boundaries is the second: when a 500k-viewer stream's new segment becomes available, all viewers' players poll for it simultaneously, causing a spike of identical requests that can overwhelm a CDN PoP's origin-fetch concurrency.
Transcoding fleet capacity during unexpected viral events is the third: when a celebrity goes live without warning and concurrent streams spike 2×, predictive capacity planning based on historical patterns fails, and the transcoding fleet has to scale horizontally on minutes' notice.

In production

Twitch reportedly combines Amazon CloudFront with its own globally distributed edge and ingest infrastructure ("Video Ingest"). The transcoding pipeline runs on GPU instances (NVENC hardware encoding is far cheaper per stream than x264 software at quality parity). The real challenge is tail latency for hot streams: a stream with 100k viewers hitting the same CDN edge PoP can overwhelm that PoP's origin fetch capacity at a segment boundary. CDNs mitigate this thundering herd with request coalescing, which collapses concurrent cache misses for the same object into a single origin fetch. Chat is deliberately separated from the video pipeline: it runs on Twitch's own WebSocket infrastructure, and the 10-30 second video delay means chat is inherently desynchronized from stream action.
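
Request coalescing is easy to show in miniature. The sketch below is an assumed, simplified edge cache using `asyncio`, not any CDN vendor's code: `CoalescingCache` and the fake `fetch_origin` are hypothetical. Concurrent misses for the same key share one in-flight origin fetch.

```python
import asyncio


class CoalescingCache:
    """Edge cache that collapses concurrent misses for a key into one origin fetch."""

    def __init__(self, fetch_origin):
        self._fetch_origin = fetch_origin
        self._cache = {}
        self._inflight = {}  # key -> asyncio.Task for the fetch in progress

    async def get(self, key):
        if key in self._cache:
            return self._cache[key]
        task = self._inflight.get(key)
        if task is None:
            # First miss for this key: start the single origin fetch.
            task = asyncio.create_task(self._fill(key))
            self._inflight[key] = task
        # Later misses await the same task instead of hitting origin.
        return await task

    async def _fill(self, key):
        try:
            value = await self._fetch_origin(key)
            self._cache[key] = value
            return value
        finally:
            self._inflight.pop(key, None)


async def demo():
    origin_hits = 0

    async def fetch_origin(key):
        nonlocal origin_hits
        origin_hits += 1
        await asyncio.sleep(0.01)  # simulated origin round trip
        return f"bytes-of-{key}"

    edge = CoalescingCache(fetch_origin)
    results = await asyncio.gather(
        *(edge.get("stream_1080p_00042.ts") for _ in range(1000))
    )
    return origin_hits, len(set(results))


print(asyncio.run(demo()))  # (1, 1): one origin fetch, one distinct payload
```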

Common mistakes

Treating it like VOD (batch transcode)
No CDN for viewer fan-out
Ignoring the latency/cacheability tradeoff

Part of System Design Library on SystemLore.