YouTube / Netflix (video)
Upload, store, transcode and stream video to millions globally on any device/network.
Requirements
Functional
- Upload
- Transcode to many bitrates
- Adaptive streaming
- Recommendations
- Views/likes
Non-functional
- Smooth global playback
- Massive storage & egress
Scale
Exabytes of stored video; billions of views.
The approach
Upload → object store + queue → transcoding workers produce HLS/DASH renditions (multiple bitrates) → served via CDN near users. Metadata in DB + cache; never stream from origin.
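A minimal sketch of that flow, assuming in-memory stand-ins: a dict for the object store, `queue.Queue` for the job queue, and a placeholder "encode" that only tags the bytes. The names `handle_upload` and `run_worker` are hypothetical. The point is only that the upload returns before any transcoding happens.

```python
import queue

object_store = {}                # stand-in for an object store (key -> bytes)
transcode_jobs = queue.Queue()   # stand-in for a durable job queue
renditions = {}                  # (video_id, rendition name) -> transcoded bytes


def handle_upload(video_id, data):
    """Store the source and enqueue work; return without transcoding."""
    object_store[f"source/{video_id}"] = data
    transcode_jobs.put(video_id)
    return {"video_id": video_id, "status": "processing"}


def run_worker(ladder=("360p", "720p")):
    """Drain the queue once; a real worker would run continuously elsewhere."""
    done = []
    while not transcode_jobs.empty():
        video_id = transcode_jobs.get()
        source = object_store[f"source/{video_id}"]
        for name in ladder:
            # Placeholder for a real encode: tag the bytes with the rendition name.
            renditions[(video_id, name)] = name.encode() + b":" + source
        transcode_jobs.task_done()
        done.append(video_id)
    return done
```

Calling `handle_upload("v1", b"raw")` returns a "processing" status immediately; the renditions only appear after `run_worker()` has drained the queue.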
Key components
Upload → object store + queue → transcode workers → CDN · metadata DB + cache · recsys
Numbers that matter
- One source video becomes ~5–10 renditions (resolution × bitrate), each split into 2–10s segments for adaptive switching — transcoding is the heavy, parallelizable upload-time cost.
- Egress, not storage, is the dominant cost — billions of views means CDN bandwidth dwarfs everything; the design exists to keep bytes off your origin.
- Never transcode on the hot read path or stream from origin: transcode ahead of time and serve from the edge, so a playback request just fetches pre-made segments from the nearest CDN node.
- HLS/DASH = segments + a manifest — the player reads the manifest, then pulls segments at whatever bitrate current bandwidth supports.
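The numbers above can be turned into a back-of-envelope calculation. This is a rough sketch: the five-step ladder and its bitrates are illustrative assumptions, not figures from any real service, and audio and container overhead are ignored.

```python
import math

# Hypothetical ladder: (name, video bitrate in kbit/s). Values are illustrative.
LADDER = [("240p", 400), ("360p", 800), ("480p", 1400), ("720p", 2800), ("1080p", 5000)]


def segment_count(duration_s, segment_s):
    """Number of segments needed to cover the video at one rendition."""
    return math.ceil(duration_s / segment_s)


def rendition_bytes(duration_s, bitrate_kbps):
    """Approximate size of one rendition: seconds * bits per second / 8."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


def footprint(duration_s, segment_s, ladder=LADDER):
    """Segment objects and stored bytes produced by one source video."""
    segs = segment_count(duration_s, segment_s)
    return {
        "segments_per_rendition": segs,
        "total_segment_objects": segs * len(ladder),
        "total_bytes": sum(rendition_bytes(duration_s, kbps) for _, kbps in ladder),
    }


def egress_bytes(views, watched_s, bitrate_kbps):
    """Bytes delivered if every view watches `watched_s` seconds at one bitrate."""
    return views * rendition_bytes(watched_s, bitrate_kbps)
```

Under these assumptions, a 10-minute video in 6s segments yields 100 segments per rendition, 500 segment objects, and about 780 MB stored across the ladder. One million full views at the 720p rate move about 210 TB, which is why egress rather than storage drives the design.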
Senior deep-dive
Adaptive bitrate is the core UX win: the player switches rendition to match bandwidth, so playback drops in quality instead of stalling.
That means pre-transcoding many versions (resolutions × bitrates) — heavy, async work done on upload, never on read.
Egress, followed by storage, dominates the bill, so serve everything from a CDN near users; Netflix even places caches inside ISP networks (Open Connect).
Adaptive bitrate streaming (HLS/DASH)
The video is split into short segments (2–10s) at multiple bitrates, described by a manifest. The player reads the manifest and picks each segment's bitrate from current bandwidth — dropping quality to avoid a stall, climbing back when the network recovers. This client-side switching is the whole playback UX.
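A small sketch of both halves: an HLS-style master playlist listing the renditions, and a throughput-based picker. The ladder values, URIs, and the 0.8 safety factor are assumptions for illustration; real players typically also weigh buffer level, which is omitted here.

```python
# Hypothetical ladder: (bandwidth in bits/s, resolution, media playlist URI).
LADDER = [
    (400_000, "426x240", "240p/index.m3u8"),
    (1_400_000, "854x480", "480p/index.m3u8"),
    (2_800_000, "1280x720", "720p/index.m3u8"),
    (5_000_000, "1920x1080", "1080p/index.m3u8"),
]


def master_playlist(ladder):
    """Render a minimal HLS master playlist: one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for bandwidth, resolution, uri in ladder:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(uri)
    return "\n".join(lines) + "\n"


def pick_variant(ladder, throughput_bps, safety=0.8):
    """Highest variant that fits within a safety margin of measured throughput.

    Falls back to the lowest variant when nothing fits, so playback continues
    at reduced quality rather than stopping.
    """
    budget = throughput_bps * safety
    affordable = [v for v in ladder if v[0] <= budget]
    if affordable:
        return max(affordable, key=lambda v: v[0])
    return min(ladder, key=lambda v: v[0])


def simulate(ladder, throughput_trace_bps):
    """Resolution chosen for each segment, given one throughput sample per segment."""
    return [pick_variant(ladder, t)[1] for t in throughput_trace_bps]
```

For a throughput trace of 8, 3, 0.3, and 4 Mbit/s, `simulate` picks 1080p, 480p, 240p, and 720p in turn: quality drops as the network degrades and climbs back when it recovers.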
Transcoding is heavy, async, and parallel
On upload a transcode farm produces every rendition (resolutions × codecs × bitrates) — CPU/GPU-intensive work that must be off the upload path (queue + workers). It is embarrassingly parallel: split the video, transcode segments concurrently, reassemble. Never make the uploader or viewer wait on it.
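The split, transcode, reassemble pattern can be sketched as follows. `transcode_chunk` is a hypothetical stand-in that returns a label instead of invoking a real encoder; a thread pool is assumed because a real worker would mostly wait on an external encoder process.

```python
from concurrent.futures import ThreadPoolExecutor


def split(duration_s, chunk_s):
    """Return (start, end) boundaries that cover the whole video."""
    bounds = []
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def transcode_chunk(job):
    """Stand-in for a real encode of one chunk at one rendition."""
    (start, end), rendition = job
    return f"{rendition}:{start}-{end}"


def transcode_all(duration_s, chunk_s, renditions, workers=4):
    """Fan out every (chunk, rendition) pair, then regroup results in order."""
    chunks = split(duration_s, chunk_s)
    jobs = [(chunk, r) for r in renditions for chunk in chunks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so reassembly is a zip.
        results = list(pool.map(transcode_chunk, jobs))
    out = {r: [] for r in renditions}
    for (_, rendition), encoded in zip(jobs, results):
        out[rendition].append(encoded)
    return out
```

Each (chunk, rendition) job is independent, so adding workers shortens wall-clock time without changing the output; ordering is restored from the job list rather than from completion order.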
The CDN is the product at scale
With billions of views, egress bandwidth is the dominant cost and the latency lever. Pre-position popular content on CDN edges near users; never stream from origin. Netflix's Open Connect goes further — appliances inside ISPs — because the cheapest byte is the one served closest to the viewer.
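To illustrate how an edge keeps bytes off the origin, here is a minimal sketch of an edge node as an LRU cache in front of an origin fetch. LRU is an assumption chosen for brevity; production CDNs use more elaborate admission and eviction policies, and `EdgeCache` is a hypothetical name.

```python
from collections import OrderedDict


class EdgeCache:
    """LRU cache of segments; only misses reach the origin."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity
        self.fetch = fetch_from_origin
        self.store = OrderedDict()   # oldest entry first, most recent last
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.fetch(key)             # origin egress happens only here
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The hit ratio is the number to watch: every point of hit ratio gained at the edge is origin bandwidth that is never paid for.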
Storage tiering for the long tail
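A tiny fraction of videos gets most of the views. Keep hot content on fast storage + CDN; tier cold content to cheaper storage with fewer renditions. You don't need every rendition of an unwatched video pre-made: generate cold renditions lazily on first demand. This is the one deliberate exception to transcoding ahead of time; the first viewer of a cold rendition pays a one-time delay, which is acceptable because such requests are rare.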
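A sketch of the tiering decision and of lazy rendition generation. The threshold, tier names, and ladders are illustrative assumptions, and `RenditionStore` is a hypothetical helper that memoizes whatever the supplied transcode function returns.

```python
HOT_VIEWS_PER_DAY = 1000   # hypothetical cut-off between hot and cold

FULL_LADDER = ["240p", "360p", "480p", "720p", "1080p"]
COLD_LADDER = ["360p", "720p"]   # fewer pre-made renditions for cold videos


def plan(views_per_day):
    """Pick a storage tier and the renditions to keep pre-made."""
    if views_per_day >= HOT_VIEWS_PER_DAY:
        return ("hot", FULL_LADDER)
    return ("cold", COLD_LADDER)


class RenditionStore:
    """Generate a missing rendition on first request, then reuse it."""

    def __init__(self, transcode):
        self.transcode = transcode
        self.made = {}          # (video_id, rendition) -> stored result
        self.transcodes = 0

    def get(self, video_id, rendition):
        key = (video_id, rendition)
        if key not in self.made:
            # Only the first request for this rendition pays the transcode cost.
            self.made[key] = self.transcode(video_id, rendition)
            self.transcodes += 1
        return self.made[key]
```

A second request for the same cold rendition is served from `made` without another transcode, so the extra cost is bounded by one encode per rendition that someone actually asks for.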
Upload, metadata, and counts
Upload is a resumable, chunked transfer to object storage; metadata (title, owner) lives in a DB + cache. View and like counts are high-volume and approximate — aggregate them asynchronously rather than incrementing a row per view.
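Asynchronous count aggregation can be sketched as a buffer that is flushed in batches. The dict standing in for the database and the `ViewCounter` name are assumptions; a real system would flush on a timer and usually from a stream or log rather than process memory.

```python
from collections import Counter


class ViewCounter:
    """Buffer view events and apply one write per video per flush."""

    def __init__(self, db):
        self.db = db                 # stand-in for the counts table (dict)
        self.pending = Counter()
        self.writes = 0              # number of database writes performed

    def record(self, video_id):
        """Called once per view; touches only the in-memory buffer."""
        self.pending[video_id] += 1

    def flush(self):
        """Fold buffered counts into the database, one write per video."""
        for video_id, n in self.pending.items():
            self.db[video_id] = self.db.get(video_id, 0) + n
            self.writes += 1
        self.pending.clear()
```

A thousand views of one video become a single write at flush time. The trade-off is that the stored count lags behind reality between flushes, which is acceptable because these counts are approximate anyway.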
What breaks at scale
The hard parts are transcode throughput/cost, CDN egress economics, and storage for the rendition explosion. Tier storage by popularity, generate cold renditions lazily, push caching as close to viewers as possible. Recommendations and search are separate systems on top — the spine is transcode + CDN.
In production
YouTube and Netflix both follow the same broad pipeline: ingest → object storage → async transcode farm → HLS/DASH segments → global CDN. Netflix's Open Connect puts its own caching appliances inside ISP networks, so popular titles are served from within the viewer's own ISP rather than across the wider internet. The interesting engineering is the transcode pipeline and CDN economics, not the upload.
Common mistakes
- Streaming from origin (no CDN)
- Synchronous transcoding blocking upload
- A single bitrate (stalls on slow or mobile networks)