System Design Library

Build a CDN

Cache and serve content from edge locations worldwide with smart invalidation.

Requirements

Functional

  • Cache at edge POPs
  • Origin fetch on miss
  • Invalidation/purge
  • Geo-routing

Non-functional

  • Low latency near users
  • High hit ratio
  • Fast purge

Scale

Global, petabytes

The approach

Many edge POPs cache content. Users are routed to the nearest POP via anycast or GeoDNS; a cache miss is fetched from origin (or from a mid-tier cache); content is invalidated through versioned URLs or a purge propagation system; and consistent hashing spreads objects across the cache servers within a POP.
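
As an illustration of the consistent hashing step, here is a minimal sketch of a hash ring that maps object keys to cache servers inside one POP. The class name `HashRing`, the server names, and the choice of 100 virtual nodes per server are assumptions made for the example, not details of any particular CDN.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point, node) tuples
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 give a stable 64-bit ring position.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node):
        # Each server owns many points so load spreads evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(point, owner) for point, owner in self._ring if owner != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-1", "cache-2", "cache-3", "cache-4"])
print(ring.node_for("/img/logo.png"))
```

The property that matters for a cache cluster is that removing a server only remaps the keys that server owned; every other key stays where it was, so the rest of the cluster keeps its warm cache.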

Key components

GeoDNS/anycast → edge POPs (cache) → mid-tier → origin · purge system

Senior deep-dive

Cache-hit ratio at the edge is the metric that matters most: origin load scales with the miss rate, so a 60% hit rate cuts origin read traffic by 60% (to 40% of requests), while a 95% hit rate leaves origin serving only 5% of reads.
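
The relationship between hit ratio and origin load is simple arithmetic: origin sees only the misses. A small sketch, assuming a hypothetical one million read requests and ignoring revalidation and uncacheable traffic:

```python
def origin_requests(total_requests: int, hit_ratio: float) -> float:
    """Requests that miss the edge and must be served by origin."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return total_requests * (1.0 - hit_ratio)


for ratio in (0.60, 0.90, 0.95, 0.99):
    misses = origin_requests(1_000_000, ratio)
    print(f"{ratio:.0%} hit ratio -> {misses:,.0f} origin requests per million")
```

Note that the gains are not linear in the hit ratio: moving from 95% to 99% cuts origin traffic by a factor of five, which is why the last few points of hit ratio are worth chasing.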

Routing to the nearest POP via anycast or GeoDNS determines the first hop: the difference between a POP 10ms away and one 200ms away is paid on every round trip to the edge, including each TLS handshake and each uncached request.

Invalidation is the hardest problem. Purge-based invalidation can reach hundreds of POPs within seconds but is operationally complex; versioned URLs (embedding a content hash in the filename) make invalidation unnecessary, at the cost of coordinating deployments so that pages reference the new URLs.
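
Content-hashed filenames can be sketched in a few lines. The function name `versioned_name`, the 8-character digest length, and the `name.<hash>.ext` layout are assumptions for illustration; build tools differ in the exact scheme they use.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Return the path with a short content hash inserted before the extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old = versioned_name("static/app.js", b"console.log('v1');")
new = versioned_name("static/app.js", b"console.log('v2');")
print(old)
print(new)
```

Because any change to the bytes produces a new URL, the old object can be cached with a very long TTL and never needs purging; the HTML that references the asset is what has to be updated in step with the deployment.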

Routing: anycast vs GeoDNS

Anycast announces the same IP prefix from every POP via BGP, so internet routing delivers each user to the topologically nearest POP (nearest in BGP path terms, which is not always the lowest latency), and failover is automatic when a POP withdraws its route. GeoDNS returns different A records based on the resolver's IP (or the client subnet, when the resolver sends EDNS Client Subnet). It is simpler to reason about but requires a geolocation mapping database and reacts slowly, because cached answers live until the DNS TTL expires (typically 30–300s). Many modern CDNs use anycast for the edge IP and GeoDNS for fallback or multi-CDN routing.
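
The GeoDNS side of this can be sketched as "pick the closest healthy POP for the resolver's location". The POP codes, coordinates, and function names below are hypothetical, and great-circle distance is used only as a stand-in for network proximity; production systems typically rely on measured latency rather than geography alone.

```python
import math

# Hypothetical POPs: code -> (latitude, longitude)
POPS = {
    "fra": (50.11, 8.68),
    "iad": (38.95, -77.45),
    "sin": (1.35, 103.99),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_pop(resolver_location, healthy=None):
    """Choose the closest POP, optionally restricted to a set of healthy POPs."""
    candidates = {name: loc for name, loc in POPS.items()
                  if healthy is None or name in healthy}
    if not candidates:
        raise LookupError("no healthy POP available")
    return min(candidates, key=lambda name: haversine_km(resolver_location, candidates[name]))


paris = (48.85, 2.35)
print(nearest_pop(paris))                           # closest POP overall
print(nearest_pop(paris, healthy={"iad", "sin"}))   # failover when fra is down
```

The failover path also shows why GeoDNS reacts slowly: the new answer only takes effect once resolvers discard the cached record.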

Cache hierarchy: edge, mid-tier, origin

A two-tier hierarchy (edge POP → regional mid-tier → origin) dramatically improves hit rates for long-tail content. Edge POPs are numerous and small, and each sees only a slice of global traffic, so individually their hit rates on long-tail objects are low. A mid-tier shield (Fastly calls this shielding, Cloudflare calls it Tiered Cache, and CloudFront calls it Origin Shield) consolidates misses before they reach origin, and can lift the effective hit rate seen by origin from around 70% to 95% or more. The non-obvious cost: the mid-tier adds one hop of latency on misses, so it is worth it only when the miss rate to origin would otherwise be high.
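
A back-of-the-envelope model shows how the tiers compound. It assumes the shield's hit ratio is measured only on requests that already missed at the edge, and that the two ratios are independent; the 70% and 85% figures are illustrative inputs, not measurements.

```python
def origin_miss_ratio(edge_hit: float, shield_hit: float) -> float:
    """Fraction of all requests that miss both tiers and reach origin."""
    return (1.0 - edge_hit) * (1.0 - shield_hit)


def effective_hit_ratio(edge_hit: float, shield_hit: float) -> float:
    """Hit ratio as seen from origin's point of view."""
    return 1.0 - origin_miss_ratio(edge_hit, shield_hit)


edge, shield = 0.70, 0.85
print(f"origin sees {origin_miss_ratio(edge, shield):.1%} of requests")
print(f"effective hit ratio: {effective_hit_ratio(edge, shield):.1%}")
```

Under these assumptions a 70% edge hit ratio combined with an 85% shield hit ratio leaves origin serving 4.5% of requests, an effective hit ratio of 95.5%.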

Invalidation: the distributed state problem

Purge APIs (Fastly's Instant Purge, Cloudflare Cache Purge) propagate invalidations to all POPs in roughly 150–500ms via a control-plane fan-out; at hundreds of POPs this requires a reliable broadcast mechanism. Surrogate keys (a response header listing logical tags for the cached object, such as Fastly's Surrogate-Key or Cloudflare's Cache-Tag) allow purging all objects tagged `product-123` without knowing their URLs. The safest pattern is immutable versioned URLs for static assets (JS/CSS/images): no invalidation is needed, because new deployments simply produce new URLs.
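
Purging by surrogate key needs a reverse index from tag to cached URLs. The following is a minimal in-memory sketch of that idea on a single cache node; the class `SurrogateKeyCache` and its methods are hypothetical and do not reflect any vendor's implementation or API.

```python
from collections import defaultdict


class SurrogateKeyCache:
    """Toy cache that can purge every object carrying a given tag."""

    def __init__(self):
        self._objects = {}                    # url -> cached body
        self._keys_by_url = {}                # url -> set of tags
        self._urls_by_key = defaultdict(set)  # tag -> set of urls

    def put(self, url, body, surrogate_keys):
        self._drop(url)  # clear stale tag associations on overwrite
        self._objects[url] = body
        self._keys_by_url[url] = set(surrogate_keys)
        for key in surrogate_keys:
            self._urls_by_key[key].add(url)

    def get(self, url):
        return self._objects.get(url)

    def purge_key(self, key):
        """Remove all objects tagged with key; return how many were purged."""
        urls = list(self._urls_by_key.get(key, ()))
        for url in urls:
            self._drop(url)
        return len(urls)

    def _drop(self, url):
        self._objects.pop(url, None)
        for key in self._keys_by_url.pop(url, ()):
            tagged = self._urls_by_key.get(key)
            if tagged is not None:
                tagged.discard(url)
                if not tagged:
                    del self._urls_by_key[key]


cache = SurrogateKeyCache()
cache.put("/products/123", "<html>...</html>", ["product-123", "catalog"])
cache.put("/api/products/123.json", "{...}", ["product-123"])
cache.put("/products/456", "<html>...</html>", ["product-456", "catalog"])
print(cache.purge_key("product-123"))  # purges both product-123 objects
```

In a real CDN the purge message for `product-123` would be broadcast to every POP, and each cache node would apply it against its own local index.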

Origin shield: collapsing the thundering herd

When a popular object expires, many clients miss at once and dozens of POPs each send cache-miss requests to origin: the thundering herd (or dog-pile) problem. Request coalescing (also called request collapsing) holds duplicate concurrent misses for the same object at a cache and sends only one upstream, then fans the single response back to all waiters. Coalescing at each POP reduces the herd to one request per POP; coalescing again at the origin shield reduces those to a single request to origin. This is table stakes for any CDN serving dynamic or semi-dynamic content with short TTLs.
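
Request coalescing can be sketched with a lock and a per-key future: the first miss becomes the leader and fetches upstream, and later misses for the same key wait on the leader's result. The class name `CoalescingFetcher` and the thread-based design are assumptions for illustration; real caches typically do this inside an event loop.

```python
import threading
from concurrent.futures import Future


class CoalescingFetcher:
    """Collapse concurrent misses for the same key into one upstream fetch."""

    def __init__(self, fetch_from_upstream):
        self._fetch = fetch_from_upstream
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future shared by all waiters

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if leader:
            try:
                future.set_result(self._fetch(key))
            except Exception as exc:
                # Waiters see the same failure instead of retrying in a stampede.
                future.set_exception(exc)
            finally:
                with self._lock:
                    self._in_flight.pop(key, None)
        # Leader and followers both read the shared result here.
        return future.result()


fetcher = CoalescingFetcher(lambda key: f"body for {key}")
print(fetcher.get("/hot.json"))
```

This sketch only deduplicates requests that overlap in time; it does not store the response, so it would sit in front of the cache's normal store-on-fill logic.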

Edge compute: logic without origin roundtrips

Cloudflare Workers, Fastly Compute@Edge, and Lambda@Edge run code inside the CDN: rewriting URLs, injecting auth headers, A/B splitting, and personalizing responses, all without a round trip to origin. The key constraints are no persistent local disk, a small CPU budget (~5–50ms per request), and near-zero cold starts (Workers uses V8 isolates rather than containers, which keeps cold starts on the order of milliseconds). Moving logic to the edge can substantially reduce time-to-first-byte for personalization use cases.
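
To make the A/B splitting case concrete, here is a language-neutral sketch of the logic an edge function might run, written in Python for readability (actual edge runtimes typically use JavaScript, WASM, or similar). The function names, the experiment name `new-checkout`, and the path-prefix rewrite are all hypothetical. The point is that bucket assignment is a pure function of the request, so it needs no state and no origin call.

```python
import hashlib


def ab_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def rewrite_path(path: str, user_id: str, experiment: str = "new-checkout") -> str:
    """Prefix the path with the variant so each variant caches separately."""
    return f"/{ab_variant(user_id, experiment)}{path}"


print(rewrite_path("/checkout", "user-42"))
```

Because the variant becomes part of the rewritten path, both versions of the page remain cacheable, and the same user always lands in the same bucket on every POP.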

What breaks at scale

Three failure modes dominate:

  • Cache poisoning: an attacker causes a POP to cache a malformed or malicious response and serve it to all users. Defending against it requires strict Vary header handling and cache key normalization (for example, query string canonicalization), so that every input that influences the response is part of the cache key.
  • POP overload during a viral event: a single piece of content going viral in one geography can overwhelm a regional POP before the mid-tier absorbs the spike. CDNs handle this with load shedding, request queuing, and circuit breakers to origin.
  • Inconsistent cache keys across POPs: if one POP normalizes query params and another doesn't, the same URL misses on some POPs and hits on others, making debugging nearly impossible.
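
Cache key normalization can be sketched with the standard library. The list of ignored tracking parameters and the decision to sort query parameters are assumptions for illustration; sorting is only safe when the origin does not depend on parameter order, and the same rules must run identically on every POP.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical list of parameters that never affect the response.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def cache_key(url: str) -> str:
    """Build a canonical cache key from a request URL."""
    parts = urlsplit(url)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    ]
    params.sort()  # order-insensitive: ?a=1&b=2 and ?b=2&a=1 share a key
    host = parts.netloc.lower()  # hostnames are case-insensitive
    path = parts.path or "/"
    key = f"{parts.scheme}://{host}{path}"
    query = urlencode(params)
    return f"{key}?{query}" if query else key


print(cache_key("https://Example.com/a?b=2&a=1&utm_source=news"))
print(cache_key("https://example.com/a?a=1&b=2"))
```

Both calls print the same key, so the two requests share one cached object instead of fragmenting the cache.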

In production

Akamai pioneered the distributed-edge model in 1999 and still runs 4000+ POPs, using consistent hashing to route requests within a POP's cache cluster. Cloudflare's Workers extended the CDN into a compute layer: every POP can run JS/WASM logic, enabling cache logic, A/B testing, and auth at the edge without hitting origin. Netflix's Open Connect takes CDN ownership to the extreme: Netflix-operated appliances sit inside ISP networks, serving video with zero public-internet hops. The real challenge is cache consistency across hundreds of POPs after a content update: while a purge propagates, which can take from a fraction of a second to several seconds, users on different POPs can see different versions of a page.
