Build a CDN
Cache and serve content from edge locations worldwide with smart invalidation.
Requirements
Functional
- Cache at edge POPs
- Origin fetch on miss
- Invalidation/purge
- Geo-routing
Non-functional
- Low latency near users
- High hit ratio
- Fast purge
Scale
Global, petabytes
The approach
Many edge POPs cache content; users routed to the nearest via anycast/GeoDNS; cache-miss fetches from origin (or a mid-tier); invalidation via versioned URLs or a purge propagation system; consistent hashing within a POP.
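The "consistent hashing within a POP" step can be sketched as a hash ring with virtual nodes. This is a minimal illustration, not any vendor's implementation; the class name `HashRing`, the server names, and the choice of 100 virtual nodes are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Maps cache keys to cache servers inside one POP (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node):
        # Each server owns many points so load spreads evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

Removing a server only remaps the keys that server owned; every other key keeps its cache, which is why a single server failure does not flush the whole POP.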
Key components
GeoDNS/anycast → edge POPs (cache) → mid-tier → origin · purge system
Numbers that matter
- Cloudflare operates 300+ POPs globally; anycast routes users to the nearest, typically achieving <50ms latency to a POP from anywhere in Europe or North America.
- A CDN cache-hit saves a typical origin RTT of 100–500ms and eliminates origin compute cost; major content platforms (Netflix, YouTube) achieve >95% cache hit ratios for popular content.
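- TLS session resumption (session tickets, or 0-RTT in TLS 1.3) saves one round trip on reconnects; without it, every new connection pays the full handshake (2 RTTs in TLS 1.2, 1 RTT in TLS 1.3) on top of the TCP handshake, which can add 100–300ms when the POP is far away.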
- CDN bandwidth costs are roughly 2–10× cheaper than egress from a cloud region's origin (~$0.01–0.05/GB vs $0.08–0.12/GB from AWS), making CDN economically mandatory for any media-heavy product.
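To make the cost bullet concrete, here is a rough sketch of the monthly bill with and without a CDN. The function name and the example prices (midpoints of the ranges above) are assumptions, and it assumes cache fills from origin are billed at the normal origin egress rate, which some providers discount.

```python
def monthly_egress_cost(total_gb, hit_ratio, cdn_per_gb, origin_per_gb):
    """Return (cost_with_cdn, cost_without_cdn) in dollars."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Every byte delivered to users leaves through the CDN.
    cdn_cost = total_gb * cdn_per_gb
    # Only cache misses pull bytes out of the origin.
    origin_fill_cost = total_gb * (1.0 - hit_ratio) * origin_per_gb
    return cdn_cost + origin_fill_cost, total_gb * origin_per_gb


# 1 PB/month, 95% hit ratio, $0.03/GB CDN, $0.09/GB origin egress (assumed).
with_cdn, without_cdn = monthly_egress_cost(1_000_000, 0.95, 0.03, 0.09)
```

With these assumed inputs the bill comes to about $34,500 with the CDN versus $90,000 without it.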
Senior deep-dive
Cache-hit ratio at the edge is the only metric that matters: origin load scales with the miss rate, so a 50% hit rate halves your origin load, while a 95% hit rate cuts it 20-fold and makes your origin nearly irrelevant for read traffic.
Routing to the nearest POP via anycast or GeoDNS is the first hop — the difference between a user hitting a POP 10ms away vs one 200ms away is felt on every uncached request and TLS handshake.
Invalidation is the hardest problem: purge-based invalidation propagates to hundreds of POPs in seconds but is operationally complex; versioned URLs (content-hashing the filename) make invalidation unnecessary at the cost of requiring deployment coordination.
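Content-hashing the filename can be sketched in a few lines. The helper name `versioned_name` and the 8-character digest length are assumptions; real build tools differ in the details.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path, content, digest_len=8):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    # "static/app.js" becomes "static/app.<digest>.js"
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old_url = versioned_name("static/app.js", b"console.log('v1');")
new_url = versioned_name("static/app.js", b"console.log('v2');")
```

Because any change to the bytes produces a new URL, such assets are commonly served with `Cache-Control: public, max-age=31536000, immutable`. The coordination cost is that the HTML referencing the new URL must be deployed after the asset is available.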
Routing: anycast vs GeoDNS
Anycast announces the same IP from every POP via BGP — the internet's routing protocol naturally sends each user to the topologically nearest POP, and failover is automatic when a POP drops its route. GeoDNS returns different A records based on the resolver's IP — simpler to understand but requires a mapping database and reacts slowly (DNS TTL is 30–300s). Most modern CDNs use anycast for the edge IP and GeoDNS for fallback or multi-CDN routing.
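The GeoDNS half of this can be sketched as "pick the closest healthy POP for the resolver's location". Everything in the table below is hypothetical (POP names, coordinates, documentation-range IPs), and the sketch assumes the resolver's IP has already been mapped to a latitude/longitude by a GeoIP database.

```python
import math

# Hypothetical POP table: name -> (latitude, longitude, IP to return).
POPS = {
    "fra": (50.11, 8.68, "192.0.2.10"),
    "iad": (38.95, -77.45, "192.0.2.20"),
    "sin": (1.35, 103.99, "192.0.2.30"),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve(resolver_lat, resolver_lon, healthy=None):
    """Return (pop_name, ip) for the nearest POP that is marked healthy."""
    names = [n for n in POPS if healthy is None or n in healthy]
    if not names:
        raise LookupError("no healthy POP available")
    best = min(
        names,
        key=lambda n: haversine_km(resolver_lat, resolver_lon, POPS[n][0], POPS[n][1]),
    )
    return best, POPS[best][2]
```

Geographic distance is only a proxy for network distance, and clients keep using the old answer until the DNS TTL expires. That is the slow-failover cost described above, which anycast avoids by withdrawing a BGP route instead.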
Cache hierarchy: edge, mid-tier, origin
A two-tier hierarchy (edge POP → regional mid-tier → origin) dramatically improves hit rates for long-tail content: edge POPs are numerous and small, so individually their hit rates are low; a mid-tier shield (Origin Shield in CloudFront, shielding in Fastly, Tiered Cache in Cloudflare) consolidates misses before they reach origin, often lifting effective hit rate from 70% to 95%+. The non-obvious cost: the mid-tier adds one hop of latency on misses, so it is worth it only when the miss rate to origin is otherwise high.
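The lift in effective hit rate follows from multiplying miss rates across tiers. A small sketch, where the function names and the 85% shield hit rate in the example are assumptions chosen to match the 70% to 95%+ figure:

```python
def origin_request_fraction(edge_hit, shield_hit):
    """Fraction of user requests that reach origin through two cache tiers."""
    for h in (edge_hit, shield_hit):
        if not 0.0 <= h <= 1.0:
            raise ValueError("hit ratios must be between 0 and 1")
    # A request reaches origin only if it misses at the edge AND at the shield.
    return (1.0 - edge_hit) * (1.0 - shield_hit)


def effective_hit_ratio(edge_hit, shield_hit):
    return 1.0 - origin_request_fraction(edge_hit, shield_hit)


# Edge POPs hit 70%; the shield absorbs 85% of what they miss (assumed).
combined = effective_hit_ratio(0.70, 0.85)
```

Under those assumptions only about 4.5% of requests reach origin, for an effective hit rate of roughly 95.5%.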
Invalidation: the distributed state problem
Purge APIs (Fastly's Instant Purge, Cloudflare Cache Purge) propagate invalidations to all POPs in 150–500ms via a control-plane fan-out; at hundreds of POPs this requires a reliable broadcast mechanism. Surrogate keys (a header on cached objects listing logical tags) allow purging all objects tagged `product-123` without knowing their URLs. The safest pattern is immutable versioned URLs for static assets (JS/CSS/images) — no invalidation needed; new deployments just produce new URLs.
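Surrogate-key purging amounts to keeping a reverse index from tag to cached URLs. The in-memory `TaggedCache` below is a hypothetical single-node sketch of that idea; it does not model any provider's API or the cross-POP fan-out.

```python
from collections import defaultdict


class TaggedCache:
    """Single-node cache with a tag -> URLs index for purge-by-tag."""

    def __init__(self):
        self._objects = {}                # url -> cached body
        self._tags_of = {}                # url -> set of tags on that object
        self._by_tag = defaultdict(set)   # tag -> set of urls carrying it

    def put(self, url, body, surrogate_keys=()):
        self.purge_url(url)  # drop stale tag links if the URL was cached before
        self._objects[url] = body
        self._tags_of[url] = set(surrogate_keys)
        for tag in surrogate_keys:
            self._by_tag[tag].add(url)

    def get(self, url):
        return self._objects.get(url)  # None means cache miss

    def purge_url(self, url):
        self._objects.pop(url, None)
        for tag in self._tags_of.pop(url, set()):
            self._by_tag[tag].discard(url)

    def purge_tag(self, tag):
        """Evict every object carrying the tag; return how many were evicted."""
        urls = list(self._by_tag.pop(tag, set()))
        for url in urls:
            self.purge_url(url)
        return len(urls)
```

For example, a product page and its JSON API response can both carry `product-123`, and one `purge_tag("product-123")` evicts both without the caller knowing either URL.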
Origin shield: collapsing the thundering herd
When a popular object expires, many concurrent requests for it miss at once, and dozens of POP caches send the same fetch to origin: the thundering herd (or dog-pile) problem. Request coalescing (also called request collapsing) holds duplicate concurrent misses at the POP or shield, sends only one request upstream, then fans the single response back to all waiters. This is table stakes for any CDN serving dynamic or semi-dynamic content with short TTLs.
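Request coalescing can be sketched as a "single-flight" wrapper: the first miss for a key becomes the leader and fetches, and later misses for the same key wait on the leader's result. The class names and the `fetch_origin` callback are assumptions, and the sketch omits TTLs and eviction.

```python
import threading


class _Flight:
    """State shared by all requests waiting on one in-progress fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class CoalescingCache:
    def __init__(self, fetch_origin):
        self._fetch = fetch_origin   # callable: key -> body (hypothetical)
        self._lock = threading.Lock()
        self._cache = {}
        self._inflight = {}          # key -> _Flight

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            flight = self._inflight.get(key)
            leader = flight is None
            if leader:
                flight = _Flight()
                self._inflight[key] = flight
        if leader:
            try:
                # Only the leader goes upstream; the lock is not held here.
                flight.value = self._fetch(key)
                with self._lock:
                    self._cache[key] = flight.value
            except Exception as exc:
                flight.error = exc   # waiters will see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                flight.done.set()
        else:
            flight.done.wait()
        if flight.error is not None:
            raise flight.error
        return flight.value
```

Errors are not cached in this sketch, so a failed fetch lets the next request retry. A production cache would add a timeout on the wait so that one slow origin response cannot stall every waiter indefinitely.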
Edge compute: logic without origin roundtrips
Cloudflare Workers, Fastly Compute@Edge, and Lambda@Edge allow running code in the CDN POP: rewriting URLs, injecting auth headers, A/B splitting, personalizing responses, all without a round-trip to origin. The key constraints: no persistent disk, a minimal CPU budget (~5–50ms), and cold starts that must be near-zero (Workers uses V8 isolates, not containers, to keep cold starts on the order of milliseconds). Moving logic to the edge reduces time-to-first-byte dramatically for personalization use cases.
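A/B splitting at the edge usually comes down to a deterministic hash of a stable identifier, so every POP assigns the same user to the same bucket without shared state. Real edge runtimes run JS or WASM; the Python below only illustrates the logic, and the function names, experiment name, and paths are hypothetical.

```python
import hashlib


def ab_bucket(user_id, experiment, treatment_percent=50):
    """Deterministically assign a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    slot = int.from_bytes(digest[:4], "big") % 100   # stable slot in 0..99
    return "treatment" if slot < treatment_percent else "control"


def rewrite_path(path, user_id):
    """Send treatment users on the home page to the variant path."""
    if path == "/" and ab_bucket(user_id, "new-home") == "treatment":
        return "/home-v2/"
    return path
```

Because the rewritten path is part of the cache key, each variant is cached separately and the split adds no origin round-trip on hits.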
What breaks at scale
- Cache poisoning: an attacker causes a POP to cache a malformed or malicious response and serve it to all users. Defending against it requires strict Vary header handling and cache key normalization (query string canonicalization).
- POP overload during a viral event: a single piece of content going viral in one geography can overwhelm a regional POP before the mid-tier absorbs the spike. CDNs handle this with load shedding, request queuing, and circuit breakers to origin.
- Inconsistent cache keys across POPs: if one POP normalizes query params and another doesn't, the same URL misses on some POPs and hits on others, making debugging nearly impossible.
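Cache key normalization can be sketched with the standard library. The list of ignored parameters is an assumption; dropping a parameter is only safe if the origin really ignores it, otherwise the cache would serve one variant's response for another.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed tracking parameters that do not change the response.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def cache_key(url):
    """Canonicalize a URL so equivalent requests share one cache entry."""
    parts = urlsplit(url)
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in IGNORED_PARAMS
    ]
    params.sort()                      # parameter order must not affect the key
    query = urlencode(params)
    host = parts.netloc.lower()        # hostnames are case-insensitive
    path = parts.path or "/"
    key = f"{parts.scheme}://{host}{path}"
    return f"{key}?{query}" if query else key
```

Running the same function on every POP is what keeps keys consistent: `https://Example.com/a?b=2&a=1&utm_source=x` and `https://example.com/a?a=1&b=2` map to the same entry everywhere.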
In production
Akamai pioneered the distributed-edge model in 1999 and still runs 4000+ POPs using consistent hashing to route requests within a POP's cache cluster. Cloudflare's Workers extended the CDN into a compute layer — every POP can run JS/WASM logic, enabling cache logic, A/B testing, and auth at the edge without hitting origin. Netflix's Open Connect takes CDN ownership to the extreme: Netflix-operated appliances sit inside ISP networks, serving video with zero public-internet hops. The real challenge is cache consistency across hundreds of POPs after a content update — a multi-second propagation window means users on different POPs see different versions of a page.
Common mistakes
- Relying on purges instead of versioned URLs
- No mid-tier (origin overload on misses)
- Ignoring per-POP cache balancing