System Design Academy

39 system design concepts explained with problems, tradeoffs, production examples, and interview traps.

Caching & CDNs

Cache stampede & hot keys: A popular cached key expires and suddenly thousands of requests all miss and hammer the DB at once.
Cache-aside, write-through, write-back: When exactly do you write to the cache vs the database? Get it wrong and you serve stale data or lose writes.
CDNs & the edge: Your server is in Virginia; your user is in Tokyo. The distance alone adds 150ms+ of round-trip latency.
TTL & eviction: Cache memory is finite and data goes stale. What do you keep, and for how long?
Why cache at all: Reading the same data from a database a million times is wasteful and slow.

Consistency & Consensus

2PC, 3PC & the blocking problem: Make N machines commit a transaction all-or-nothing, when any of them can crash mid-handshake.
CAP & PACELC: When the network splits a distributed system in two, you can't have everything. What do you give up?
Consensus: Raft & Paxos: Many machines must agree on one value/log order, surviving crashes and partitions (provably hard, per FLP).
Isolation levels & write skew: Run transactions concurrently for speed and they corrupt each other in subtle ways.
Quorums (R + W > N): With N replicas, how many must respond to a read/write so you never read stale data?
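
The quorum condition R + W > N named above can be checked by brute force for small clusters. The sketch below is only an illustration: the function name `quorums_always_overlap` is hypothetical, and it models nothing beyond set overlap (no failures, clocks, or concurrent writes).

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Return True if every possible write set of size w shares at least
    one replica with every possible read set of size r, out of n replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Compare the brute-force answer with the R + W > N rule for a few settings.
for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    print(f"N={n} R={r} W={w} overlap={quorums_always_overlap(n, r, w)} rule={r + w > n}")
```

When the two columns agree, every read set contains at least one replica that acknowledged the latest write, which is the overlap the rule is meant to guarantee.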

Databases & Replication

Connection pooling: Opening a fresh DB connection per request is expensive, and databases cap how many they'll accept.
Indexing: Finding one row in a billion by scanning them all is hopelessly slow.
Replication & read replicas: One database serving every read will eventually drown under read traffic.
SQL vs NoSQL: Relational databases are battle-tested but can be hard to scale horizontally. When do you reach for something else?

Fundamentals

Latency vs throughput: People say "make it fast" but mean two different things that you optimize in opposite ways.
Statelessness: If a user's session lives in one server's memory, you can never safely add a second server.
The request/response model: Everything online is one machine asking another for something. If you don't understand that round trip, nothing else makes sense.
Vertical vs horizontal scaling: Your one server is maxed out. Do you buy a bigger one, or buy more of them?

Messaging & Async

Backpressure & DLQs: Producers can outpace consumers; the queue grows without bound and everything falls over.
Idempotency & exactly-once: Networks retry. The same "charge card" message can arrive twice. Now you've double-charged.
Message queues: If service A calls service B synchronously, B's slowness or outage becomes A's problem, and bursts overwhelm B.
Pub/sub & fan-out: One event needs to reach many independent consumers without the producer knowing them all.

Partitioning & Sharding

Choosing a shard key: A bad shard key concentrates load on one shard and ruins the whole point.
Consistent hashing: With naive hashing (key % N), changing the number of nodes reshuffles almost all keys, which is a disaster.
Rebalancing & hot shards: Data grows unevenly; one shard becomes a hotspot while others idle.
Why shard: A single database, however big, has a ceiling on storage, writes, and memory.
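
The claim above about naive hashing (key % N) is easy to measure. This is a minimal sketch under stated assumptions: the helper names `shard_for` and `fraction_moved`, the `user:<i>` key format, and the choice of SHA-256 as the stable hash are all illustrative, not taken from the Academy material.

```python
import hashlib


def shard_for(key: str, num_nodes: int) -> int:
    """Naive placement: a stable hash of the key, modulo the node count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose node assignment changes after resizing."""
    moved = sum(
        1 for k in keys if shard_for(k, old_nodes) != shard_for(k, new_nodes)
    )
    return moved / len(keys)


# Hypothetical workload: 10,000 keys, cluster grows from 10 to 11 nodes.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, 10, 11)
print(f"{moved:.1%} of keys changed nodes going from 10 to 11 nodes")
```

With modulo placement the large majority of keys land on a different node after adding just one. An ideal scheme would move only about 1/11 of them in this case, which is the gap consistent hashing aims to close.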

Real-World System Designs

Design a chat app: Real-time 1:1 and group messaging with ordering and offline delivery, scaling to millions of live connections.
Design a news feed (Twitter): Deliver each of 300M users a timeline of recent posts from everyone they follow, instantly, including celebrities with 100M followers.
Design a URL shortener: Map billions of long URLs to short codes and redirect in under 50ms, with reads vastly outnumbering writes.
Design ride dispatch (Uber): Match riders to nearby drivers in real time as millions of locations update every few seconds.
Design video streaming (YouTube): Upload, store, transcode, and stream video to millions globally on every device and network.

Reliability & Observability

Multi-region & DR: An entire region can go down (power, fiber, natural disaster). Running in one region means you are one outage away from total failure.
Observability: metrics, logs, traces: At scale you can't SSH in and look around. When something's slow or broken, how do you even find it?
Rate limiting & load shedding: A buggy client, a scraper, or a traffic spike can exhaust your capacity and take everyone down.
Redundancy & failover: Anything can fail: a disk, a server, a rack, a whole region. A single copy of anything is a time bomb.

Scaling & Load Balancing

Autoscaling & capacity: Traffic is spiky. Provision for the peak and you waste money; for the average and you fall over.
Balancing algorithms: "Spread the load" is vague. Spread it how?
Load balancers: You have ten app servers. How does a user's request reach a healthy, not-overloaded one?
The stateless app tier: Adding servers only helps if they're interchangeable.

Part of SystemLore.