System Design Library

Recommendation System (Netflix)

Recommend relevant items to each user from a huge catalog, fast.

Requirements

Functional

Personalized ranked list
Candidate generation + ranking
Feedback loop

Non-functional

Low-latency serving
Freshness

Scale

Hundreds of millions of users

The approach

Two-stage: offline candidate generation (collaborative filtering/embeddings) narrows millions → hundreds; an online ranking model scores candidates per request; results cached; feedback retrains models.
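
The funnel above can be sketched in a few lines. This is a minimal illustration, not a production design: the candidate store is a plain dictionary standing in for a precomputed store, and `recommend`, `candidate_store`, and `toy_score` are hypothetical names invented for the example.

```python
from typing import Callable, Dict, List


def recommend(
    user_id: str,
    candidates_by_user: Dict[str, List[str]],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> List[str]:
    """Two-stage funnel: look up precomputed candidates, then rank them online."""
    # Stage 1 (computed offline): a small candidate set already narrowed from the catalog.
    candidates = candidates_by_user.get(user_id, [])
    # Stage 2 (online): score only those candidates and keep the best top_n.
    ranked = sorted(candidates, key=lambda item: score(user_id, item), reverse=True)
    return ranked[:top_n]


# Toy data: hypothetical item IDs and a hypothetical score table.
candidate_store = {"u1": ["a", "b", "c", "d"]}
toy_scores = {("u1", "a"): 0.2, ("u1", "b"): 0.9, ("u1", "c"): 0.5, ("u1", "d"): 0.7}


def toy_score(user_id: str, item_id: str) -> float:
    """Stand-in for the online ranking model."""
    return toy_scores.get((user_id, item_id), 0.0)


top = recommend("u1", candidate_store, toy_score)
```

The expensive scoring function only ever sees the candidate list, never the full catalog, which is the point of the split.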

Key components

Batch pipelines (features/embeddings) → candidate store · online ranker · feature store · cache

Numbers that matter

Netflix's catalog has ~15,000–17,000 titles globally — candidate generation typically narrows this to ~200–500 candidates per user; the ranking model then scores those to produce the final ~40–80 displayed items.
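Collaborative filtering at Netflix scale (200M+ users × 17k items) uses matrix-factorization embeddings of ~64–256 dimensions per user and item. At float32, the user embedding table alone is ~50–200 GB (200M × 64–256 × 4 bytes), so it has to be sharded or served from a low-latency store; the item table is under 20 MB and fits easily in memory on every serving node for fast ANN lookup.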
The online ranking model inference budget is typically <20ms to fit within a <100ms total API response — this constrains model size to ~10M–100M parameters for synchronous serving; larger models run offline.
A/B testing a new recommendation algorithm across 1% of Netflix's user base (~2M users) for 2 weeks is the standard validation window — shorter windows miss weekly viewing cycle effects (people watch differently on weekends).
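
The embedding-size figure above is simple arithmetic and can be checked directly. The sketch below assumes float32 values (4 bytes each) and decimal gigabytes; `embedding_table_gb` is a hypothetical helper name.

```python
def embedding_table_gb(rows: int, dims: int, bytes_per_value: int = 4) -> float:
    """Size of a dense embedding table in decimal gigabytes (assumes float32 by default)."""
    return rows * dims * bytes_per_value / 1e9


# 200M users at the low and high end of the 64-256 dimension range.
user_table_low_gb = embedding_table_gb(200_000_000, 64)
user_table_high_gb = embedding_table_gb(200_000_000, 256)

# 17k catalog items at 256 dimensions, for comparison.
item_table_gb = embedding_table_gb(17_000, 256)
```

Under these assumptions the user table lands at roughly 51 to 205 GB while the item table is about 0.017 GB, so the memory pressure comes almost entirely from the user side.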

Senior deep-dive

The two-stage funnel (candidate generation → ranking) exists because you cannot run an expensive ranking model over your entire catalog — candidate generation is about recall at massive scale, ranking is about precision at human scale.

Collaborative filtering captures 'people like you also liked' but cold-starts badly on new users and items — hybrid systems blend collaborative signals with content-based features to handle both. The ranking model is where the real business logic lives — CTR prediction, diversity penalties, freshness boosts, and business rules (promoted content, legal restrictions) all get applied here, not in candidate generation.

Candidate generation: recall is the only metric that matters

At candidate generation, missing a relevant item is the cardinal sin; precision can be fixed in ranking. The goal is to recall the ~500 most relevant items out of millions in <10 ms. In practice that means ANN (Approximate Nearest Neighbor) search over pre-computed item embeddings, using the user's embedding as the query, because exact kNN over a multi-million-item catalog is too slow to run per request. Use an HNSW index or a FAISS index with product quantization; accept a 5–10% recall loss vs. exact search in exchange for roughly 100× speed.
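
To make the recall trade-off concrete, here is a minimal sketch of the exact baseline that an ANN index approximates, plus the recall metric used to compare the two. It is brute-force dot-product search in plain Python, not HNSW or FAISS; the item names and vectors are toy values, and `exact_top_k` and `recall_at_k` are hypothetical helper names.

```python
import heapq
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(
    user_vec: Sequence[float], item_vecs: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Brute-force retrieval: score every item, keep the k highest."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


def recall_at_k(approx: List[str], exact: List[str]) -> float:
    """Fraction of the exact top-k that the approximate result also returned."""
    if not exact:
        return 1.0
    return len(set(approx) & set(exact)) / len(exact)


# Toy 2-D embeddings (hypothetical values).
items = {
    "drama": [1.0, 0.0],
    "comedy": [0.0, 1.0],
    "dramedy": [0.7, 0.7],
    "horror": [-1.0, 0.0],
}
user = [0.9, 0.1]

exact = exact_top_k(user, items, k=2)
# Pretend an ANN index returned this slightly wrong answer.
approx = ["drama", "comedy"]
recall = recall_at_k(approx, exact)
```

Measuring an ANN index against this kind of exact baseline on a sample of users is one way to verify that the recall loss stays inside the budget you accepted.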

The cold-start problem: new users, new items

Collaborative filtering fails completely for new users (no watch history) and new items (no interaction data). For new users: fall back to popularity-based recommendations (top-N globally or by region) until you have enough signal (~3–5 interactions). For new items: use content-based features (genre, director, cast embeddings from NLP/image models) to find similar established items. The transition from content-based to collaborative is gradual — blend signals by confidence score (interaction count acts as a prior weight).
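
One simple way to express "interaction count acts as a prior weight" is a shrinkage-style blend. The formula and the `prior_strength` constant below are assumptions chosen for illustration; the text does not specify the exact weighting.

```python
def blend_score(
    collaborative_score: float,
    content_score: float,
    interaction_count: int,
    prior_strength: float = 5.0,
) -> float:
    """Blend two scores; trust the collaborative one more as interactions accumulate.

    weight = n / (n + prior_strength), so weight is 0 with no interactions
    and approaches 1 as n grows. prior_strength is a hypothetical tuning knob.
    """
    weight = interaction_count / (interaction_count + prior_strength)
    return weight * collaborative_score + (1.0 - weight) * content_score


cold = blend_score(0.8, 0.4, interaction_count=0)     # purely content-based
warm = blend_score(0.8, 0.4, interaction_count=5)     # half and half
hot = blend_score(0.8, 0.4, interaction_count=5000)   # almost purely collaborative
```

With `prior_strength=5.0`, an item or user with five interactions gets an even split, which roughly lines up with the 3 to 5 interaction threshold mentioned above.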

Feature store: the ranking model's nervous system

The ranking model scores 500 candidates per user using features such as the user's last 5 watched genres, time since last watch, device type, and time of day. These features must be retrievable in <5 ms at serving time. A real-time feature store (Flink computes streaming aggregates → Redis/DynamoDB for low-latency reads) provides fresh user features. Stale features are a subtle bug: a model trained on fresh features but served stale ones (e.g. a last-watch value from yesterday) degrades recommendation quality in ways that are hard to diagnose.
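
Because stale features fail silently, it can help to store a write timestamp with every feature and flag staleness at read time. The sketch below uses an in-memory dictionary as a stand-in for Redis or DynamoDB; the class, method names, and the one-hour freshness limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class FeatureValue:
    value: float
    updated_at: float  # Unix seconds when the streaming job last wrote this value


class InMemoryFeatureStore:
    """Stand-in for a low-latency key-value store; not a real client."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], FeatureValue] = {}

    def put(self, user_id: str, name: str, value: float, now: float) -> None:
        self._data[(user_id, name)] = FeatureValue(value, now)

    def get(
        self, user_id: str, name: str, now: float, max_age_s: float
    ) -> Tuple[Optional[float], bool]:
        """Return (value, is_stale). A missing feature counts as stale."""
        entry = self._data.get((user_id, name))
        if entry is None:
            return None, True
        return entry.value, (now - entry.updated_at) > max_age_s


store = InMemoryFeatureStore()
store.put("u1", "hours_since_last_watch", 3.0, now=1_000.0)

fresh = store.get("u1", "hours_since_last_watch", now=1_600.0, max_age_s=3_600.0)
stale = store.get("u1", "hours_since_last_watch", now=1_000.0 + 86_400.0, max_age_s=3_600.0)
missing = store.get("u1", "device_type_id", now=1_600.0, max_age_s=3_600.0)
```

Emitting the `is_stale` flag as a metric turns an invisible quality regression into something a dashboard can alert on.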

Diversity, freshness, and business rules in the ranker

A pure ML ranking model maximizes predicted CTR, but CTR-optimal results can be a filter bubble (same genre forever) or dominated by clickbait thumbnails. Production rankers apply post-ML adjustments: diversity penalties (no more than 2 items from the same genre in the top 10), freshness boosts (new releases get a multiplier), and hard business rules (geo-restrictions, content ratings). These rules live in the ranking service as deterministic overrides applied after ML scoring; keeping them out of the model makes them auditable and fast to change.
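
A minimal sketch of such a post-ML pass is shown below, assuming the model has already produced a score per candidate. The `Candidate` fields, the 1.2 freshness multiplier, and the greedy per-genre cap are illustrative choices, not a description of any particular production ranker.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Candidate:
    item_id: str
    genre: str
    score: float   # output of the ML ranking model
    is_new: bool   # new release flag


def rerank(
    candidates: List[Candidate],
    blocked_ids: Set[str],
    max_per_genre: int = 2,
    top_n: int = 10,
    freshness_boost: float = 1.2,
) -> List[str]:
    """Apply hard rules, then a freshness multiplier, then a per-genre cap."""
    # Hard business rules first: blocked items never appear, whatever their score.
    allowed = [c for c in candidates if c.item_id not in blocked_ids]
    # Freshness boost: new releases get a score multiplier.
    ordered = sorted(
        allowed,
        key=lambda c: c.score * (freshness_boost if c.is_new else 1.0),
        reverse=True,
    )
    # Diversity cap: greedily skip items once a genre has reached its quota.
    result: List[str] = []
    per_genre: Dict[str, int] = {}
    for c in ordered:
        if per_genre.get(c.genre, 0) >= max_per_genre:
            continue
        result.append(c.item_id)
        per_genre[c.genre] = per_genre.get(c.genre, 0) + 1
        if len(result) == top_n:
            break
    return result


pool = [
    Candidate("a", "drama", 0.90, False),
    Candidate("b", "drama", 0.80, False),
    Candidate("c", "drama", 0.70, False),
    Candidate("d", "comedy", 0.60, True),
    Candidate("e", "horror", 0.95, False),  # geo-restricted in this example
    Candidate("f", "comedy", 0.50, False),
]
final = rerank(pool, blocked_ids={"e"})
```

Here the highest-scoring item is dropped by the geo rule, the new comedy overtakes the third drama thanks to the boost, and the third drama is then excluded by the genre cap. Each of those outcomes is traceable to one explicit line of code, which is what makes the rules auditable.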

Offline training → online serving: the pipeline contract

The ML model is trained offline on historical interaction data (watches, ratings, skips) via a batch pipeline (Spark on a data warehouse). The trained model artifact (embedding weights + ranking model weights) is published to a model registry and deployed to serving nodes. The critical contract: training features must exactly match serving features — if training uses 'watch count in last 7 days' but serving computes 'watch count in last 24 hours' due to a feature store difference, the model's predictions are systematically wrong (training-serving skew).
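
One common way to hold this contract is to define each feature once and call the same function from both the training pipeline and the serving path. The sketch below illustrates that idea; the function names and the `watch_count_7d` feature key are hypothetical.

```python
from typing import Dict, Iterable

SECONDS_PER_DAY = 86_400


def watch_count(watch_timestamps: Iterable[float], now: float, window_days: int = 7) -> int:
    """Single source of truth: watches in the window (now - window_days, now]."""
    cutoff = now - window_days * SECONDS_PER_DAY
    return sum(1 for ts in watch_timestamps if cutoff < ts <= now)


def training_features(history: Iterable[float], label_time: float) -> Dict[str, int]:
    """Offline path: features as of the moment the training label was observed."""
    return {"watch_count_7d": watch_count(history, label_time)}


def serving_features(history: Iterable[float], request_time: float) -> Dict[str, int]:
    """Online path: the same definition, evaluated at request time."""
    return {"watch_count_7d": watch_count(history, request_time)}


now = 1_000_000.0
history = [now - 3_600, now - 3 * SECONDS_PER_DAY, now - 10 * SECONDS_PER_DAY]

count_7d = watch_count(history, now)                   # the agreed definition
count_24h = watch_count(history, now, window_days=1)   # the skewed variant from the text
```

The two windows give different counts for the same user, which is exactly the mismatch the model would silently absorb if serving drifted to the 24-hour definition.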

What breaks at scale

Embedding table memory explosion: as your catalog grows from 10k to 10M items, the item embedding table grows proportionally. At 256 dimensions × 4 bytes × 10M items ≈ 10 GB just for item embeddings, all of which must sit in serving-node RAM. Fix: use product quantization (compress each embedding from 256-D float32, i.e. 1,024 bytes, to 64 bytes, a 16× reduction, with <5% recall loss) and sharding (split the embedding table across nodes so each node holds only its assigned shard). Second failure: model staleness during viral events. A show gets a major press mention and everyone wants it, but the batch-trained model from yesterday doesn't know this. Add a real-time trending signal as an explicit feature rather than waiting for the next day's retraining.
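
A real-time trending signal can be as simple as a sliding-window play count per title. The sketch below is a single-process illustration that assumes timestamps arrive in non-decreasing order; in a real system a stream processor would maintain this aggregate. `TrendingCounter` and the one-hour window are hypothetical.

```python
from collections import deque
from typing import Deque


class TrendingCounter:
    """Counts plays of one title inside a sliding time window."""

    def __init__(self, window_s: float = 3_600.0) -> None:
        self.window_s = window_s
        self._events: Deque[float] = deque()

    def record_play(self, ts: float) -> None:
        # Assumes events are recorded in non-decreasing timestamp order.
        self._events.append(ts)

    def plays_in_window(self, now: float) -> int:
        """Evict events at or before (now - window_s), then count the rest."""
        cutoff = now - self.window_s
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)


counter = TrendingCounter()
for ts in (0.0, 100.0, 3_500.0):
    counter.record_play(ts)

plays_at_one_hour = counter.plays_in_window(now=3_600.0)
plays_at_two_hours = counter.plays_in_window(now=7_200.0)
```

Feeding this count to the ranker as a feature lets yesterday's model react to today's spike without retraining.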

In production

Netflix uses ALS (Alternating Least Squares) for batch collaborative filtering and two-tower neural networks for real-time candidate retrieval (one tower encodes the user, one encodes items; dot-product similarity enables ANN lookup). Spotify's Discover Weekly pipeline combines matrix factorization with audio content embeddings to handle new releases (items with no collaborative signal yet). YouTube's recommendation system (described in their 2016 paper, "Deep Neural Networks for YouTube Recommendations") pioneered the two-stage deep neural network approach: a retrieval network (ANN over learned embeddings) followed by a ranking network with richer features. The real engineering challenge is feature freshness: the ranking model needs up-to-date user signals (what did they watch in the last hour?) from a real-time feature store (commonly built on a stream processor plus a low-latency key-value store, e.g. Flink + Redis), not batch-computed features from yesterday.

Common mistakes

Ranking the full catalog online
No feature store (training/serving skew)
Ignoring the feedback/retraining loop

Part of System Design Library on SystemLore.