System Design Library

Google Search

Index the web and answer queries in <200ms with ranked results.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Crawl & index
  • Query → ranked docs
  • Snippets

Non-functional

  • <200ms
  • Massive corpus
  • Fresh

Scale

Trillions of pages

The approach

Crawler → inverted index (term → posting list of docs), sharded by document; query fans out to all index shards in parallel, each returns top candidates, results merged & ranked (PageRank + signals); heavily cached.

Key components

Crawler → indexer → sharded inverted index · query aggregator → ranking · cache

Numbers that matter

Senior deep-dive

The inverted index is the data structure; the hard part is ranking — any system can build a posting list, but computing a relevance score that accounts for PageRank, query-term proximity, freshness, and personalization in <200ms across a trillion documents is where the work lives.

Scatter-gather over thousands of shards in parallel is the latency model: each shard returns its local top-k, a merger re-ranks the global top-k — the bottleneck is the slowest shard, so tail latency management (hedged requests, straggler mitigation) is structural, not optional.

Index freshness is a pipeline problem: the crawler → parser → indexer pipeline has multi-hour lag for most content; a real-time index layer (for breaking news, Twitter-style freshness) sits in front of the main index and is merged at query time.

Inverted index: the structure everyone knows, the scale nobody estimates

An inverted index maps term → posting list (doc IDs + positions + scores). At web scale, a single term like 'weather' has posting lists with billions of entries. These lists are compressed with variable-byte or PForDelta encoding and stored on SSDs; decompression at query time is the CPU bottleneck. Posting lists are sorted by doc ID to enable fast intersection (AND queries) via merge-join — this is why document ID assignment (e.g., by crawl cluster) matters for compression ratios.

Scatter-gather: tail latency is the design constraint

Querying 2,000 shards in parallel means the P99 of the slowest shard determines SERP latency. Google uses hedged requests — after a timeout threshold, the query is re-sent to a second replica of the lagging shard, and whichever responds first is used. Result merging must handle variable shard response times: set a strict deadline, use whatever results have arrived, and fill the rest with cached results or lower-quality signals. This is why Google search still returns 10 results even when shards are slow.

Ranking: the system behind the data structure

BM25 / TF-IDF gives you term-level relevance; PageRank gives you authority; neither alone gives quality. Production ranking is a multi-stage pipeline: index shards do cheap BM25 scoring to return top-1000 candidates; a learning-to-rank model (a GBDT or neural ranker) rescores the merged top-k with hundreds of features (anchor text, click-through rate, freshness, user location). The ranker runs in a single service post-merge — it's too expensive to run per shard.

Real-time index: the freshness layer

The base index has a crawl-to-index lag of hours to days for most pages. For breaking news, Twitter posts, and rapidly changing content, Google maintains a real-time index (sometimes called 'Caffeine layer') that ingests fresh crawl results in minutes. At query time, results from the real-time index are merged with the base index results and ranked together. The freshness signal itself is a ranking feature — boosting recently-crawled results for time-sensitive queries without surfacing them for evergreen queries.

Query understanding: what the query actually means

Literal keyword matching is a minority of queries. Query processing includes spell correction, synonym expansion, entity recognition, and intent classification before the query hits the index. 'Best pizza near me' is a local intent query — it routes to a geo-aware index shard, not the main web index. Query rewriting (expanding 'NYC' to 'New York City', adding synonyms) happens in a pre-processing layer; this is where significant search quality gains have come from in the last decade — the index is less important than understanding what the user meant.

What breaks at scale

Index shard imbalance is the operational nightmare: a shard serving a popular term prefix gets 10× the QPS of others. Rebalancing shards requires re-assigning document ID ranges, which invalidates posting list compression and requires re-indexing — a multi-day operation. The second failure mode: ranking model updates that degrade quality for a small query tail can be invisible in aggregate metrics; you need per-query-type quality monitoring (navigational vs. informational vs. transactional) to catch regressions. Canary serving with A/B on 0.1% of traffic before full rollout is the only safe deployment strategy.

In production

Google's serving stack is built on Bigtable for posting lists, Colossus (GFS successor) for raw storage, and a custom serving binary that runs the scatter-gather query fan-out. Bing uses a similar inverted-index + scoring pipeline with Azure infrastructure. The real engineering challenge is index serving consistency under continuous crawl updates: you can't take the index offline to update it, so Google uses a tiered index (a small real-time layer + a large base layer), merging results at query time — keeping those two layers in sync without query-time latency blowup is a continuous systems challenge.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →