Typeahead / Autocomplete
Suggest top completions as the user types, in <100ms.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Prefix → top-k suggestions
- Ranked by popularity/recency
- Update from query stream
Non-functional
- <100ms
- Handles typos (optional)
Scale
Billions of queries; huge term space
The approach
Precomputed trie where each node stores its top-k completions; serve from memory/cache at the edge. A separate pipeline aggregates query logs and rebuilds/updates the trie offline.
Key components
Edge cache → trie service · log pipeline (queue → aggregation) rebuilds trie
Numbers that matter
- A production typeahead must respond in < 100ms end-to-end including network; this means in-process memory lookup must complete in < 5ms to leave budget for network and rendering
- Google processes ~8.5 billion searches/day; typeahead fires on every keystroke, so at 5 characters per query that's ~40B prefix lookups/day — only in-memory fan-out can handle this
- A trie with top-10 completions stored per node for a 1M-term vocabulary fits in roughly ~500MB-1GB of RAM — small enough to replicate to every edge node
- Update cadence: most production typeahead systems refresh the suggestion model every 1-6 hours via an offline batch pipeline over the past 7-30 days of query logs
Senior deep-dive
The data structure is precomputed, not queried — a live DB query on every keystroke at 100ms budget is impossible; the trie (or inverted prefix index) is built offline and served from memory.
Top-K storage at every node is the key optimization: instead of traversing the subtree on query, each trie node caches its top-k completions so a query is O(prefix length), not O(subtree size).
Freshness vs. consistency is the real tension — the offline pipeline aggregating query logs has multi-hour lag; handle trending terms (< 1 hour old) with a separate fast path (a small sorted set updated in near-realtime).
Trie vs. inverted prefix index: choose your structure
A character trie is intuitive but memory-inefficient — every shared prefix is stored once, but nodes have variable fan-out requiring pointer-heavy structures. A sorted prefix table (hash of prefix → top-k list) is flat, cache-friendly, and faster to query but uses more memory for non-shared prefixes. In practice, compressed tries (DAWG/FST — Finite State Transducers) give trie-like sharing with compact binary representation — Lucene uses FSTs extensively. Pick based on vocabulary size and update frequency.
Top-K at every node: the caching trick that makes it fast
Without pre-caching, a trie query for prefix 'app' must traverse the entire 'app' subtree to find completions — O(output size). With top-K stored at each node, the query returns immediately: walk the prefix, read the cached list. The cost: every node stores K completion strings (or IDs), multiplying memory usage by K. K=10 is the industry standard — beyond 10, UI can't show them anyway. Updates require refreshing top-K bottom-up after each pipeline run.
Ranking: frequency alone is wrong
Pure frequency ranking surfaces queries that were popular years ago but are now irrelevant. Production systems blend recency-weighted frequency (exponential decay over time), personalization signals (user's past queries, location, language), and business rules (demote harmful content, boost sponsored terms). The offline pipeline computes base scores; the online serving layer applies a lightweight re-ranking using the request context. Never ship pure frequency — it will surface embarrassing results.
Handling trending terms: the fast path
The offline pipeline has a multi-hour lag — a breaking news event won't appear in suggestions for hours. The fix is a parallel trending detection pipeline: a streaming job (Flink/Kafka Streams) counts query frequency in 5-minute windows, compares to baseline, and pushes anomalies (new top-N terms) into a small Redis sorted set at each edge node. The serving layer merges these trending candidates with the static trie results. This covers the 'Taylor Swift crashes Eras Tour' moment without rebuilding the full index.
Distribution: replicate the whole thing to the edge
A single central trie cluster can't serve 40B lookups/day under 5ms. The answer is full replication to every edge PoP — the trie is small enough (500MB-1GB) to fit in RAM on every gateway host. Periodic snapshot + atomic swap (build new trie, swap pointer, old trie GC'd) enables zero-downtime updates. Partial updates (patching individual nodes) are complex and error-prone; full snapshot replacement is operationally simpler and safe.
What breaks at scale
Offensive or harmful completions are the most visible production failure — a trending but inappropriate query surfaces in suggestions globally within hours of the pipeline run. You need a blocklist applied at serving time (not baked into the trie, so you can update it instantly) and a human review pipeline for flagged terms. The second failure mode: trie update latency during high traffic — swapping a 1GB in-memory structure on thousands of nodes simultaneously causes GC pauses and memory spikes; use rolling restarts across nodes and stagger snapshot delivery.
In production
Google's suggestion system uses a distributed trie served from in-memory caches at edge PoPs, refreshed multiple times per day. Bing and DuckDuckGo use similar offline-built prefix indexes. LinkedIn's typeahead for member names is different — it must handle a dynamic entity corpus (new members join constantly), so they maintain a near-realtime prefix index updated via a CDC pipeline from their member store. The real engineering challenge is personalization: a global top-K per prefix ignores user context; adding per-user or per-locale weighting means you can't pre-cache everything, requiring a two-stage approach (global candidates + online re-ranking).
Common mistakes
- Computing top-k per request (too slow)
- No debouncing on the client
- Ignoring personalization vs global trade-off