Semantic Search
Search a large corpus by meaning, not just keywords — with exact-match recall preserved.
Requirements
Functional
- Semantic + keyword retrieval
- Metadata + ACL filters
- Rerank top results
- Relevance feedback loop
Non-functional
- Low, predictable p99 latency
- Tunable recall vs cost
- Multi-tenant isolation
Scale
10M+ chunks, hundreds to thousands of QPS
The approach
Chunk documents (256–512 tokens, slight overlap) and index each chunk two ways: a dense embedding in an ANN index (HNSW/IVF) and sparse terms in BM25. A query embeds once, runs both retrievers, and fuses the two ranked lists with Reciprocal Rank Fusion (RRF) — which needs no score calibration. Apply metadata/ACL filters during retrieval, then rerank the top ~50–200 candidates with a cross-encoder for final precision. Tune in this order: chunking → hybrid weight → reranker. Measure recall@k and nDCG against a judged set, never by eyeballing.
Key components
Chunker · embedding model · ANN/vector index (HNSW) · BM25 index · RRF fusion · cross-encoder reranker · metadata/ACL filter · eval harness
Numbers that matter
- Embeddings are 768–1536 dims, or ≈ 3–6 KB each when stored as float32. 10M chunks ≈ 30–60 GB raw; an HNSW index adds ~1.5–2× for graph links. Budget RAM, not just disk.
- HNSW returns ~95–98% recall at ~1–5 ms/query; a cross-encoder reranking 100 candidates adds ~20–80 ms on GPU — it usually dominates p99.
- Hybrid + RRF typically lifts recall@10 by ~5–15 points over dense-only on mixed keyword/semantic traffic; reranking is often the single biggest precision win.
- Rerank 50–200 candidates, not 1000 — quality plateaus fast while latency grows linearly. Cache query embeddings; queries repeat more than you think.
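The sizing figures above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming float32 storage (4 bytes per dimension) and decimal gigabytes; the helper names are illustrative, and reading the ~1.5–2× HNSW figure as a multiplier on the raw size is an assumption.

```python
def embedding_bytes(dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of one vector; float32 (4 bytes per value) is assumed."""
    return dims * bytes_per_value


def raw_corpus_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_chunks * embedding_bytes(dims, bytes_per_value) / 1e9


NUM_CHUNKS = 10_000_000
for dims in (768, 1536):
    raw = raw_corpus_gb(NUM_CHUNKS, dims)
    # Assumption: treat the HNSW overhead as a 1.5x to 2x multiplier on raw size.
    low, high = raw * 1.5, raw * 2.0
    print(f"{dims} dims: {embedding_bytes(dims)} B/vector, "
          f"{raw:.1f} GB raw, roughly {low:.0f}-{high:.0f} GB with graph links")
```

Rerunning it with a different `bytes_per_value` (2 for float16, 1 for int8) shows how much headroom a lower-precision format would buy before reaching for product quantization.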
Senior deep-dive
Hybrid (dense + sparse) is the production default, not an optimization — pure dense search silently misses exact IDs, error codes, and rare tokens.
The rest is a precision-vs-latency budget: filter before ranking (post-filtering leaks across tenants), rerank only the top candidates (the biggest precision win, and your latency sink), and fix chunking before the prompt (it quietly decides recall).
None of it is tunable without a judged set — build one from clicks, then move one knob at a time against recall@k / nDCG.
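A toy illustration of why the filter belongs before ranking. The corpus, tenant names, and scores below are made up for the example, and a real system would push the filter into the ANN or BM25 query rather than into Python lists.

```python
# Hypothetical toy corpus: (doc_id, tenant, similarity to some query).
CORPUS = [
    ("a1", "acme", 0.95), ("a2", "acme", 0.93), ("a3", "acme", 0.91),
    ("b1", "beta", 0.60), ("b2", "beta", 0.55), ("b3", "beta", 0.50),
]


def top_k(rows, k):
    """Return the k highest-scoring rows."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:k]


def post_filter(tenant, k):
    """Retrieve the global top-k first, then drop rows the caller may not see."""
    return [r for r in top_k(CORPUS, k) if r[1] == tenant]


def pre_filter(tenant, k):
    """Restrict to the caller's rows first, then take the top-k."""
    return top_k([r for r in CORPUS if r[1] == tenant], k)


print("post-filter:", post_filter("beta", 3))
print("pre-filter: ", pre_filter("beta", 3))
```

With post-filtering, every one of the three slots is taken by another tenant's documents before the filter runs, so the caller gets an empty list even though three matching documents exist. Any code path that forgets the final filter step would also return those other-tenant rows directly.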
Chunking is the hidden variable — fix it before the model
Fixed-size chunks split mid-sentence and mid-table; recursive splitting on structure (headings, paragraphs, code blocks) keeps ideas intact. "Late chunking" — embed the full document, then pool per chunk — lets each chunk inherit surrounding context and lifts recall on long docs. Overlap (~10–20%) trades index size for recall. When retrieval is bad, this is almost always the first thing to fix, not the prompt.
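A minimal sketch of structure-aware chunking with overlap. It simplifies heavily: paragraphs (blank-line separated) are the only structure it respects, whitespace-separated words stand in for model tokens, and `chunk_text` with its parameters is an illustrative name, not a library function.

```python
import re


def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words,
    carrying the last `overlap` words of each chunk into the next one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap

    # Split on blank lines; a paragraph longer than `step` words falls back
    # to fixed-size pieces so that every unit fits next to a carried overlap.
    units: list[list[str]] = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        for i in range(0, len(words), step):
            units.append(words[i:i + step])

    chunks: list[str] = []
    current: list[str] = []
    has_new_words = False  # True once `current` holds more than carried overlap
    for unit in units:
        if has_new_words and len(current) + len(unit) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            has_new_words = False
        current = current + unit
        has_new_words = True
    if has_new_words:
        chunks.append(" ".join(current))
    return chunks


doc = "w1 w2 w3 w4\n\nw5 w6 w7 w8\n\nw9 w10 w11 w12"
for chunk in chunk_text(doc, max_tokens=8, overlap=2):
    print(chunk)
```

A production chunker would count tokens with the embedding model's own tokenizer and split on headings and code blocks before falling back to paragraphs; the packing-plus-overlap logic stays the same.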
Hybrid is the default — but tune the fusion, not just the weights
RRF fuses the two ranked lists by rank, so it needs no score normalization and is robust across query types — the safe default. Weighted score fusion can beat it but is fragile unless you carefully normalize dense cosine and BM25 scores. The balance is traffic-dependent: lean sparse for codes, IDs, and jargon; lean dense for natural-language questions. Measure per query-type — an aggregate number hides that you're losing every exact-match query.
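For reference, RRF scores each document as the sum of 1 / (k + rank) over every list it appears in. The sketch below uses k = 60, the commonly used constant; the function name and the sample document IDs are illustrative.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    Only ranks are used, so dense cosine scores and BM25 scores never
    need to be put on a common scale.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["d3", "d1", "d7"]   # hypothetical ANN results, best first
sparse_hits = ["d1", "d9", "d3"]  # hypothetical BM25 results, best first
for doc_id, score in rrf_fuse([dense_hits, sparse_hits]):
    print(doc_id, round(score, 5))
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that appear in only one, even when a single retriever ranked the latter higher. Leaning sparse or dense can be approximated by multiplying each list's contribution by a per-retriever weight, which is an extension beyond plain RRF.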
Reranking: why a cross-encoder pays for itself
A bi-encoder embeds query and document separately — fast, but blind to their interaction. A cross-encoder reads query + document together and scores relevance directly, catching matches dense retrieval ranks low. That is the biggest precision win, but it is O(candidates): rerank 50–200, batch on GPU, cache. When p99 hurts, distill to a smaller cross-encoder or use a late-interaction model (ColBERT) as a middle ground.
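The rerank stage can be sketched independently of any particular model. Here `score_batch` is a placeholder for a real cross-encoder call, and `toy_overlap_scorer` is a deliberately crude stand-in (word overlap) so the example runs without a model; both names are assumptions for illustration.

```python
from typing import Callable, Sequence

ScoreBatch = Callable[[list[tuple[str, str]]], list[float]]


def rerank(query: str, candidates: Sequence[str], score_batch: ScoreBatch,
           top_n: int = 10, max_candidates: int = 100,
           batch_size: int = 32) -> list[tuple[str, float]]:
    """Score (query, doc) pairs jointly and return the top_n documents.

    Cost grows linearly with the pool size, so the pool is capped at
    max_candidates and scored in batches.
    """
    pool = list(candidates[:max_candidates])
    scores: list[float] = []
    for start in range(0, len(pool), batch_size):
        batch = pool[start:start + batch_size]
        scores.extend(score_batch([(query, doc) for doc in batch]))
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    results = []
    for query, doc in pairs:
        query_words = set(query.lower().split())
        doc_words = set(doc.lower().split())
        results.append(len(query_words & doc_words) / len(query_words) if query_words else 0.0)
    return results


docs = ["reset your password here", "billing faq", "password reset error code E42"]
print(rerank("password reset E42", docs, toy_overlap_scorer, top_n=2))
```

Keeping the scorer behind a plain callable makes it straightforward to swap in a distilled model or a late-interaction scorer later without touching the capping and batching logic.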
Query understanding is the cheapest big lever
Short, ambiguous, or keyword-dumped queries retrieve badly no matter how good the index is. Query rewriting (resolve pronouns, fold in context), multi-query (union several paraphrases), and HyDE (embed an LLM-drafted hypothetical answer instead of the raw question) all lift recall on hard queries for a small, often-skipped cost. Reach for this before a bigger embedding model.
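A sketch of the multi-query idea: retrieve for the original query plus several paraphrases, then union the results. The `paraphrase` and `retrieve` callables are hypothetical hooks (in practice an LLM call and the hybrid retriever); the toy versions below exist only to make the example runnable.

```python
from typing import Callable


def multi_query_retrieve(query: str,
                         paraphrase: Callable[[str], list[str]],
                         retrieve: Callable[[str, int], list[str]],
                         k: int = 10) -> list[str]:
    """Union the results of the original query and its paraphrases,
    ordering documents by the best rank they reached in any variant."""
    variants = [query] + [v for v in paraphrase(query) if v != query]
    best_rank: dict[str, int] = {}
    for variant in variants:
        for rank, doc_id in enumerate(retrieve(variant, k), start=1):
            if doc_id not in best_rank or rank < best_rank[doc_id]:
                best_rank[doc_id] = rank
    return sorted(best_rank, key=lambda doc_id: (best_rank[doc_id], doc_id))


# Toy stand-ins: a fixed paraphrase list and a lookup-table "retriever".
TOY_INDEX = {
    "it keeps crashing": ["d5"],
    "application crashes on startup": ["d2", "d5"],
    "app crash fix": ["d8", "d2"],
}


def toy_paraphrase(query: str) -> list[str]:
    return ["application crashes on startup", "app crash fix"]


def toy_retrieve(query: str, k: int) -> list[str]:
    return TOY_INDEX.get(query, [])[:k]


print(multi_query_retrieve("it keeps crashing", toy_paraphrase, toy_retrieve))
```

The vague original query finds one document; the paraphrases surface two more. The union would normally be passed to the reranker, which decides the final order.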
You can't tune what you can't measure — build the judged set
Mine (query, clicked-result) pairs from logs, hand-label a few hundred into graded relevance, and hold out a test slice. recall@k asks "was the right doc retrieved at all?"; nDCG asks "was it ranked near the top?" — they fail differently, so track both, per segment. Then change one knob at a time; the moment you tune by eyeballing a few queries, you ship regressions you can't see.
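Both metrics are short enough to write out. The sketch below uses the linear-gain form of DCG (grade divided by log2 of position + 1); a 2^grade - 1 gain is another common variant, so pick one and keep it fixed across experiments. The function names and sample judgments are illustrative.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(gain / math.log2(position + 2)
               for position, gain in enumerate(gains[:k]))


def ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    gains = [grades.get(doc_id, 0.0) for doc_id in retrieved]
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


grades = {"d1": 3, "d2": 1}          # hypothetical graded judgments for one query
retrieved = ["d2", "d9", "d1"]       # hypothetical system output, best first
print("recall@3:", recall_at_k(retrieved, set(grades), 3))
print("nDCG@3:  ", round(ndcg_at_k(retrieved, grades, 3), 4))
```

In this example recall@3 is perfect because both judged documents were retrieved, while nDCG@3 is well below 1 because the best document sits in third place. That gap is the "they fail differently" point in practice.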
What breaks at scale
Past ~100M chunks the ANN index dominates RAM and rebuild time. Shard by tenant or topic, switch HNSW → IVF+PQ to quantize vectors (4–32× smaller) and accept a recall hit you claw back with reranking. Freshness fights the immutable graph: buffer recent writes in a small flat index searched in parallel, and merge on a schedule. At this scale retrieval is a distributed-systems problem, not an ML one.
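The freshness pattern (a small flat buffer next to the main index) might look like the sketch below. It assumes the main index is reachable through a `main_search(query_vec, k)` callable that returns `(score, doc_id)` pairs on the same cosine scale as the buffer; that callable, the class name, and the sample data are all hypothetical. The two searches run one after the other here for simplicity, where a real service would issue them concurrently.

```python
import heapq
import math
from typing import Callable

Hit = tuple[float, str]  # (similarity, doc_id)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FlatBuffer:
    """Brute-force index for recent writes; small enough to scan fully."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float]) -> None:
        self.vectors[doc_id] = vector

    def search(self, query_vec: list[float], k: int) -> list[Hit]:
        return heapq.nlargest(
            k, ((cosine(query_vec, vec), doc_id) for doc_id, vec in self.vectors.items()))


def search_with_fresh(query_vec: list[float],
                      main_search: Callable[[list[float], int], list[Hit]],
                      buffer: FlatBuffer, k: int) -> list[Hit]:
    """Merge main-index hits with buffer hits, keeping the best score per doc."""
    best: dict[str, float] = {}
    for score, doc_id in main_search(query_vec, k) + buffer.search(query_vec, k):
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in best.items()))


def toy_main_search(query_vec: list[float], k: int) -> list[Hit]:
    """Stand-in for the large ANN index, returning fixed results."""
    return [(0.9, "old1"), (0.4, "old2")][:k]


buffer = FlatBuffer()
buffer.add("new1", [1.0, 0.0])  # written after the last index build
print(search_with_fresh([1.0, 0.0], toy_main_search, buffer, k=2))
```

On the scheduled merge, buffered vectors are folded into the rebuilt main index and the buffer is cleared, so the brute-force scan stays cheap.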
In production
This shape is near-universal. Glean and Perplexity both run hybrid retrieval followed by a reranking stage; Glean enforces per-user document ACLs during retrieval so results never cross permission boundaries. Elasticsearch/OpenSearch expose BM25 + dense vectors with built-in RRF in one query, and vector stores like Pinecone and Weaviate ship hybrid search with metadata filtering. pgvector covers the dense half inside Postgres; hybrid search there means pairing it with Postgres full-text search and ordinary SQL filters. The cross-cutting pattern: cheap broad recall (dense + sparse) → filter → expensive precise rerank on a small candidate set.
Common mistakes
- Dense-only → silently misses exact-match / rare-token queries
- Filtering after retrieval → short, empty, or cross-tenant results
- Reranking 1000+ candidates → p99 blows up for little gain
- Chunks too large → diluted embeddings and weak recall
- No judged set → tuning by vibes, regressions ship unnoticed