Vector Database
Store and search billions of embeddings by similarity with low latency and tunable recall.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Upsert / delete vectors
- Top-k ANN search
- Metadata filtering
- Hybrid (dense + sparse)
Non-functional
- Tunable recall vs latency
- Horizontal scale
- Fresh inserts visible quickly
Scale
Billions of vectors, high QPS
The approach
Vectors sharded across nodes; each shard builds an ANN index (HNSW for recall, or IVF+PQ for memory/scale). Queries scatter to shards, gather top-k, and merge. Metadata filters apply pre-search to scope candidates and enforce tenancy. A separate sparse/BM25 index supports hybrid; results fuse via rank fusion.
Key components
Router · index shards (HNSW/IVF+PQ) · metadata filter · sparse index · merge/rank-fusion layer
Numbers that matter
- HNSW: raise ef_search for recall at the cost of latency — typically ~95–99% recall@10 at single-digit milliseconds.
- IVF+PQ quantization cuts memory ~4–32×, trading some recall — the lever once vectors no longer fit in RAM.
- 1B vectors × 768 dims × 4 bytes ≈ 3 TB raw — quantization and sharding stop being optional at that scale.
- Buffer recent inserts for flat/brute-force search until merged — HNSW updates are expensive, so freshness is a separate path.
Senior deep-dive
ANN is approximate by definition — the core dial is recall vs latency vs memory, not an exact lookup.
HNSW gives the best recall/latency but is RAM-heavy and costly to update; IVF+PQ scales memory and ingest at some recall cost.
Filter before ranking for correctness and tenancy — post-filtering returns short, empty, or leaky results.
Approximate, not exact — and that is the point
Exact nearest-neighbour over millions of vectors is too slow, so ANN trades a little recall for orders-of-magnitude speed. The knob is explicit: ef_search (HNSW) or nprobe (IVF) — higher means better recall and more latency. Treating ANN recall as exact is the classic mistake; measure it against a brute-force baseline.
HNSW vs IVF+PQ — the real decision
HNSW: best recall/latency, RAM-heavy, expensive to update — pick it when vectors fit in memory and updates are modest. IVF+PQ: quantizes vectors 4–32× smaller, scales to billions with faster ingest, at some recall cost you claw back with nprobe + reranking. It is a recall-vs-memory-vs-update-cost call, not a default.
Filter inside retrieval, never after
Apply tenant/metadata filters during the ANN/BM25 query (or via per-tenant namespaces). Post-filtering after fetching k can return near-zero results and, worse, risks leaking another tenant's vectors into the candidate set. This is both a relevance and a security boundary.
Fresh writes fight the immutable graph
HNSW graphs are expensive to mutate, so new vectors aren't instantly searchable. Buffer recent writes in a small flat (brute-force) index searched in parallel, then merge/rebuild on a schedule. Track index lag — "my new doc isn't found yet" is a merge-cadence issue, not a bug.
Sharding and the scatter-gather query
At scale, vectors shard across nodes; each shard searches its own index and a coordinator merges top-k. Recall now depends on per-shard ef and how results merge. One global index instead of per-tenant/shard scoping both bloats memory and risks leakage — scope deliberately.
What breaks at scale
1B vectors × 768 dims × 4 bytes ≈ 3 TB raw — quantization and sharding stop being optional. RAM (HNSW) or recall (PQ) becomes the ceiling, rebuild/compaction time grows, and tail latency is dominated by the slowest shard. Capacity-plan on memory and recall targets, not QPS alone.
In production
Pinecone, Weaviate, Qdrant, Milvus, and pgvector all expose HNSW (often IVF+PQ too) with metadata filtering; FAISS is the library many build on. The product differences are operational — sharding, filter correctness, freshness — not the core ANN math.
Common mistakes
- Filtering after ANN instead of before → wrong/leaky results
- Ignoring update cost of HNSW at scale
- One global index instead of per-tenant scoping
- Treating ANN recall as exact