Vector Database
Find nearest-neighbor vectors among billions for semantic search/RAG.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Insert embeddings
- k-NN similarity search
- Filter + search
- Updates
Non-functional
- Low-latency ANN
- Scales to billions
- High recall
Scale
Billions of vectors
The approach
Approximate nearest neighbor (ANN) indexes — HNSW (navigable small-world graphs) or IVF/PQ — trade a little recall for huge speed vs exact search; sharded across nodes; quantization compresses vectors; metadata filtering combined with vector search.
Key components
Embedder → ANN index (HNSW/IVF, sharded) → query
Numbers that matter
- HNSW at 96 dimensions achieves ~99% recall at ~1ms query latency; scaling to 1536 dims (OpenAI ada-002) pushes memory per vector to ~6KB uncompressed.
- Product quantization (PQ with m=96 subspaces) compresses 1536-dim float32 to ~96 bytes — a 64× size reduction with ~5% recall loss at typical settings.
- Pinecone reports p99 query latency of ~20ms at 100M vectors with HNSW at ef=100 on their managed infrastructure.
- HNSW construction time scales roughly O(n log n) — indexing 100M vectors takes hours; IVF training on the same set takes minutes but has worse recall at low nprobe.
Senior deep-dive
The indexing algorithm determines everything — HNSW gives the best recall/speed tradeoff but is memory-hungry; IVF+PQ compresses aggressively but trades recall.
Quantization is the scaling lever: product quantization shrinks a 1536-dim float32 vector from 6KB to ~96 bytes, enabling billion-scale indexes in RAM.
Recall, not accuracy, is the metric — ANN indexes return approximate results; tuning ef/nprobe at query time trades latency for recall, and most applications can accept 95% recall at 5ms instead of 100% recall at 500ms.
HNSW vs IVF+PQ: the index choice is the whole game
HNSW (Hierarchical Navigable Small World) builds a layered proximity graph; queries traverse from coarse to fine layers in O(log n) hops. Memory cost is ~(dims × 4 bytes) + graph edges — typically 50–100 bytes/vector overhead on top of the vector itself. IVF+PQ clusters vectors into Voronoi cells (the IVF part), then compresses each vector with product quantization — much smaller, but requires scanning multiple cells (nprobe) to match HNSW recall. Choose HNSW when you have RAM and need consistent low latency; choose IVF+PQ for billion-scale on constrained memory.
Quantization: the only path to billion-scale in RAM
A billion 1536-dim float32 vectors = 5.7TB of raw vectors — impossible to hold in RAM on any reasonable cluster. Product quantization divides each vector into m sub-vectors and codes each to a centroid from a learned codebook, shrinking to m bytes per vector. At m=96, that's 5.7TB → ~90GB — fits on a single large machine. Recall loss is typically 3–7% at m=96 for text embeddings, which is acceptable for most retrieval use cases. Scalar quantization (int8) is a simpler 4× compression with almost zero recall loss.
Sharding: distribute the index, not just the data
Vector indexes are not trivially shardable: a query must either fan out to all shards (scatter-gather) or use space partitioning (route to the shard whose centroid is nearest the query). Scatter-gather is simpler — every shard returns its top-k, then a merger picks the global top-k — but latency is bounded by the slowest shard. Routing by cluster assignment reduces fanout but requires that the routing centroids are good, which degrades for out-of-distribution queries. Most production systems use scatter-gather with a small number of shards (4–16) and compensate with caching.
Deletes and updates: the hidden operational cost
HNSW has no native delete — removing a vector from the graph requires rebuilding affected edges. All production implementations use tombstone soft-deletes: the vector stays in the graph but is filtered at query result time. As tombstone density grows (>5–10%), recall degrades and query latency increases because the traversal visits dead nodes. Compaction (rebuild the index without tombstones) is expensive — plan for offline rebuild windows or use a CDC-style rolling rebuild on a secondary index.
Hybrid search: dense + sparse together
Pure ANN search misses exact keyword matches that BM25 retrieval handles well — e.g., a specific product code or a rare proper noun. Production RAG systems combine dense vector search (semantic similarity) with sparse BM25 search (keyword overlap) using a reciprocal rank fusion or a learned reranker. Elasticsearch added vector search as a native field type for this reason. The merge layer adds latency but dramatically improves retrieval quality on heterogeneous corpora.
What breaks at scale
Index build time becomes a deployment blocker: building an HNSW index over 500M vectors takes 8–20 hours on 64 cores — you can't rebuild on every schema change. Incremental index patching is the operational answer, but it requires careful coordination between the write path and the read path to avoid serving a half-updated index. The second failure is embedding model drift: when you swap the embedding model (e.g., OpenAI ada-001 to ada-002), all stored vectors are incompatible with new query vectors — you must re-embed and re-index the entire corpus, which is a full data migration at billion-vector scale.
In production
Pinecone and Weaviate use HNSW as the default index because it delivers the best recall-per-query-latency tradeoff without per-query training data assumptions. Qdrant adds scalar quantization as a lighter alternative to PQ for moderate compression. Meta's FAISS library underpins most of these — specifically `IndexHNSWFlat` for in-memory and `IndexIVFPQ` for disk-friendly compressed indexes. The real challenge at scale is index mutability: HNSW supports incremental inserts natively but deletes are soft-deletes (tombstones) that bloat the graph over time, requiring periodic graph compaction — Pinecone handles this transparently but self-hosted deployments must schedule it or watch recall degrade as tombstone density grows.
Common mistakes
- Exact search at scale (too slow)
- Ignoring the recall/speed tradeoff
- No quantization (memory blowup)