System Design Library

Vector Database

Find nearest-neighbor vectors among billions for semantic search/RAG.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Insert embeddings
k-NN similarity search
Filter + search
Updates

Non-functional

Low-latency ANN
Scales to billions
High recall

Scale

Billions of vectors

The approach

Approximate nearest neighbor (ANN) indexes — HNSW (navigable small-world graphs) or IVF/PQ — trade a little recall for huge speed vs exact search; sharded across nodes; quantization compresses vectors; metadata filtering combined with vector search.

Key components

Embedder → ANN index (HNSW/IVF, sharded) → query

Numbers that matter

HNSW at 96 dimensions achieves ~99% recall at ~1ms query latency; scaling to 1536 dims (OpenAI ada-002) pushes memory per vector to ~6KB uncompressed.
Product quantization (PQ with m=96 subspaces) compresses 1536-dim float32 to ~96 bytes — a 64× size reduction with ~5% recall loss at typical settings.
Pinecone reports p99 query latency of ~20ms at 100M vectors with HNSW at ef=100 on their managed infrastructure.
HNSW construction time scales roughly O(n log n) — indexing 100M vectors takes hours; IVF training on the same set takes minutes but has worse recall at low nprobe.

Senior deep-dive

The indexing algorithm determines everything — HNSW gives the best recall/speed tradeoff but is memory-hungry; IVF+PQ compresses aggressively but trades recall.

Quantization is the scaling lever: product quantization shrinks a 1536-dim float32 vector from 6KB to ~96 bytes, enabling billion-scale indexes in RAM.

Recall, not accuracy, is the metric — ANN indexes return approximate results; tuning ef/nprobe at query time trades latency for recall, and most applications can accept 95% recall at 5ms instead of 100% recall at 500ms.

HNSW vs IVF+PQ: the index choice is the whole game

HNSW (Hierarchical Navigable Small World) builds a layered proximity graph; queries traverse from coarse to fine layers in O(log n) hops. Memory cost is ~(dims × 4 bytes) + graph edges — typically 50–100 bytes/vector overhead on top of the vector itself. IVF+PQ clusters vectors into Voronoi cells (the IVF part), then compresses each vector with product quantization — much smaller, but requires scanning multiple cells (nprobe) to match HNSW recall. Choose HNSW when you have RAM and need consistent low latency; choose IVF+PQ for billion-scale on constrained memory.

Quantization: the only path to billion-scale in RAM

A billion 1536-dim float32 vectors = 5.7TB of raw vectors — impossible to hold in RAM on any reasonable cluster. Product quantization divides each vector into m sub-vectors and codes each to a centroid from a learned codebook, shrinking to m bytes per vector. At m=96, that's 5.7TB → ~90GB — fits on a single large machine. Recall loss is typically 3–7% at m=96 for text embeddings, which is acceptable for most retrieval use cases. Scalar quantization (int8) is a simpler 4× compression with almost zero recall loss.

Sharding: distribute the index, not just the data

Vector indexes are not trivially shardable: a query must either fan out to all shards (scatter-gather) or use space partitioning (route to the shard whose centroid is nearest the query). Scatter-gather is simpler — every shard returns its top-k, then a merger picks the global top-k — but latency is bounded by the slowest shard. Routing by cluster assignment reduces fanout but requires that the routing centroids are good, which degrades for out-of-distribution queries. Most production systems use scatter-gather with a small number of shards (4–16) and compensate with caching.

Deletes and updates: the hidden operational cost

HNSW has no native delete — removing a vector from the graph requires rebuilding affected edges. All production implementations use tombstone soft-deletes: the vector stays in the graph but is filtered at query result time. As tombstone density grows (>5–10%), recall degrades and query latency increases because the traversal visits dead nodes. Compaction (rebuild the index without tombstones) is expensive — plan for offline rebuild windows or use a CDC-style rolling rebuild on a secondary index.

Hybrid search: dense + sparse together

Pure ANN search misses exact keyword matches that BM25 retrieval handles well — e.g., a specific product code or a rare proper noun. Production RAG systems combine dense vector search (semantic similarity) with sparse BM25 search (keyword overlap) using a reciprocal rank fusion or a learned reranker. Elasticsearch added vector search as a native field type for this reason. The merge layer adds latency but dramatically improves retrieval quality on heterogeneous corpora.

What breaks at scale

Index build time becomes a deployment blocker: building an HNSW index over 500M vectors takes 8–20 hours on 64 cores — you can't rebuild on every schema change. Incremental index patching is the operational answer, but it requires careful coordination between the write path and the read path to avoid serving a half-updated index. The second failure is embedding model drift: when you swap the embedding model (e.g., OpenAI ada-001 to ada-002), all stored vectors are incompatible with new query vectors — you must re-embed and re-index the entire corpus, which is a full data migration at billion-vector scale.

In production

Pinecone and Weaviate use HNSW as the default index because it delivers the best recall-per-query-latency tradeoff without per-query training data assumptions. Qdrant adds scalar quantization as a lighter alternative to PQ for moderate compression. Meta's FAISS library underpins most of these — specifically `IndexHNSWFlat` for in-memory and `IndexIVFPQ` for disk-friendly compressed indexes. The real challenge at scale is index mutability: HNSW supports incremental inserts natively but deletes are soft-deletes (tombstones) that bloat the graph over time, requiring periodic graph compaction — Pinecone handles this transparently but self-hosted deployments must schedule it or watch recall degrade as tombstone density grows.

Common mistakes

Exact search at scale (too slow)
Ignoring the recall/speed tradeoff
No quantization (memory blowup)

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →