Agentic AI Systems

Vector Database

Store and search billions of embeddings by similarity with low latency and tunable recall.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Upsert / delete vectors
  • Top-k ANN search
  • Metadata filtering
  • Hybrid (dense + sparse)

Non-functional

  • Tunable recall vs latency
  • Horizontal scale
  • Fresh inserts visible quickly

Scale

Billions of vectors, high QPS

The approach

Vectors sharded across nodes; each shard builds an ANN index (HNSW for recall, or IVF+PQ for memory/scale). Queries scatter to shards, gather top-k, and merge. Metadata filters apply pre-search to scope candidates and enforce tenancy. A separate sparse/BM25 index supports hybrid; results fuse via rank fusion.

Key components

Router · index shards (HNSW/IVF+PQ) · metadata filter · sparse index · merge/rank-fusion layer

Numbers that matter

Senior deep-dive

ANN is approximate by definition — the core dial is recall vs latency vs memory, not an exact lookup.

HNSW gives the best recall/latency but is RAM-heavy and costly to update; IVF+PQ scales memory and ingest at some recall cost.

Filter before ranking for correctness and tenancy — post-filtering returns short, empty, or leaky results.

Approximate, not exact — and that is the point

Exact nearest-neighbour over millions of vectors is too slow, so ANN trades a little recall for orders-of-magnitude speed. The knob is explicit: ef_search (HNSW) or nprobe (IVF) — higher means better recall and more latency. Treating ANN recall as exact is the classic mistake; measure it against a brute-force baseline.

HNSW vs IVF+PQ — the real decision

HNSW: best recall/latency, RAM-heavy, expensive to update — pick it when vectors fit in memory and updates are modest. IVF+PQ: quantizes vectors 4–32× smaller, scales to billions with faster ingest, at some recall cost you claw back with nprobe + reranking. It is a recall-vs-memory-vs-update-cost call, not a default.

Filter inside retrieval, never after

Apply tenant/metadata filters during the ANN/BM25 query (or via per-tenant namespaces). Post-filtering after fetching k can return near-zero results and, worse, risks leaking another tenant's vectors into the candidate set. This is both a relevance and a security boundary.

Fresh writes fight the immutable graph

HNSW graphs are expensive to mutate, so new vectors aren't instantly searchable. Buffer recent writes in a small flat (brute-force) index searched in parallel, then merge/rebuild on a schedule. Track index lag — "my new doc isn't found yet" is a merge-cadence issue, not a bug.

Sharding and the scatter-gather query

At scale, vectors shard across nodes; each shard searches its own index and a coordinator merges top-k. Recall now depends on per-shard ef and how results merge. One global index instead of per-tenant/shard scoping both bloats memory and risks leakage — scope deliberately.

What breaks at scale

1B vectors × 768 dims × 4 bytes ≈ 3 TB raw — quantization and sharding stop being optional. RAM (HNSW) or recall (PQ) becomes the ceiling, rebuild/compaction time grows, and tail latency is dominated by the slowest shard. Capacity-plan on memory and recall targets, not QPS alone.

In production

Pinecone, Weaviate, Qdrant, Milvus, and pgvector all expose HNSW (often IVF+PQ too) with metadata filtering; FAISS is the library many build on. The product differences are operational — sharding, filter correctness, freshness — not the core ANN math.

Common mistakes

Related Agentic AI Systems

Part of Agentic AI Systems on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →