Production RAG System
Answer questions over a private, changing corpus — grounded, cited, and access-controlled.
Requirements
Functional
- Ingest & re-index documents
- Retrieve + rerank relevant chunks
- Generate cited, grounded answers
- Refuse honestly when no relevant context is retrieved
Non-functional
- Per-tenant isolation / ACLs
- Fresh within minutes of change
- Low p95 answer latency
Scale
Millions of chunks, many tenants, read-heavy
The approach
Ingestion pipeline (parse → chunk → embed → index with metadata/ACLs) feeds a vector store. Query path: embed → hybrid retrieve (dense + BM25) with ACL filter → cross-encoder rerank → prompt with cited context and a refuse-if-empty instruction → stream answer. Re-index on document change; evaluate retrieval and generation separately and continuously.
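The hybrid retrieve step has to merge the dense and BM25 result lists into one candidate ranking before the reranker runs. The text does not say how the two lists are fused; one common option is reciprocal rank fusion (RRF), sketched below. The function name, the chunk ids, and the constant k=60 are illustrative assumptions, not part of the design above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is a commonly used default (an assumption here, tune it on your data).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c3", "c1", "c7"]  # hypothetical ids returned by the vector index
bm25_hits = ["c1", "c9", "c3"]   # hypothetical ids returned by the BM25 index
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Chunks that appear in both lists rise to the top, which is the point of hybrid retrieval: the fused list is what the cross-encoder reranker would then score.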
Key components
Ingest workers · embedding model · vector DB (HNSW) · BM25 index · reranker · LLM · eval + trace store
Numbers that matter
- Chunk 256–512 tokens; retrieve ~20–50 candidates, rerank to the ~3–8 you actually fit in the prompt — context is scarce and expensive.
- Re-index latency (doc change → searchable) is the real freshness SLA: aim for seconds to minutes. It is a pipeline problem, not a model-latency problem.
- Track faithfulness/groundedness (does every claim trace to a cited span?), not just "helpfulness" — it is what users trust.
- Most production hallucinations are retrieval misses: if the right chunk is not in the top-k, no prompt wording recovers it.
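The sizing numbers in the list above can be sketched as a sliding-window chunker plus the worst-case context arithmetic. Assumptions: whitespace-split words stand in for the embedding model's tokens, and the 64-token overlap is illustrative (the list gives no overlap value).

```python
def chunk_tokens(tokens, size=384, overlap=64):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace split is a stand-in for the real tokenizer.
tokens = ("word " * 1000).split()
chunks = chunk_tokens(tokens)

# Worst-case context spend at the top of both ranges: 8 chunks x 512 tokens.
max_context_tokens = 8 * 512
```

At the upper end of both ranges the retrieved context alone costs about 4k tokens per query, before the question, instructions, and answer are counted.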
Senior deep-dive
Most "the LLM is wrong" bugs are retrieval misses, not generation — if the right chunk never makes the top-k, no prompt wording recovers it. So evaluate retrieval (recall@k) and generation (faithfulness) separately.
Enforce ACLs at retrieval, never in the prompt — prompt-level rules are bypassable and leak across tenants.
"No relevant context" must refuse, not invent — a grounded "I don't know" beats a confident hallucination.
Retrieval quality is the whole game
The model can only reason over what you put in the context window — bad retrieval, bad answer. Before touching the prompt or the model, measure recall@k: was the right chunk even a candidate? Most production RAG failures live here, not in generation, so fix chunking and reranking first.
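A minimal sketch of the recall@k measurement described above, assuming a judged set where each query has a list of known-relevant chunk ids. The function names and the data layout are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the judged-relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each judged query needs at least one relevant chunk")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(judged_queries, k):
    """judged_queries: list of (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ret, rel, k) for ret, rel in judged_queries]
    return sum(scores) / len(scores)


# Hypothetical judged query: the right chunks are c3 and c9.
retrieved = ["c1", "c2", "c3"]
relevant = ["c3", "c9"]
recall_top2 = recall_at_k(retrieved, relevant, k=2)
recall_top3 = recall_at_k(retrieved, relevant, k=3)
```

A low value here means the generator never saw the right chunk, so prompt or model changes cannot fix the answer.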
Grounding and citations are non-negotiable
Instruct the model to answer only from the provided context and cite the span for every claim, then verify with a faithfulness check (an LLM-judge or NLI model over the cited spans). Citations aren't decoration — they're how you detect hallucination and how users learn to trust the answer.
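One way to make the citation requirement checkable is to number the context spans in the prompt and then validate the citations structurally before running the faithfulness check. The sketch below assumes a hypothetical `[n]` citation format placed before the sentence-ending punctuation. It only checks that every sentence cites an existing span; whether the span actually supports the claim is still the job of the LLM-judge or NLI model.

```python
import re


def build_prompt(question, chunks):
    """Number each context span so the model can cite it as [n]."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. Cite the supporting span as [n] "
        "at the end of every claim, before the period. If the context does not "
        "contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def citation_problems(answer, num_chunks):
    """Return (problem, sentence) pairs for uncited or mis-cited sentences."""
    problems = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cited:
            problems.append(("uncited", sentence))
        elif any(not 1 <= n <= num_chunks for n in cited):
            problems.append(("unknown_span", sentence))
    return problems


prompt = build_prompt("How long do refunds take?", ["Refunds take 5 days.", "Shipping is free."])
issues = citation_problems("Refunds take 5 days [1]. Returns are free [3].", num_chunks=2)
```

An answer that fails this cheap check can be rejected or regenerated without spending a judge-model call on it.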
ACLs belong in retrieval, not the prompt
Filter by tenant and document permissions during retrieval, so forbidden chunks are never candidates. "The system prompt says don't reveal other tenants' data" is not security — it's one prompt injection away from a breach. This is the single most dangerous RAG bug.
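A minimal in-memory sketch of filtering before ranking, so forbidden chunks never become candidates. The `Chunk` fields, the group-based permission model, and the dot-product scoring are assumptions for illustration; in a real vector database the same condition would be expressed as a metadata filter applied inside the ANN search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: frozenset  # groups permitted to read the source document
    embedding: tuple


def retrieve(chunks, query_embedding, tenant_id, user_groups, top_k=20):
    """Rank only the chunks the caller is allowed to see."""
    groups = frozenset(user_groups)
    # The ACL filter runs first: other tenants' chunks are never scored.
    visible = [
        c for c in chunks
        if c.tenant_id == tenant_id and c.allowed_groups & groups
    ]

    def score(chunk):
        return sum(a * b for a, b in zip(chunk.embedding, query_embedding))

    return sorted(visible, key=score, reverse=True)[:top_k]


corpus = [
    Chunk("a", "t1", frozenset({"eng"}), (1.0, 0.0)),
    Chunk("b", "t2", frozenset({"eng"}), (1.0, 0.0)),        # other tenant
    Chunk("c", "t1", frozenset({"hr"}), (1.0, 0.0)),         # wrong group
    Chunk("d", "t1", frozenset({"eng", "hr"}), (0.5, 0.0)),
]
results = retrieve(corpus, (1.0, 0.0), tenant_id="t1", user_groups=["eng"])
```

Because the filter is applied to the candidate set, nothing the model is told or tricked into doing can surface chunks "b" or "c" for this caller.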
Freshness is a pipeline problem, not a model one
Knowledge updates by re-indexing on document change, not retraining. Track index lag (doc changed → searchable) as an SLA — stale answers are a freshness-pipeline failure. Change-detect, re-embed, upsert, and invalidate any cached answers tied to the changed docs.
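The change-detect, re-embed, upsert, invalidate sequence can be sketched with a content hash as the change detector. Everything here is a simplified stand-in: `embed` is a hypothetical callable, dictionaries replace the vector store and answer cache, and a whole document is embedded as one unit instead of per chunk.

```python
import hashlib


class Indexer:
    def __init__(self, embed):
        self.embed = embed        # hypothetical callable: text -> vector
        self.doc_hashes = {}      # doc_id -> content hash at last index
        self.vectors = {}         # doc_id -> embedding (stand-in for the vector store)
        self.answer_cache = {}    # question -> (answer, set of source doc_ids)

    def on_document_change(self, doc_id, text):
        """Re-index a document if its content changed; return True if it did."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.doc_hashes.get(doc_id) == digest:
            return False  # unchanged content: skip the embedding cost
        self.vectors[doc_id] = self.embed(text)  # re-embed and upsert
        self.doc_hashes[doc_id] = digest
        # Drop cached answers that cited the changed document.
        stale = [q for q, (_, sources) in self.answer_cache.items() if doc_id in sources]
        for question in stale:
            del self.answer_cache[question]
        return True


indexer = Indexer(embed=lambda text: (len(text),))  # toy embedding for the example
first = indexer.on_document_change("d1", "hello")
repeat = indexer.on_document_change("d1", "hello")
indexer.answer_cache = {"q1": ("old answer", {"d1"}), "q2": ("other", {"d2"})}
changed = indexer.on_document_change("d1", "hello world")
```

Index lag for the SLA would be measured as the time between the document's change event and the return of `on_document_change`.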
Evaluate the two halves separately
One end-to-end "looks good" score hides which half is broken. Retrieval: recall@k / nDCG on a judged set. Generation: faithfulness + answer-correctness on the same set. Build the judged set from real (especially failed) queries and grow it from every incident, or it rots as the corpus drifts.
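For the retrieval half, nDCG rewards putting the most relevant chunks first, which recall@k alone does not capture. The sketch below uses the linear-gain form of DCG with graded relevance labels; the label scale (0 to 2) and the function names are assumptions.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved_ids, graded_relevance, k):
    """graded_relevance: chunk_id -> judged grade; unjudged chunks count as 0."""
    gains = [graded_relevance.get(cid, 0) for cid in retrieved_ids]
    ideal = sorted(graded_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


grades = {"a": 2, "b": 1}  # hypothetical judgments for one query
perfect = ndcg_at_k(["a", "b"], grades, k=2)
swapped = ndcg_at_k(["b", "a"], grades, k=2)
```

Reporting this next to recall@k, and separately from the generation scores, shows whether a regression came from ranking or from the model.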
What breaks at scale
Many tenants × millions of chunks forces per-tenant index scoping (a single global index bloats, and one missed filter leaks across tenants), makes re-index throughput the bottleneck, and makes cost per query (embed + ANN + rerank + LLM) dominate the bill. Cache aggressively: query embeddings, retrieved sets, and full answers for repeated questions.
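Caching retrieved sets and full answers interacts with access control: a cache keyed on the question alone could serve one tenant's answer to another. A minimal sketch of a scoped cache key follows, assuming a hypothetical group-based permission model and simple whitespace and case normalization of the question.

```python
import hashlib
import json


def cache_key(tenant_id, user_groups, question):
    """Key that differs whenever the tenant, the permission scope, or the question differs."""
    normalized = " ".join(question.lower().split())
    # JSON encoding keeps the three parts unambiguous, with no delimiter collisions.
    raw = json.dumps([tenant_id, sorted(user_groups), normalized])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


key_a = cache_key("t1", ["eng", "hr"], "What  is the refund policy?")
key_b = cache_key("t1", ["hr", "eng"], "what is the refund policy?")
key_other_tenant = cache_key("t2", ["eng", "hr"], "what is the refund policy?")
key_other_scope = cache_key("t1", ["eng"], "what is the refund policy?")
```

Query-embedding caches do not need the scope in the key, since an embedding of the question text reveals nothing about any tenant's documents.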
In production
Glean and Notion AI are RAG over a private, permissioned corpus, and Perplexity follows the same shape over the public web: hybrid retrieve → rerank → grounded, cited generation. Frameworks (LlamaIndex, LangChain) and managed stacks (Vertex AI, Amazon Bedrock Knowledge Bases) ship this shape; the differentiator is retrieval quality and ACL correctness, not the LLM.
Common mistakes
- ACLs only in the prompt → cross-tenant leak
- No reranker → noisy context, weak answers
- No refusal path → confident hallucination
- Evaluating end-to-end only, never retrieval alone
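For the "no refusal path" mistake in the list above, a minimal sketch of the gate that sits between reranking and generation. The score threshold of 0.5, the refusal wording, and the function name are assumptions; the right threshold depends on the reranker and should be tuned on the judged set.

```python
REFUSAL = "I don't know: no relevant context was found for this question."


def select_context_or_refuse(scored_chunks, min_score=0.5, max_chunks=8):
    """scored_chunks: list of (chunk_text, rerank_score) pairs.

    Returns (context_chunks, None) when enough evidence exists,
    or (None, REFUSAL) so the caller skips generation entirely.
    """
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    if not kept:
        return None, REFUSAL
    return [text for text, _ in kept[:max_chunks]], None


weak = select_context_or_refuse([("unrelated chunk", 0.2)])
strong = select_context_or_refuse([("a", 0.9), ("c", 0.1), ("b", 0.6)])
```

Refusing before the LLM call avoids asking the model to answer with nothing to ground on, and it saves the generation cost for unanswerable queries.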