Agentic AI Systems

Production RAG System

Answer questions over a private, changing corpus — grounded, cited, and access-controlled.

Requirements

Functional

  • Ingest & re-index documents
  • Retrieve + rerank relevant chunks
  • Generate cited, grounded answers
  • Honest refusal when no context

Non-functional

  • Per-tenant isolation / ACLs
  • Fresh within minutes of change
  • Low p95 answer latency

Scale

Millions of chunks, many tenants, read-heavy

The approach

Ingestion pipeline (parse → chunk → embed → index with metadata/ACLs) feeds a vector store. Query path: embed → hybrid retrieve (dense + BM25) with ACL filter → cross-encoder rerank → prompt with cited context and a refuse-if-empty instruction → stream answer. Re-index on document change; evaluate retrieval and generation separately and continuously.
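One way to combine the dense and BM25 result lists before reranking is reciprocal rank fusion (RRF). The source does not say which fusion method to use, so RRF here is an assumption, as are the function name `rrf_fuse` and the constant `k=60`. The sketch assumes both lists were already ACL-filtered by their retrievers.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk, so a chunk ranked well
    by both retrievers beats one ranked well by only one of them.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical ranked outputs of the two retrievers for one query.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

The fused list is what would be handed to the cross-encoder reranker. RRF only needs ranks, so the dense and BM25 scores never have to be put on a common scale.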

Key components

Ingest workers · embedding model · vector DB (HNSW) · BM25 index · reranker · LLM · eval + trace store

Senior deep-dive

Most "the LLM is wrong" bugs are retrieval misses, not generation — if the right chunk never makes the top-k, no prompt wording recovers it. So evaluate retrieval (recall@k) and generation (faithfulness) separately.

Enforce ACLs at retrieval, never in the prompt — prompt-level rules are bypassable and leak across tenants.

"No relevant context" must refuse, not invent — a grounded "I don't know" beats a confident hallucination.

Retrieval quality is the whole game

The model can only reason over what you put in the context window — bad retrieval, bad answer. Before touching the prompt or the model, measure recall@k: was the right chunk even a candidate? Most production RAG failures live here, not in generation, so fix chunking and reranking first.
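A minimal sketch of the recall@k measurement, assuming a small judged set where each query has a ranked list of retrieved chunk ids and a set of chunk ids known to be relevant. The function names and the sample data are illustrative, not from the source.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("a judged query needs at least one relevant chunk")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(judged, k):
    """Average recall@k over (retrieved, relevant) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in judged) / len(judged)


# Hypothetical judged set: (ranked retrieved ids, relevant ids).
judged = [
    (["a", "b", "c"], {"b"}),        # relevant chunk is in the top 2
    (["d", "e", "f"], {"x", "d"}),   # one of two relevant chunks was never retrieved
]
print(mean_recall_at_k(judged, k=2))  # 0.75
```

A low value here points at chunking, retrieval, or reranking rather than at the prompt or the model.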

Grounding and citations are non-negotiable

Instruct the model to answer only from the provided context and cite the span for every claim, then verify with a faithfulness check (an LLM-judge or NLI model over the cited spans). Citations aren't decoration — they're how you detect hallucination and how users learn to trust the answer.
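The sketch below shows one way to build such a prompt and run a cheap structural check on the answer. It assumes citations are written as `[n]`, pointing at numbered context chunks. The refusal string and all function names are hypothetical. The check only catches missing or out-of-range citations; it does not replace the LLM-judge or NLI faithfulness check described above.

```python
import re

REFUSAL = "I don't know based on the available documents."


def build_prompt(question, chunks):
    """Return a grounded prompt, or None when there is nothing to ground on."""
    if not chunks:
        return None  # caller should return REFUSAL without calling the LLM
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. Cite the supporting chunk as [n] "
        "in every sentence. If the context does not contain the answer, "
        f"reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def invalid_citations(answer, num_chunks):
    """Citation numbers that do not correspond to any provided chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


def uncited_sentences(answer):
    """Sentences carrying no [n] marker (naive split on ., ! and ?)."""
    if answer.strip() == REFUSAL:
        return []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[\d+\]", s)]
```

For example, `invalid_citations("Refunds take 5 days [1]. Shipping is free [3].", 2)` returns `[3]`, which flags a citation to a chunk that was never in the context.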

ACLs belong in retrieval, not the prompt

Filter by tenant and document permissions during retrieval, so forbidden chunks are never candidates. "The system prompt says don't reveal other tenants' data" is not security: it is one prompt injection away from a breach. Relying on the prompt for access control is the single most dangerous RAG bug.
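A minimal in-memory sketch of filtering before scoring, assuming each chunk carries a tenant id and a set of groups allowed to read it. The `Chunk` fields and the `search` function are hypothetical. A real vector database would apply the same predicate as a metadata filter inside its ANN search rather than scanning a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: frozenset
    vector: tuple


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(index, query_vec, tenant_id, user_groups, top_k=3):
    """Score only the chunks the caller is allowed to see."""
    groups = frozenset(user_groups)
    visible = [
        c for c in index
        if c.tenant_id == tenant_id and c.allowed_groups & groups
    ]
    # Forbidden chunks were dropped above, so they can never be ranked.
    visible.sort(key=lambda c: dot(c.vector, query_vec), reverse=True)
    return visible[:top_k]


index = [
    Chunk("a1", "acme", frozenset({"eng"}), (1.0, 0.0)),
    Chunk("a2", "acme", frozenset({"hr"}), (0.9, 0.1)),
    Chunk("b1", "globex", frozenset({"eng"}), (1.0, 0.0)),
]
hits = search(index, (1.0, 0.0), tenant_id="acme", user_groups={"eng"})
print([c.chunk_id for c in hits])  # ['a1']
```

The other tenant's chunk `b1` matches the query just as well as `a1`, but it is excluded before any scoring happens, so nothing downstream can leak it.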

Freshness is a pipeline problem, not a model one

Knowledge is updated by re-indexing when documents change, not by retraining the model. Track index lag (the time from a document changing to that change being searchable) as an SLA; a stale answer is a freshness-pipeline failure. On each change: detect it, re-embed the affected chunks, upsert them, and invalidate any cached answers tied to the changed documents.
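The steps above can be sketched as follows. Content hashing is assumed as the change detector, and the index works per document rather than per chunk to keep the example short. The class, its fields, and the injected `embed` callable are all hypothetical.

```python
import hashlib
import time


class FreshnessIndex:
    def __init__(self, embed):
        self.embed = embed        # callable: text -> vector (assumed)
        self.hashes = {}          # doc_id -> content hash
        self.vectors = {}         # doc_id -> embedding
        self.answer_cache = {}    # cache key -> (answer, set of doc_ids used)
        self.lags = []            # seconds from doc change to searchable

    def on_document_change(self, doc_id, text, changed_at, now=None):
        """Re-index one document; return False if the content is unchanged."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # change-detect: skip the embedding cost
        # Re-embed and upsert.
        self.vectors[doc_id] = self.embed(text)
        self.hashes[doc_id] = digest
        # Invalidate cached answers that were built from this document.
        stale = [k for k, (_, docs) in self.answer_cache.items() if doc_id in docs]
        for key in stale:
            del self.answer_cache[key]
        # Record index lag so it can be tracked against the SLA.
        finished = time.time() if now is None else now
        self.lags.append(finished - changed_at)
        return True
```

Tracking `lags` as a percentile against the "fresh within minutes" requirement turns staleness into a measurable pipeline metric.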

Evaluate the two halves separately

A single end-to-end "looks good" score hides which half is broken. Retrieval: recall@k / nDCG on a judged set. Generation: faithfulness + answer-correctness on the same set. Build the judged set from real (especially failed) queries and grow it after every incident, or it rots as the corpus drifts.
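As an illustration, nDCG@k for the retrieval half and a simple per-query triage can be sketched like this. It assumes graded relevance labels per chunk id and uses the linear-gain form of DCG. The `diagnose` helper and its labels are hypothetical, and the faithfulness flag is assumed to come from a separate judge.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(retrieved, relevance, k):
    """nDCG@k, where relevance maps chunk id -> graded label (0 if absent)."""
    gains = [relevance.get(chunk_id, 0) for chunk_id in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def diagnose(relevant_retrieved, answer_faithful):
    """Attribute a judged query to the half of the system that failed."""
    if not relevant_retrieved:
        return "retrieval"    # the right chunk was never a candidate
    if not answer_faithful:
        return "generation"   # context was there, the answer strayed from it
    return "ok"


# The only relevant chunk sits at rank 2 instead of rank 1.
score = ndcg_at_k(["a", "b", "c"], {"b": 1}, k=3)
print(round(score, 4))  # 0.6309
```

Counting `diagnose` outcomes over the judged set shows at a glance whether the next fix belongs in retrieval or in generation.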

What breaks at scale

Many tenants × millions of chunks forces per-tenant index scoping (a single global index risks cross-tenant leaks and grows too large to search cheaply), makes re-index throughput the bottleneck, and makes cost per query (embed + ANN + rerank + LLM) dominate the bill. Cache aggressively: query embeddings, retrieved sets, and full answers for repeated questions.
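A minimal sketch of a full-answer cache for repeated questions. It assumes the cache key must include the tenant and the caller's permission groups, so that a cached answer is never served across an ACL boundary. The source does not spell this scoping out, and the names below are hypothetical.

```python
import hashlib
import json


def cache_key(tenant_id, user_groups, query):
    """Key scoped by tenant and permissions; the query is normalized."""
    normalized = " ".join(query.lower().split())
    # JSON encoding avoids collisions from delimiters inside ids.
    raw = json.dumps([tenant_id, sorted(user_groups), normalized])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class AnswerCache:
    def __init__(self):
        self._store = {}

    def get_or_compute(self, tenant_id, user_groups, query, compute):
        """Return a cached answer, or run the full pipeline once and store it."""
        key = cache_key(tenant_id, user_groups, query)
        if key not in self._store:
            self._store[key] = compute(query)  # embed + ANN + rerank + LLM
        return self._store[key]
```

The same keying idea applies to cached query embeddings and retrieved sets, although embeddings of the query text alone do not depend on permissions and could be shared more widely.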

In production

Glean and Notion AI are RAG over a private, permissioned corpus, and Perplexity applies the same pattern to web search; all follow broadly the same shape (retrieve → rerank → grounded, cited generation). Frameworks (LlamaIndex, LangChain) and managed stacks (Vertex / Bedrock knowledge bases) ship this shape; the differentiator is retrieval quality and ACL correctness, not the LLM.

Part of Agentic AI Systems on SystemLore, a system design interview prep resource.