Agentic AI Systems

Conversational Chatbot

A multi-turn assistant that stays grounded, remembers context, and escalates when unsure.

Requirements

Functional

  • Multi-turn dialogue
  • Grounded answers (RAG)
  • Conversation memory
  • Escalate / handoff

Non-functional

  • Low latency, streamed
  • Safe + on-topic
  • Cost-bounded per chat

Scale

Many concurrent sessions

The approach

On each turn: manage the window (summarize old turns), route the query (retrieve, call a tool, or answer directly), ground via RAG when the question is factual, stream the response, and apply guardrails to both input and output. Persist conversation memory across turns; detect low-confidence or out-of-scope turns and hand off to a human or a fallback.
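The ordering of those steps can be sketched as a small pipeline. This is a minimal illustration, not a reference design: every callable on `TurnPipeline` and the `FALLBACK` message are hypothetical stand-ins for real components, and streaming and window management are left out to keep the control flow visible.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical handoff message; a real system would also open a ticket or page a human.
FALLBACK = "I'm not sure about that. Let me connect you with a person."


@dataclass
class TurnPipeline:
    # Each callable is an assumed stand-in for a real component.
    input_ok: Callable[[str], bool]
    needs_retrieval: Callable[[str], bool]
    retrieve: Callable[[str], list[str]]
    generate: Callable[[str, list[str]], str]
    output_ok: Callable[[str], bool]

    def handle(self, message: str) -> tuple[str, bool]:
        """Return (reply, handed_off) for one user turn."""
        if not self.input_ok(message):           # input guardrail runs first
            return FALLBACK, True
        docs: list[str] = []
        if self.needs_retrieval(message):        # route: only factual turns retrieve
            docs = self.retrieve(message)
            if not docs:                         # nothing to ground on, so hand off
                return FALLBACK, True
        reply = self.generate(message, docs)
        if not self.output_ok(reply):            # output guardrail runs last
            return FALLBACK, True
        return reply, False
```

Passing the components in as callables keeps each one replaceable and testable on its own, which matters once the router or guardrails change more often than the pipeline does.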

Key components

Session store · window manager · query router · RAG retriever · LLM · guardrail filter · escalation path

Senior deep-dive

The context window is working memory, not storage — summarize or trim old turns or you overflow and lose the goal.

Not every turn needs retrieval — route per turn so chit-chat stays fast and irrelevant context doesn't derail the answer.

A confident wrong answer is worse than an escalation — detect uncertainty, ground factual claims, and let it say "I don't know."

Memory: the window is a budget, not a transcript

Keep recent turns verbatim plus a rolling summary of the rest; never replay the full history every turn (it overflows the window and inflates the bill). For long-lived users, push durable facts to a separate long-term memory (vector or structured) and retrieve them on demand, the same way you retrieve documents.
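A minimal sketch of the "recent turns verbatim plus rolling summary" idea follows. The names `WindowManager`, `keep_recent`, and `naive_summarize` are assumptions made for illustration; in particular, `naive_summarize` just concatenates truncated text where a real system would most likely call an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


def naive_summarize(previous_summary: str, evicted: list[Turn]) -> str:
    """Placeholder summarizer (assumption): swap in an LLM call in practice."""
    notes = "; ".join(f"{t.role}: {t.text[:40]}" for t in evicted)
    return f"{previous_summary} | {notes}" if previous_summary else notes


@dataclass
class WindowManager:
    keep_recent: int = 4                      # hypothetical budget of verbatim turns
    summary: str = ""
    recent: list[Turn] = field(default_factory=list)

    def add(self, turn: Turn) -> None:
        self.recent.append(turn)
        overflow = len(self.recent) - self.keep_recent
        if overflow > 0:
            # Fold the oldest turns into the summary instead of replaying them.
            evicted, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = naive_summarize(self.summary, evicted)

    def build_prompt(self, system: str) -> str:
        parts = [system]
        if self.summary:
            parts.append(f"Summary of earlier conversation: {self.summary}")
        parts.extend(f"{t.role}: {t.text}" for t in self.recent)
        return "\n".join(parts)
```

A production version would more likely budget by token count than by turn count, since one long message can cost more than several short ones.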

Route per turn — don't retrieve blindly

Classify each turn first: chit-chat and clarifications need no retrieval; factual questions do; some need a tool call. Retrieving on every turn adds latency and injects irrelevant context that derails the model. A cheap router (rules or a small model) in front pays for itself.
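A rules-based router can be as small as the sketch below. The three regular expressions are hypothetical examples for a support bot, not a recommended pattern set, and a small classifier model could replace `route_turn` behind the same interface.

```python
import re
from enum import Enum


class Route(Enum):
    DIRECT = "direct"      # answer without retrieval
    RETRIEVE = "retrieve"  # ground the answer via RAG
    TOOL = "tool"          # call an external tool or API


# Hypothetical patterns, for illustration only.
CHITCHAT = re.compile(
    r"^\s*(hi|hello|hey|thanks|thank you|ok|okay|bye)\b[\s!.?]*$", re.I
)
TOOL_INTENT = re.compile(
    r"\b(order status|track my order|cancel my order|reset my password)\b", re.I
)
FACTUAL = re.compile(
    r"\b(what|how|when|where|why|which|does|can i|policy|price|pricing)\b", re.I
)


def route_turn(message: str) -> Route:
    """Pick a route with cheap rules before any model or index is touched."""
    if CHITCHAT.match(message):
        return Route.DIRECT
    if TOOL_INTENT.search(message):
        return Route.TOOL
    if FACTUAL.search(message) or message.rstrip().endswith("?"):
        return Route.RETRIEVE
    return Route.DIRECT
```

Checking tool intent before the factual pattern is a deliberate choice here: "Where is my order? Track my order 123" looks like a question but is better served by a lookup than by document retrieval.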

Know when to escalate

A confident wrong answer costs more than an honest handoff. Detect low-confidence and out-of-scope turns (empty retrieval, a frustrated user, a sensitive topic) and route them to a human or a safe fallback. The maturity tell of a production bot is that it knows what it doesn't know.
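The signals named above can be combined into one decision function, sketched here. The keyword lists, the `min_score` threshold of 0.35, and the `TurnSignals` fields are all assumptions chosen for illustration; real systems would tune thresholds on labeled conversations and would probably use a classifier rather than substring matching for sentiment and topic.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists, for illustration only.
SENSITIVE_TERMS = ("lawsuit", "medical", "self-harm", "chargeback")
FRUSTRATION_TERMS = ("useless", "not helping", "speak to a human", "real person")


@dataclass
class TurnSignals:
    user_message: str
    retrieval_attempted: bool
    retrieved_count: int = 0
    top_score: float = 0.0   # similarity of the best retrieved chunk, assumed in [0, 1]


def escalation_reason(s: TurnSignals, min_score: float = 0.35) -> Optional[str]:
    """Return a reason to hand off, or None if the bot should answer."""
    text = s.user_message.lower()
    if any(term in text for term in SENSITIVE_TERMS):
        return "sensitive_topic"
    if any(term in text for term in FRUSTRATION_TERMS):
        return "frustrated_user"
    if s.retrieval_attempted and s.retrieved_count == 0:
        return "empty_retrieval"
    if s.retrieval_attempted and s.top_score < min_score:
        return "low_confidence"
    return None
```

Returning a reason string rather than a boolean makes the escalation rate easier to break down later by cause.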

Guardrails run both ways

Moderate the input and the output. Input: catch prompt-injection and abuse before they reach the model. Output: block unsafe, off-brand, or ungrounded replies before they reach the user. Treat user and retrieved text as untrusted — never let it silently rewrite your instructions.
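As a rough sketch of guardrails on both sides, the code below screens input for obvious injection phrasing, fences untrusted text before it enters the prompt, and rejects ungrounded or unsafe replies. The patterns, the blocked terms, and the `cited_sources` count are hypothetical; pattern matching alone is a weak defense against prompt injection, so treat this as one layer rather than a complete filter.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and terms, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your |the )?system prompt", re.I),
]
BLOCKED_OUTPUT_TERMS = ("internal use only", "ssn:")


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(user_text: str) -> Verdict:
    """Input side: flag likely prompt-injection before it reaches the model."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_text):
            return Verdict(False, "possible_prompt_injection")
    return Verdict(True)


def wrap_untrusted(label: str, text: str) -> str:
    """Fence user or retrieved text so the prompt presents it as data."""
    # Escape angle brackets so the text cannot close the fence early.
    cleaned = text.replace("<", "&lt;").replace(">", "&gt;")
    return f"<{label}>\n{cleaned}\n</{label}>"


def check_output(reply: str, used_retrieval: bool, cited_sources: int) -> Verdict:
    """Output side: block unsafe or ungrounded replies before the user sees them."""
    lowered = reply.lower()
    if any(term in lowered for term in BLOCKED_OUTPUT_TERMS):
        return Verdict(False, "unsafe_content")
    if used_retrieval and cited_sources == 0:
        return Verdict(False, "ungrounded")
    return Verdict(True)
```

Retrieved documents deserve the same `wrap_untrusted` treatment as user messages, since an attacker who can edit an indexed page can otherwise plant instructions there.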

Latency is a product feature

Stream tokens — time-to-first-token drives perceived speed far more than total time. Run retrieval and tool calls in parallel where you can, and keep each turn's prompt lean (summaries, not transcripts) so prefill stays cheap. A snappy stream beats a slow perfect answer.
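The sketch below shows the two latency tactics together: retrieval and a tool call run concurrently with `asyncio.gather`, and time-to-first-token is measured on the streamed reply. `retrieve`, `call_tool`, and `stream_llm` are stubs that sleep to simulate I/O; they stand in for whatever vector store, API, and model client a real system uses.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.05)                 # stand-in for a vector search
    return [f"doc about {query}"]


async def call_tool(query: str) -> str:
    await asyncio.sleep(0.05)                 # stand-in for an external API call
    return "order #123: shipped"


async def stream_llm(prompt: str) -> AsyncIterator[str]:
    for token in ["Your", " order", " has", " shipped", "."]:
        await asyncio.sleep(0.01)             # stand-in for per-token decode time
        yield token


async def answer_turn(query: str) -> tuple[str, Optional[float]]:
    """Return the full reply and the time-to-first-token in seconds."""
    start = time.perf_counter()
    # Both awaits overlap, so the wait is roughly the slower of the two, not the sum.
    docs, tool_result = await asyncio.gather(retrieve(query), call_tool(query))
    prompt = f"Context: {docs}\nTool: {tool_result}\nUser: {query}"
    first_token_at: Optional[float] = None
    chunks: list[str] = []
    async for token in stream_llm(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks.append(token)                  # a real server would flush this to the client
    return "".join(chunks), first_token_at
```

Calling `asyncio.run(answer_turn("where is my order"))` returns the assembled reply along with the measured time-to-first-token, which is the number worth tracking per turn.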

What breaks at scale

Many concurrent sessions make session state and memory storage real infrastructure, make per-chat cost ceilings necessary to stop runaway loops, and force explicit retention/privacy rules for stored history. Watch cost-per-conversation and escalation rate as your two north-star ops metrics.
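A per-chat cost ceiling can be sketched as a small budget object that is checked before every model call. The dollar ceiling, call limit, and per-1k-token prices below are made-up placeholder values, not real pricing for any provider.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a conversation hits its cost or call ceiling."""


@dataclass
class ChatBudget:
    max_cost_usd: float = 0.50          # hypothetical ceiling per conversation
    max_llm_calls: int = 30             # hypothetical guard against runaway loops
    usd_per_1k_input: float = 0.001     # placeholder price
    usd_per_1k_output: float = 0.002    # placeholder price
    spent_usd: float = 0.0
    llm_calls: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Account for one completed model call."""
        self.llm_calls += 1
        self.spent_usd += (
            input_tokens * self.usd_per_1k_input
            + output_tokens * self.usd_per_1k_output
        ) / 1000

    def ensure_within_budget(self) -> None:
        """Call before each model call; raises once either ceiling is reached."""
        if self.llm_calls >= self.max_llm_calls:
            raise BudgetExceeded("call limit reached")
        if self.spent_usd >= self.max_cost_usd:
            raise BudgetExceeded("cost ceiling reached")
```

Catching `BudgetExceeded` is a natural place to trigger the escalation path, and summing `spent_usd` across sessions gives the cost-per-conversation metric directly.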

In production

Intercom Fin, ChatGPT, and support bots combine windowed memory + summarization, optional RAG, streaming, and in/out moderation — with an escalation path to a human. The maturity tell is the handoff: good bots know when they do not know and route to a person.

Part of Agentic AI Systems on SystemLore: system design interview prep with 148 deep topics, interactive diagrams, and a practice game.