Conversational Chatbot
A multi-turn assistant that stays grounded, remembers context, and escalates when unsure.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Multi-turn dialogue
- Grounded answers (RAG)
- Conversation memory
- Escalate / handoff
Non-functional
- Low latency, streamed
- Safe + on-topic
- Cost-bounded per chat
Scale
Many concurrent sessions
The approach
Per turn: manage the window (summarize old turns), route the query (retrieve, call a tool, or answer directly), ground via RAG when factual, stream the response, and apply input/output guardrails. Persist conversation memory; detect low-confidence/out-of-scope and hand off to a human or fallback.
Key components
Session store · window manager · query router · RAG retriever · LLM · guardrail filter · escalation path
Numbers that matter
- Keep the last ~5–10 turns verbatim plus a rolling summary of the rest — the window is a budget, not a transcript.
- Route per turn: many messages (chit-chat, clarifications) need no retrieval — skipping it cuts latency and avoids derailing context.
- Stream tokens — time-to-first-token (~hundreds of ms) drives perceived speed far more than total completion time.
- Set a per-chat cost / tool-call ceiling; long multi-turn sessions are a top cost surprise.
Senior deep-dive
The context window is working memory, not storage — summarize or trim old turns or you overflow and lose the goal.
Not every turn needs retrieval — route per turn so chit-chat stays fast and irrelevant context doesn't derail the answer.
A confident wrong answer is worse than an escalation — detect uncertainty, ground factual claims, and let it say "I don't know."
Memory: the window is a budget, not a transcript
Keep recent turns verbatim plus a rolling summary of the rest — never replay the full history every turn (it overflows the window and the bill). For long-lived users, push durable facts to a separate long-term memory (vector or structured) and retrieve them on demand, exactly like documents.
Route per turn — don't retrieve blindly
Classify each turn first: chit-chat and clarifications need no retrieval; factual questions do; some need a tool call. Always-retrieving adds latency and injects irrelevant context that derails the model. A cheap router (rules or a small model) in front pays for itself.
Know when to escalate
A confident wrong answer costs more than an honest handoff. Detect low confidence and out-of-scope — empty retrieval, a frustrated user, a sensitive topic — and route to a human or a safe fallback. The maturity tell of a production bot is that it knows what it doesn't know.
Guardrails run both ways
Moderate the input and the output. Input: catch prompt-injection and abuse before they reach the model. Output: block unsafe, off-brand, or ungrounded replies before they reach the user. Treat user and retrieved text as untrusted — never let it silently rewrite your instructions.
Latency is a product feature
Stream tokens — time-to-first-token drives perceived speed far more than total time. Run retrieval and tool calls in parallel where you can, and keep each turn's prompt lean (summaries, not transcripts) so prefill stays cheap. A snappy stream beats a slow perfect answer.
What breaks at scale
Many concurrent sessions make session state and memory storage real infrastructure, make per-chat cost ceilings necessary to stop runaway loops, and force explicit retention/privacy rules for stored history. Watch cost-per-conversation and escalation rate as your two north-star ops metrics.
In production
Intercom Fin, ChatGPT, and support bots combine windowed memory + summarization, optional RAG, streaming, and in/out moderation — with an escalation path to a human. The maturity tell is the handoff: good bots know when they do not know and route to a person.
Common mistakes
- Stuffing full history every turn → overflow + cost
- Retrieving on every turn, even chit-chat
- No escalation path → confident wrong answers
- Memory with no summarization → window blowup