Agentic AI Systems

Conversational Chatbot

A multi-turn assistant that stays grounded, remembers context, and escalates when unsure.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Multi-turn dialogue
Grounded answers (RAG)
Conversation memory
Escalate / handoff

Non-functional

Low latency, streamed
Safe + on-topic
Cost-bounded per chat

Scale

Many concurrent sessions

The approach

Per turn: manage the window (summarize old turns), route the query (retrieve, call a tool, or answer directly), ground via RAG when factual, stream the response, and apply input/output guardrails. Persist conversation memory; detect low-confidence/out-of-scope and hand off to a human or fallback.

Key components

Session store · window manager · query router · RAG retriever · LLM · guardrail filter · escalation path

Numbers that matter

Keep the last ~5–10 turns verbatim plus a rolling summary of the rest — the window is a budget, not a transcript.
Route per turn: many messages (chit-chat, clarifications) need no retrieval — skipping it cuts latency and avoids derailing context.
Stream tokens — time-to-first-token (~hundreds of ms) drives perceived speed far more than total completion time.
Set a per-chat cost / tool-call ceiling; long multi-turn sessions are a top cost surprise.

Senior deep-dive

The context window is working memory, not storage — summarize or trim old turns or you overflow and lose the goal.

Not every turn needs retrieval — route per turn so chit-chat stays fast and irrelevant context doesn't derail the answer.

A confident wrong answer is worse than an escalation — detect uncertainty, ground factual claims, and let it say "I don't know."

Memory: the window is a budget, not a transcript

Keep recent turns verbatim plus a rolling summary of the rest — never replay the full history every turn (it overflows the window and the bill). For long-lived users, push durable facts to a separate long-term memory (vector or structured) and retrieve them on demand, exactly like documents.

Route per turn — don't retrieve blindly

Classify each turn first: chit-chat and clarifications need no retrieval; factual questions do; some need a tool call. Always-retrieving adds latency and injects irrelevant context that derails the model. A cheap router (rules or a small model) in front pays for itself.

Know when to escalate

A confident wrong answer costs more than an honest handoff. Detect low confidence and out-of-scope — empty retrieval, a frustrated user, a sensitive topic — and route to a human or a safe fallback. The maturity tell of a production bot is that it knows what it doesn't know.

Guardrails run both ways

Moderate the input and the output. Input: catch prompt-injection and abuse before they reach the model. Output: block unsafe, off-brand, or ungrounded replies before they reach the user. Treat user and retrieved text as untrusted — never let it silently rewrite your instructions.

Latency is a product feature

Stream tokens — time-to-first-token drives perceived speed far more than total time. Run retrieval and tool calls in parallel where you can, and keep each turn's prompt lean (summaries, not transcripts) so prefill stays cheap. A snappy stream beats a slow perfect answer.

What breaks at scale

Many concurrent sessions make session state and memory storage real infrastructure, make per-chat cost ceilings necessary to stop runaway loops, and force explicit retention/privacy rules for stored history. Watch cost-per-conversation and escalation rate as your two north-star ops metrics.

In production

Intercom Fin, ChatGPT, and support bots combine windowed memory + summarization, optional RAG, streaming, and in/out moderation — with an escalation path to a human. The maturity tell is the handoff: good bots know when they do not know and route to a person.

Common mistakes

Stuffing full history every turn → overflow + cost
Retrieving on every turn, even chit-chat
No escalation path → confident wrong answers
Memory with no summarization → window blowup

Related Agentic AI Systems

Part of Agentic AI Systems on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →