Agentic AI Systems

LLM Eval Pipeline

Measure LLM/agent quality so every prompt, model, or pipeline change is a verifiable diff.

Requirements

Functional

  • Versioned eval datasets
  • Offline scoring (rules + LLM-judge)
  • Online quality signals
  • Regression gating in CI

Non-functional

  • Reproducible runs
  • Bias-controlled judging
  • Cheap enough to run often

Scale

Thousands of cases, frequent runs

The approach

Curate datasets from real production queries, especially failed ones, with explicit success criteria, stratified by segment. Score offline with deterministic checks (must-contain, must-cite, must-refuse) plus a calibrated LLM-judge (pairwise, order-randomized). Gate changes in CI on the metric diff against a baseline. Sample live traffic for online evals, and feed every incident back as a new case.
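
The order-randomized pairwise judging mentioned above can be pictured with a minimal sketch. The `Judge` callable and its "first" / "second" / "tie" return values are assumptions made for illustration; a real judge would wrap a model call and parse its verdict.

```python
import random
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, prompt: str, baseline: str,
                     candidate: str, rng: random.Random) -> str:
    """Return 'baseline', 'candidate', or 'tie', hiding which system is which."""
    # Flip a coin for presentation order so position bias cannot
    # systematically favor one system.
    candidate_first = rng.random() < 0.5
    if candidate_first:
        raw = judge(prompt, candidate, baseline)
    else:
        raw = judge(prompt, baseline, candidate)
    if raw == "tie":
        return "tie"
    picked_first = raw == "first"
    # Map the positional verdict back to the system that held that position.
    return "candidate" if picked_first == candidate_first else "baseline"


def win_rate(verdicts: list[str]) -> float:
    """Candidate win rate over decided comparisons (ties excluded)."""
    decided = [v for v in verdicts if v != "tie"]
    if not decided:
        return 0.5
    return decided.count("candidate") / len(decided)


# A purely position-biased stand-in judge that always prefers the first answer.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"


rng = random.Random(0)  # seeded so the run is reproducible
biased_verdicts = [
    pairwise_verdict(always_first, "q", "old answer", "new answer", rng)
    for _ in range(1000)
]
biased_win_rate = win_rate(biased_verdicts)
```

With randomized order, a judge that only ever prefers the first position ends up near a 50% win rate, so its bias shows up as noise rather than as a false win for either system.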

Key components

Dataset store (versioned) · runner · rule scorers · LLM-judge (calibrated) · CI gate · online sampler

Senior deep-dive

Evals are a test suite for non-deterministic systems. Without them, every prompt or model change is a blind diff.

An LLM-judge must be calibrated against human labels and bias-controlled (randomize order, prefer pairwise) or you optimize the wrong thing.

Source cases from production failures, not invented happy paths, and grow the set from every incident, or it goes stale.

Offline and online evals catch different failures

Offline evals (a fixed, scored dataset run in CI) catch regressions before ship. Online evals (sampled live traffic) catch distribution shift: the inputs your dataset never anticipated. You need both: offline is your gate, online is your smoke detector.
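
As a rough illustration of the online side, the sketch below picks a stable fraction of live requests for evaluation by hashing a request id. The function name, the id format, and the 10% rate are assumptions for this example, not part of any particular tool.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for online evals."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare it
    # with the rate scaled to the same range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


# Hypothetical request ids; the same id always gets the same decision,
# so retries and replays do not change what is sampled.
request_ids = [f"req-{i}" for i in range(10_000)]
sampled = [rid for rid in request_ids if should_sample(rid, 0.10)]
sampled_fraction = len(sampled) / len(request_ids)
```

Hashing instead of calling a random number generator keeps the decision reproducible across services, which helps when a sampled request later needs to be traced back and turned into a dataset case.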

The LLM-judge is the part most teams get wrong

An uncalibrated judge can be confidently wrong. Anchor it to a human-labeled slice and report chance-corrected agreement (e.g. Cohen's κ); if agreement is low, fix the rubric before trusting any score. Control bias: randomize answer order, prefer pairwise comparison, and watch for length and position favoritism.
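
To make the agreement check concrete, here is a small sketch of Cohen's κ, computed as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The label values and the ten-item slice are made up for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: 8 of 10 items agree, with balanced classes.
human_labels = ["pass"] * 5 + ["fail"] * 5
judge_labels = ["pass"] * 4 + ["fail"] * 5 + ["pass"]
kappa = cohens_kappa(human_labels, judge_labels)  # approximately 0.6
```

Here raw agreement is 80%, but half of that would be expected by chance with balanced labels, so κ comes out near 0.6. Reporting κ rather than raw agreement avoids overrating a judge on a skewed dataset.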

Cheap deterministic checks first, judges second

Many regressions are caught by rule scorers (must-cite, must-refuse, valid-JSON, exact-match), which cost almost nothing to run and are fully reproducible. Run them first and reserve the expensive LLM-judge for the subjective dimensions (helpfulness, tone, reasoning) where it is actually needed.
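
The rule scorers named above might look something like the following sketch. The bracketed-number citation format and the refusal phrases are assumptions chosen for the example; a real suite would match whatever conventions its own system uses.

```python
import json
import re
from typing import Callable

Scorer = Callable[[str], bool]


def must_contain(phrase: str) -> Scorer:
    """Build a scorer that passes when the output mentions `phrase`."""
    return lambda output: phrase.lower() in output.lower()


def must_cite(output: str) -> bool:
    """Assumed citation format: bracketed numeric markers such as [1]."""
    return re.search(r"\[\d+\]", output) is not None


def must_refuse(output: str) -> bool:
    """Assumed refusal phrasing; a real list would be broader and reviewed."""
    markers = ("i can't help", "i cannot help", "i won't")
    lowered = output.lower()
    return any(marker in lowered for marker in markers)


def valid_json(output: str) -> bool:
    """Pass when the whole output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def exact_match(expected: str) -> Scorer:
    """Build a scorer that compares output to `expected`, ignoring outer whitespace."""
    return lambda output: output.strip() == expected.strip()


def run_rules(output: str, rules: dict[str, Scorer]) -> dict[str, bool]:
    """Apply every named rule to one model output."""
    return {name: rule(output) for name, rule in rules.items()}


# Hypothetical case: an answer that should mention Paris and cite a source.
rules = {"mentions_paris": must_contain("Paris"), "cites": must_cite}
report = run_rules("The capital of France is Paris [1].", rules)
```

Because each rule is a pure function of the output string, the same output always yields the same result, which is what makes these checks safe to run on every commit.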

Datasets are the asset — version and grow them

A few hundred well-chosen, stratified cases beat thousands of random ones. Version the dataset like code, stratify by segment and failure-mode, and turn every production incident into a new case. A static eval set silently rots as inputs drift, giving false confidence.
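
One possible way to version a dataset like code is to derive its version id from its content, so any added or edited case produces a new id that can be recorded with each run. The case fields below (`id`, `segment`, `failure_mode`, `input`, `criteria`) are hypothetical names for this sketch.

```python
import hashlib
import json
from collections import Counter


def dataset_version(cases: list[dict]) -> str:
    """Content-derived version id: same cases in any order give the same id."""
    # Serialize each case canonically, then sort so ordering does not matter.
    canonical = sorted(
        json.dumps(case, sort_keys=True, ensure_ascii=False) for case in cases
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]


def coverage(cases: list[dict]) -> Counter:
    """Count cases per (segment, failure_mode) to spot thin strata."""
    return Counter((case["segment"], case["failure_mode"]) for case in cases)


# Illustrative cases only.
cases = [
    {"id": "c1", "segment": "billing", "failure_mode": "hallucinated_policy",
     "input": "Can I get a refund after 90 days?", "criteria": "must-cite"},
    {"id": "c2", "segment": "billing", "failure_mode": "missed_refusal",
     "input": "Give me another customer's invoice.", "criteria": "must-refuse"},
    {"id": "c3", "segment": "onboarding", "failure_mode": "hallucinated_policy",
     "input": "Is SSO included in the free plan?", "criteria": "must-cite"},
]

version_before = dataset_version(cases)

# An incident becomes a new case, and the version id changes with it.
incident_case = {"id": "c4", "segment": "onboarding",
                 "failure_mode": "missed_refusal",
                 "input": "Reset the admin password for me.",
                 "criteria": "must-refuse"}
version_after = dataset_version(cases + [incident_case])
strata = coverage(cases + [incident_case])
```

Storing the version id next to every score makes it clear when two runs are not comparable because the underlying cases changed.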

Gate changes in CI on the diff

Treat evals like unit tests: block the merge if the metric regresses past a threshold, and report the per-case diff so a reviewer sees exactly what changed. Without a gate, quality erodes one "small" prompt tweak at a time and nobody notices until users do.
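
A gate of this kind can be sketched as a comparison of per-case pass/fail results between a baseline run and a candidate run. The 2-point `max_drop` threshold and the case ids are assumptions for illustration; teams would pick their own tolerance.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    passed: bool
    baseline_rate: float
    candidate_rate: float
    regressed: list[str]  # cases that passed before and fail now
    fixed: list[str]      # cases that failed before and pass now


def gate(baseline: dict[str, bool], candidate: dict[str, bool],
         max_drop: float = 0.02) -> GateResult:
    """Compare two runs on their shared cases and decide whether to block."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared case ids to compare")
    baseline_rate = sum(baseline[c] for c in shared) / len(shared)
    candidate_rate = sum(candidate[c] for c in shared) / len(shared)
    regressed = [c for c in shared if baseline[c] and not candidate[c]]
    fixed = [c for c in shared if not baseline[c] and candidate[c]]
    passed = (baseline_rate - candidate_rate) <= max_drop
    return GateResult(passed, baseline_rate, candidate_rate, regressed, fixed)


# Illustrative results keyed by hypothetical case id.
baseline_run = {"case-1": True, "case-2": True, "case-3": False, "case-4": True}
candidate_run = {"case-1": True, "case-2": False, "case-3": True, "case-4": True}
result = gate(baseline_run, candidate_run)

# In a CI step, the exit code would carry the decision, for example:
#   sys.exit(0 if result.passed else 1)
```

In this example the headline pass rate is unchanged, so the gate passes, yet the per-case diff shows that one case regressed while another was fixed. That is the detail a reviewer needs and a single aggregate number hides.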

What breaks at scale

Frequent runs make judge cost and latency matter: sample, cache, and use small judge models where they suffice. As suites grow, flaky judges and dataset drift become the maintenance burden; track judge–human agreement over time, not just the headline score.
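
Caching is the simplest of these cost controls to illustrate. The sketch below wraps a judge so that an unchanged (prompt, answer) pair is scored only once per judge version. The `CachedJudge` class, the two-argument judge signature, and the version string are hypothetical, and an in-memory dict stands in for whatever persistent store a real pipeline would use.

```python
import hashlib
import json
from typing import Callable


class CachedJudge:
    """Wrap a judge so identical inputs are only scored once per judge version."""

    def __init__(self, judge: Callable[[str, str], float], judge_version: str):
        self._judge = judge
        self._version = judge_version
        self._cache: dict[str, float] = {}
        self.calls = 0  # number of real (uncached) judge invocations

    def _key(self, prompt: str, answer: str) -> str:
        # Include the judge version so a rubric or model change
        # invalidates old scores instead of silently reusing them.
        payload = json.dumps([self._version, prompt, answer])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, prompt: str, answer: str) -> float:
        key = self._key(prompt, answer)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge(prompt, answer)
        return self._cache[key]


# Stand-in judge for the example; a real one would call a model.
def length_judge(prompt: str, answer: str) -> float:
    return min(len(answer) / 100, 1.0)


cached = CachedJudge(length_judge, judge_version="rubric-v1")
first = cached.score("Explain caching.", "Store results to avoid recomputation.")
second = cached.score("Explain caching.", "Store results to avoid recomputation.")
```

The trade-off is that a cached score freezes one sample of a possibly noisy judge, so it suits unchanged outputs between runs and is less appropriate when measuring judge flakiness itself.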

In production

OpenAI Evals, Braintrust, LangSmith, and Humanloop are examples of tooling built around this pattern: versioned datasets, rule and LLM-judge scorers, and regression checks on the diff that can run in CI. Teams that ship LLM features reliably treat evals like unit tests, and every incident becomes a new case.

Part of Agentic AI Systems on SystemLore: system design interview prep with 148 deep topics.