LLM Eval Pipeline
Measure LLM/agent quality so every prompt, model, or pipeline change is a verifiable diff.
Requirements
Functional
- Versioned eval datasets
- Offline scoring (rules + LLM-judge)
- Online quality signals
- Regression gating in CI
Non-functional
- Reproducible runs
- Bias-controlled judging
- Cheap enough to run often
Scale
Thousands of cases, frequent runs
The approach
Curate datasets from real production queries, especially failed ones, with explicit success criteria, stratified by segment. Score offline with deterministic checks (must-contain, must-cite, must-refuse) plus a calibrated LLM-judge (pairwise, order-randomized). Gate changes in CI on the diff. Sample live traffic for online evals; feed every incident back as a new case.
Key components
Dataset store (versioned) · runner · rule scorers · LLM-judge (calibrated) · CI gate · online sampler
Numbers that matter
- A few hundred well-chosen, stratified cases beat thousands of random ones — failure-mode coverage matters more than volume.
- Calibrate the LLM-judge against a human-labeled slice and report judge–human agreement (e.g. Cohen's κ) before trusting its scores.
- Randomize answer order and prefer pairwise comparison: a judge can flip its verdict when the same two answers are shown in the opposite order (position bias) and tends to favor longer answers (verbosity bias).
- Run cheap deterministic checks (must-cite, must-refuse, valid-JSON) first; they catch most regressions before you spend on judge calls.
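The order-randomization point above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Judge` callable, its "first"/"second" return values, and the function names are all assumptions made for the example. In practice the judge would wrap an LLM call.

```python
import random
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, baseline: str,
                     candidate: str, rng: random.Random) -> str:
    """Return "baseline" or "candidate" after showing the pair in random order."""
    candidate_first = rng.random() < 0.5
    if candidate_first:
        first, second = candidate, baseline
    else:
        first, second = baseline, candidate
    picked_first = judge(question, first, second) == "first"
    # Map the positional pick back to the system that produced the answer.
    return "candidate" if picked_first == candidate_first else "baseline"


def win_rate(judge: Judge, cases: Iterable[Tuple[str, str, str]], seed: int = 0) -> float:
    """Fraction of (question, baseline, candidate) cases the candidate wins."""
    rng = random.Random(seed)  # seeded so a run is reproducible
    verdicts = [pairwise_verdict(judge, q, b, c, rng) for q, b, c in cases]
    return verdicts.count("candidate") / len(verdicts)
```

With this arrangement, a judge that always prefers whichever answer it sees first lands near a 50% win rate instead of systematically rewarding one system, so pure position bias shows up as noise rather than as a false improvement.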
Senior deep-dive
Evals are a test suite for non-determinism — without them, every prompt or model change is a blind diff.
An LLM-judge must be calibrated against human labels and bias-controlled (randomize order, prefer pairwise) or you optimize the wrong thing.
Source cases from production failures, not invented happy paths — and grow the set from every incident, or it rots.
Offline and online evals catch different failures
Offline evals (a fixed judged set in CI) catch regressions before ship. Online evals (sampled live traffic) catch distribution shift — the inputs your dataset never imagined. You need both: offline is your gate, online is your smoke detector.
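One way to build the online sampler is to hash a request identifier and keep a fixed fraction of the hash space. The sketch below assumes each request carries a stable string ID; `should_sample` and its parameters are hypothetical names chosen for illustration.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for online evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing rather than calling a random number generator means the same request is always either in or out of the sample, which keeps reruns and debugging reproducible.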
The LLM-judge is the part most teams get wrong
An uncalibrated judge is confidently wrong. Anchor it to a human-labeled slice and report agreement (e.g. Cohen's κ); if agreement is low, fix the rubric before trusting any score. Control bias: randomize answer order, prefer pairwise comparison, and watch for length/position favoritism.
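Cohen's κ compares observed agreement p_o with the agreement p_e expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). A minimal sketch, assuming the human and judge labels are aligned lists over the same cases (the function name is illustrative):

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between human labels and judge labels on the same cases."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal frequency, summed over labels.
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / n**2
    if expected == 1.0:
        # Kappa is undefined when both raters used one identical label throughout;
        # this sketch returns 1.0 because they agree on every case.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, human labels `[1, 1, 0, 0]` against judge labels `[1, 0, 0, 0]` give p_o = 0.75 and p_e = 0.5, so κ = 0.5. Raw agreement of 75% looks better than it is once chance agreement is removed.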
Cheap deterministic checks first, judges second
Most regressions are caught by rule scorers — must-cite, must-refuse, valid-JSON, exact-match — which are free and reproducible. Run them first and reserve the expensive LLM-judge for the subjective dimensions (helpfulness, tone, reasoning) it is actually needed for.
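The rule scorers named above can be sketched as plain functions from model output to pass/fail. The citation format (`[1]`-style markers) and the refusal phrases below are assumptions for illustration; a real suite would match its own output conventions.

```python
import json
import re
from typing import Callable, Dict

Scorer = Callable[[str], bool]

# Hypothetical refusal phrases; tune these to the model under test.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def valid_json(output: str) -> bool:
    """Pass if the whole output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def must_cite(output: str) -> bool:
    """Pass if the output has at least one [n]-style citation marker (assumed format)."""
    return re.search(r"\[\d+\]", output) is not None


def must_refuse(output: str) -> bool:
    """Pass if the output contains a known refusal phrase (a crude heuristic)."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def exact_match(expected: str) -> Scorer:
    """Build a scorer that passes when output equals `expected`, ignoring outer whitespace."""
    return lambda output: output.strip() == expected.strip()


def run_rules(output: str, rules: Dict[str, Scorer]) -> Dict[str, bool]:
    """Apply every named rule and return a per-rule pass/fail map."""
    return {name: rule(output) for name, rule in rules.items()}
```

Because each scorer is a pure function of the output string, results are identical on every run, and only cases that need subjective grading have to go on to the judge.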
Datasets are the asset — version and grow them
A few hundred well-chosen, stratified cases beat thousands of random ones. Version the dataset like code, stratify by segment and failure-mode, and turn every production incident into a new case. A static eval set silently rots as inputs drift, giving false confidence.
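A minimal sketch of "version the dataset like code", assuming cases are small records held in memory. The `EvalCase` fields and the helper names are hypothetical; a real store might instead keep cases in files under version control.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    query: str
    success_criteria: str
    segment: str        # e.g. a customer or product segment
    failure_mode: str   # e.g. "hallucination", "missed-refusal"


def dataset_version(cases: Iterable[EvalCase]) -> str:
    """Content hash of the dataset: any added, removed, or edited case changes it."""
    records = sorted((asdict(c) for c in cases), key=lambda d: d["case_id"])
    canonical = json.dumps(records, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def coverage(cases: Iterable[EvalCase]) -> Counter:
    """Count cases per (segment, failure_mode) stratum to expose thin or empty cells."""
    return Counter((c.segment, c.failure_mode) for c in cases)
```

Recording the version string alongside every eval run makes it possible to tell whether a score moved because the system changed or because the dataset did.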
Gate changes in CI on the diff
Treat evals like unit tests: block the merge if the metric regresses past a threshold, and report the per-case diff so a reviewer sees exactly what changed. Without a gate, quality erodes one "small" prompt tweak at a time and nobody notices until users do.
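The gate described above might look like the sketch below. It assumes each run produces a map from case ID to pass/fail; the 0.02 default for `max_drop` is an arbitrary placeholder, not a recommended threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateResult:
    passed: bool
    baseline_rate: float
    candidate_rate: float
    regressions: List[str]  # cases that passed on baseline and fail on candidate
    fixes: List[str]        # cases that failed on baseline and pass on candidate


def gate(baseline: Dict[str, bool], candidate: Dict[str, bool],
         max_drop: float = 0.02) -> GateResult:
    """Compare two runs case by case and fail if the pass rate drops by more than max_drop."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no case IDs; nothing to compare")
    base_rate = sum(baseline[c] for c in shared) / len(shared)
    cand_rate = sum(candidate[c] for c in shared) / len(shared)
    regressions = [c for c in shared if baseline[c] and not candidate[c]]
    fixes = [c for c in shared if not baseline[c] and candidate[c]]
    passed = (base_rate - cand_rate) <= max_drop
    return GateResult(passed, base_rate, cand_rate, regressions, fixes)
```

In a CI job, the script would print `regressions` and `fixes` for the reviewer and exit non-zero when `passed` is false. Listing regressions separately matters even when the headline rate holds steady, since new fixes can mask newly broken cases.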
What breaks at scale
Frequent runs make judge cost and latency matter — sample, cache, and use small judge models where they suffice. As suites grow, flaky judges and dataset drift become the maintenance burden; track judge–human agreement over time, not just the headline score.
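Caching judge verdicts is one way to cut repeat cost. The sketch below keys the cache on the judge model, a rubric version, and the exact question and answer, so changing any of them forces a fresh call. `JudgeCache` and its parameters are hypothetical, and the in-memory dict stands in for whatever persistent store a real pipeline would use.

```python
import hashlib
import json
from typing import Callable, Dict


class JudgeCache:
    """Memoize judge scores so unchanged (question, answer) pairs are not re-judged."""

    def __init__(self, judge_fn: Callable[[str, str], float],
                 judge_model: str, rubric_version: str) -> None:
        self._judge_fn = judge_fn
        self._judge_model = judge_model
        self._rubric_version = rubric_version
        self._store: Dict[str, float] = {}
        self.misses = 0  # number of real judge calls made

    def _key(self, question: str, answer: str) -> str:
        # Include model and rubric so a new judge or rubric never reuses old verdicts.
        payload = json.dumps([self._judge_model, self._rubric_version, question, answer])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, question: str, answer: str) -> float:
        key = self._key(question, answer)
        if key not in self._store:
            self._store[key] = self._judge_fn(question, answer)
            self.misses += 1
        return self._store[key]
```

When a prompt change alters only a fraction of outputs, only those outputs reach the judge again; the rest are served from the cache.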
In production
Tools such as OpenAI Evals, Braintrust, LangSmith, and Humanloop are built around the same pieces: versioned datasets, rule and LLM-judge scorers, and run-to-run comparison that can back a CI gate on the diff. Teams that ship LLM features reliably treat evals like unit tests, and every incident becomes a new case.
Common mistakes
- Uncalibrated LLM-judge → confident wrong scores
- Easy invented cases → false confidence
- Static dataset → rots as inputs drift
- No CI gate → silent regressions ship