LLM Evaluation: How to test prompts, RAG and agents
A production-grade guide to LLM evaluation - what breaks without it, what to measure, how to write an LLM-as-judge and an eval runner, and how to gate releases the way you gate code on tests.
Evaluation is the highest-leverage practice in LLMOps and the one teams most often skip. Get it right and every other change - a new prompt, a cheaper model, a different retriever - becomes a measured decision instead of a gamble. Get it wrong and you are shipping on vibes.
The two questions
At every layer of the stack there is a gap between the prototype question and the production one. For evaluation:
Prototype: “Does it answer correctly in the demo I just ran?” Production: “Does it pass evals on real cases, every time we change something?”
The demo answers one input, once. Production has to answer thousands, after each edit, without regressing the cases that matter. An eval dataset - a curated set of inputs with checkable expectations - is what closes that gap.
What breaks in production without it
These are the failure modes an eval set prevents:
- You can’t tell if a change helped. A prompt tweak “feels better”; you ship it; a week later a different cohort is worse. With no baseline, you never find out which change did it.
- Silent regressions. A model or provider update subtly shifts behaviour. No eval gate means no alarm until users complain.
- A global pass rate that lies. “92% accuracy” can hide a 40% pass rate on the one workflow - refunds, medical, legal - where being wrong is expensive.
- Over-fitting. If you only ever tune against the same 20 cases, you optimise for them and regress everything else.
- Unrepresentative data. An eval set of inputs you imagined tests a product that doesn’t exist. Real traffic is harder and weirder.
What to measure
Pick metrics tied to how the system actually fails, not generic “quality”:
- Task success / accuracy - did it do the job? (exact match, or rubric.)
- Faithfulness - is the answer grounded in the provided context? The core metric for RAG.
- Safety & refusal - does it refuse what it should, and only that?
- Relevance & tone - usually judged by an LLM-as-judge with a rubric.
Always slice these by a risk_category so a healthy average can’t hide a broken
high-stakes category.
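A minimal sketch of that slicing, assuming each result is a dict holding the original case and a boolean passed flag (the shape the runner later in this guide produces). The "uncategorized" fallback and the sample data are made up for illustration:

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Group results by risk_category and return {category: pass_rate}.

    Assumes each result looks like {"case": {...}, "passed": bool}.
    Rows without a risk_category land in "uncategorized" (an assumption).
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        category = r["case"].get("risk_category", "uncategorized")
        totals[category][0] += int(r["passed"])
        totals[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}


# Made-up results: nine general cases pass, the single refund case fails.
results = (
    [{"case": {"risk_category": "general"}, "passed": True}] * 9
    + [{"case": {"risk_category": "refunds"}, "passed": False}]
)
overall = sum(r["passed"] for r in results) / len(results)
by_category = pass_rate_by_category(results)
```

Here the overall pass rate is 90% while the refunds slice sits at 0%, which is exactly the situation a single global number hides.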
Instrument it: the eval row
Keep each case declarative - input, expected behaviour, and machine-checkable assertions:
{
  "input": "Can I get a refund after 45 days?",
  "expected_behavior": "Answer using refund policy only",
  "must_include": ["policy window", "support escalation"],
  "must_not_include": ["invented exceptions"],
  "risk_category": "customer_support_policy"
}
The must_include / must_not_include lists turn a fuzzy judgement into a
deterministic check you can run for free, with no second model. Build your first
set from real failures - see how to build your first eval dataset.
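Declarative rows are easy to validate before a run. The sketch below assumes the cases are stored as JSON Lines (one row per line), which the guide does not specify; `validate_case` and `load_cases` are hypothetical helper names, and treating `input` and `risk_category` as required is one reasonable choice, not a rule:

```python
import json

REQUIRED_FIELDS = {"input", "risk_category"}  # assumed minimum per row
LIST_FIELDS = ("must_include", "must_not_include")


def validate_case(case: dict) -> list[str]:
    """Return a list of problems with one eval row (empty list means valid)."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - case.keys())]
    for name in LIST_FIELDS:
        value = case.get(name, [])
        if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
            problems.append(f"{name} must be a list of strings")
    # A row with no assertions can never fail, so flag it.
    if not any(case.get(name) for name in LIST_FIELDS):
        problems.append("no assertions: add must_include or must_not_include")
    return problems


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text and raise on the first invalid row."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        case = json.loads(line)
        problems = validate_case(case)
        if problems:
            raise ValueError(f"line {line_no}: {'; '.join(problems)}")
        cases.append(case)
    return cases
```

Failing fast on a malformed row keeps a typo in the dataset from showing up later as a confusing pass or fail.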
Instrument it: the LLM-as-judge
For subjective qualities (faithfulness, helpfulness), a second model scores the answer against a narrow rubric. Keep its job binary and grounded - return JSON so you can aggregate it:
import json

from openai import OpenAI  # assumes the OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def faithfulness_judge(answer: str, context: str) -> dict:
    # Narrow, binary rubric: every claim must be supported by the context.
    prompt = (
        "Using ONLY the context, is every claim in the answer supported?\n"
        'Reply JSON: {"verdict": "supported|unsupported", "unsupported": [...]}\n\n'
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # JSON mode, so the reply parses
    )
    return json.loads(out.choices[0].message.content)
Calibrate the judge before you trust it: have a human label ~30 cases and compare. If they agree, automate; if not, tighten the rubric. Re-calibrate when you change the judge model. (The canonical study on this is Judging LLM-as-a-Judge with MT-Bench - linked in the sources below.)
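One way to put a number on "agree" is raw agreement plus Cohen's kappa, which corrects for the agreement two raters would reach by chance. This is a sketch, not a prescribed method: the `agreement` helper is hypothetical, the labels are made up, and what counts as good enough is your call:

```python
def agreement(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: probability both raters pick the same label,
    # given each rater's own label frequencies.
    labels = set(human) | set(judge)
    expected = sum((human.count(label) / n) * (judge.count(label) / n) for label in labels)
    if expected == 1.0:
        # Both raters used one identical label; kappa is undefined here,
        # so report 1.0 by convention (an assumption of this sketch).
        return observed, 1.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up labels: the judge disagrees with the human on one case out of ten.
human = ["supported"] * 6 + ["unsupported"] * 4
judge = ["supported"] * 5 + ["unsupported"] * 5
observed, kappa = agreement(human, judge)
```

On this sample the raw agreement is about 0.9 and kappa about 0.8. Kappa is the more honest of the two when one label dominates, because a judge that always says "supported" can score high raw agreement without judging anything.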
Run it like a test suite
An eval that runs manually gets skipped. A minimal runner:
def run_evals(dataset, answer_fn):
    results = []
    for case in dataset:
        answer = answer_fn(case["input"]).lower()
        # Case-insensitive substring checks: every required phrase present,
        # no forbidden phrase present.
        deterministic = (
            all(s.lower() in answer for s in case.get("must_include", []))
            and not any(s.lower() in answer for s in case.get("must_not_include", []))
        )
        results.append({"case": case, "passed": deterministic})
    # Guard against an empty dataset so the division cannot fail.
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results
Then gate it in CI so a regression can’t merge:
# .github/workflows/eval.yml (sketch)
on:
  pull_request:
    paths: ['prompts/**', 'src/**', 'evals/**']
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python run_evals.py --data evals/
        # fail the build if pass_rate drops below threshold
        # or any case tagged "critical" regresses
This is the whole point: evaluation gives an LLM app the same safety net unit tests give normal code.
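The two failure conditions named in the workflow comments could be sketched as follows. The `id` and `tags` fields, the baseline mapping and the 0.9 threshold are all illustrative assumptions; none of them is part of the row format shown earlier:

```python
import sys


def gate(pass_rate, results, baseline_passed, threshold=0.9):
    """Return a list of reasons to fail the build (empty list means pass).

    baseline_passed maps a case id to whether it passed on the main branch.
    The "id" and "tags" fields and the default threshold are assumptions.
    """
    failures = []
    if pass_rate < threshold:
        failures.append(f"pass rate {pass_rate:.2%} is below threshold {threshold:.2%}")
    for r in results:
        case = r["case"]
        is_critical = "critical" in case.get("tags", [])
        passed_before = baseline_passed.get(case.get("id"), False)
        # A regression is a critical case that used to pass and now fails.
        if is_critical and passed_before and not r["passed"]:
            failures.append(f"critical case regressed: {case.get('id')}")
    return failures


def main(pass_rate, results, baseline_passed):
    failures = gate(pass_rate, results, baseline_passed)
    for reason in failures:
        print(f"EVAL GATE: {reason}", file=sys.stderr)
    return 1 if failures else 0  # a non-zero exit code fails the CI step
```

In a script you would end with something like `sys.exit(main(...))`, so that any failure reason turns into a red build.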
Minimal vs mature
Ship the Minimal column to get a signal today; the Production-grade column is what makes it trustworthy.
| Aspect | Minimal | Production-grade |
|---|---|---|
| Dataset | 20–50 real cases | 100s, growing from every incident |
| Scoring | must_include / must_not_include | + LLM-as-judge, calibrated |
| Coverage | Top intents | Sliced by risk_category |
| Cadence | Run before a big change | Gated in CI on every change |
| Regressions | Spotted by hand | Block the deploy automatically |
Evaluating RAG and agents
- RAG: separate retrieval quality from generation quality - most “wrong answer” bugs are retrieval bugs. Measure whether the right context was fetched before you judge the answer. See RAGOps.
- Agents: evaluate at the step level, not just the final answer. Use the trace to check the agent chose the right tool with the right arguments - a correct final answer can still hide a wrong, expensive path.
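Both checks can start small. The sketch below uses recall@k as one possible retrieval metric and a plain sequence comparison for agent steps. The document ids, the trace shape (a list of `{"tool": ..., "args": ...}` dicts) and both function names are hypothetical, so adapt them to whatever your retriever and tracing tool emit:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))


def check_tool_calls(trace, expected):
    """Compare the tool calls in an agent trace with the expected sequence.

    Both arguments are lists of {"tool": str, "args": dict} (assumed shape).
    Returns a list of mismatches; an empty list means the path was as expected.
    """
    mismatches = []
    if len(trace) != len(expected):
        mismatches.append(f"expected {len(expected)} tool calls, got {len(trace)}")
    for i, (actual, want) in enumerate(zip(trace, expected)):
        if actual["tool"] != want["tool"]:
            mismatches.append(f"step {i}: called {actual['tool']}, expected {want['tool']}")
        elif actual["args"] != want["args"]:
            mismatches.append(f"step {i}: wrong arguments for {want['tool']}")
    return mismatches
```

Run the retrieval check first: if the right context never made it into the top k, a faithfulness failure downstream is a retrieval bug, not a generation bug.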
Tools and where to go next
You don’t have to build the harness yourself - the evals category of the directory covers Braintrust, LangSmith, OpenAI Evals and Giskard, filterable by open-source, self-hostable and more.
Then turn it into a habit: the evaluation items in the Production Checklist are the bar to clear, and the Maturity Score tells you how your eval practice compares across the rest of the stack. For a guided route in, the learning page lists a free short course on evaluating and debugging generative AI.