LLM Evaluation: How to test prompts, RAG and agents

A production-grade guide to LLM evaluation - what breaks without it, what to measure, how to write an LLM-as-judge and an eval runner, and how to gate releases the way you gate code on tests.

Evaluation is the highest-leverage practice in LLMOps and the one teams most often skip. Get it right and every other change - a new prompt, a cheaper model, a different retriever - becomes a measured decision instead of a gamble. Get it wrong and you are shipping on vibes.

The two questions

At every layer of the stack there is a gap between the question you ask of a prototype and the one you ask of a production system. For evaluation:

Prototype: “Does it answer correctly in the demo I just ran?”
Production: “Does it pass evals on real cases, every time we change something?”

The demo answers one input, once. Production has to answer thousands, after each edit, without regressing the cases that matter. An eval dataset - a curated set of inputs with checkable expectations - is what closes that gap.

What breaks in production without it

These are the failure modes an eval set prevents:

You can’t tell if a change helped. A prompt tweak “feels better”; you ship it; a week later a different cohort is worse. With no baseline, you never find out which change did it.
Silent regressions. A model or provider update subtly shifts behaviour. No eval gate means no alarm until users complain.
A global pass rate that lies. “92% accuracy” can hide a 40% pass rate on the one workflow - refunds, medical, legal - where being wrong is expensive.
Over-fitting. If you only ever tune against the same 20 cases, you optimise for them and regress everything else.
Unrepresentative data. An eval set of inputs you imagined tests a product that doesn’t exist. Real traffic is harder and weirder.
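
The first two failure modes come down to having no baseline to compare against. A minimal sketch of that comparison follows, assuming each eval run is stored as a mapping from a case ID to a pass/fail flag. The `compare_runs` name, the case IDs and the storage format are illustrative assumptions, not part of any tool mentioned in this guide.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Diff two eval runs, each a mapping of case ID -> passed (assumed format)."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no case IDs")
    return {
        "baseline_pass_rate": sum(baseline[c] for c in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[c] for c in shared) / len(shared),
        # Passed before the change, fails after it.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        # Failed before the change, passes after it.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Hypothetical runs: before and after a prompt tweak.
baseline = {"refund-01": True, "refund-02": True, "greeting-01": False}
candidate = {"refund-01": True, "refund-02": False, "greeting-01": True}
report = compare_runs(baseline, candidate)
print(report["regressed"], report["fixed"])
```

In this made-up example the pass rate is identical before and after the change, yet one refund case regressed. That is exactly the kind of movement an aggregate number hides and a per-case diff exposes.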

What to measure

Pick metrics tied to how the system actually fails, not generic “quality”:

Task success / accuracy - did it do the job? (exact match, or rubric.)
Faithfulness - is the answer grounded in the provided context? The core metric for RAG.
Safety & refusal - does it refuse what it should, and only that?
Relevance & tone - usually judged by an LLM-as-judge with a rubric.

Always slice these by a risk_category so a healthy average can’t hide a broken high-stakes category.
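
As a rough illustration of slicing, the sketch below groups results by `risk_category` and reports a pass rate per group. It assumes each result is a dict holding the original `case` and a boolean `passed`; the function name and the synthetic data are assumptions chosen to mirror the 92% versus 40% scenario described earlier.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Pass rate per risk_category; cases without one land in 'uncategorised'."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        category = r["case"].get("risk_category", "uncategorised")
        totals[category] += 1
        passes[category] += int(r["passed"])
    return {category: passes[category] / totals[category] for category in totals}


# Synthetic data: 90 general cases (88 pass) and 10 refund cases (4 pass).
results = (
    [{"case": {"risk_category": "general"}, "passed": i < 88} for i in range(90)]
    + [{"case": {"risk_category": "refunds"}, "passed": i < 4} for i in range(10)]
)
overall = sum(r["passed"] for r in results) / len(results)
by_category = pass_rate_by_category(results)
print(f"overall={overall:.0%}", {c: f"{v:.0%}" for c, v in by_category.items()})
```

With these numbers the overall rate looks healthy while the refunds slice is clearly failing, which is the case for reporting every category alongside the average.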

Instrument it: the eval row

Keep each case declarative - input, expected behaviour, and machine-checkable assertions:

{
  "input": "Can I get a refund after 45 days?",
  "expected_behavior": "Answer using refund policy only",
  "must_include": ["policy window", "support escalation"],
  "must_not_include": ["invented exceptions"],
  "risk_category": "customer_support_policy"
}

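The must_include / must_not_include lists turn a fuzzy judgement into a deterministic check you can run for free, with no second model. The runner later in this guide applies them as case-insensitive substring matches, so in practice each entry should be a literal phrase you expect to see (or never see) in the answer rather than a description of one. Build your first set from real failures - see how to build your first eval dataset.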
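
One way to keep rows honest is to validate them when they are loaded. The sketch below assumes the dataset is stored as JSON Lines (one case per line), which this guide does not prescribe; `load_cases` and the choice of required fields are likewise assumptions for illustration.

```python
import json

# Assumption: these three fields are mandatory; the two lists are optional.
REQUIRED_FIELDS = {"input", "expected_behavior", "risk_category"}


def load_cases(lines) -> list[dict]:
    """Parse one JSON object per line and reject rows missing required fields."""
    cases = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        # Default the optional assertion lists so downstream code can rely on them.
        case.setdefault("must_include", [])
        case.setdefault("must_not_include", [])
        cases.append(case)
    return cases


raw_lines = [
    '{"input": "Can I get a refund after 45 days?", '
    '"expected_behavior": "Answer using refund policy only", '
    '"must_include": ["policy window"], '
    '"risk_category": "customer_support_policy"}',
]
cases = load_cases(raw_lines)
print(cases[0]["risk_category"], cases[0]["must_not_include"])
```

Failing fast on a malformed row is a design choice: a case that silently loses its `risk_category` would otherwise disappear from the sliced report.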

Instrument it: the LLM-as-judge

For subjective qualities (faithfulness, helpfulness), a second model scores the answer against a narrow rubric. Keep its job binary and grounded - return JSON so you can aggregate it:

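import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def faithfulness_judge(answer: str, context: str) -> dict:
    """Ask a second model whether every claim in the answer is backed by the context."""
    prompt = (
        "Using ONLY the context, is every claim in the answer supported?\n"
        'Reply JSON: {"verdict": "supported|unsupported", "unsupported": [...]}\n\n'
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for a JSON object reply
    )
    # The reply is a JSON string; parse it into a dict for aggregation.
    return json.loads(out.choices[0].message.content)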
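
Even with a JSON response format, the judge can return a verdict outside the rubric or omit a field. A small validation step, sketched below, keeps malformed replies from being counted as passes. `parse_verdict` and the extra `"error"` verdict are assumptions added for illustration; they are not part of the judge shown above.

```python
import json

ALLOWED_VERDICTS = ("supported", "unsupported")


def parse_verdict(raw: str) -> dict:
    """Validate a judge reply; anything malformed becomes an explicit 'error' verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "error", "unsupported": []}
    if not isinstance(data, dict) or data.get("verdict") not in ALLOWED_VERDICTS:
        return {"verdict": "error", "unsupported": []}
    claims = data.get("unsupported", [])
    if not isinstance(claims, list):
        claims = []  # ignore a badly typed claims field rather than crash
    return {"verdict": data["verdict"], "unsupported": claims}


good = parse_verdict('{"verdict": "unsupported", "unsupported": ["90-day window"]}')
bad = parse_verdict('{"verdict": "mostly fine"}')
print(good["verdict"], bad["verdict"])
```

Counting `"error"` separately from `"unsupported"` makes it visible when the judge itself is misbehaving, as opposed to the system under test.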

Calibrate the judge before you trust it: have a human label ~30 cases and compare the judge’s verdicts against those labels. If the two consistently agree, automate; if not, tighten the rubric and compare again. Re-calibrate whenever you change the judge model. (The canonical study on this is Judging LLM-as-a-Judge with MT-Bench - linked in the sources below.)
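
The comparison can be as simple as counting matches. The sketch below computes raw agreement and, as an additional statistic this guide does not itself call for, Cohen's kappa, which corrects for agreement expected by chance. The function name and the labels are illustrative assumptions.

```python
from collections import Counter


def agreement(human: list[str], judge: list[str]) -> dict[str, float]:
    """Raw agreement and Cohen's kappa between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}


# Hypothetical labels for ten cases; the judge disagrees with the human on one.
human_labels = ["supported"] * 6 + ["unsupported"] * 4
judge_labels = ["supported"] * 5 + ["unsupported"] * 5
stats = agreement(human_labels, judge_labels)
print(stats)
```

Here the raters agree on nine of ten cases, giving a kappa of about 0.8. What counts as "good enough" is a judgement call for your own risk level; this sketch deliberately sets no threshold.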

Run it like a test suite

An eval that runs manually gets skipped. A minimal runner:

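def run_evals(dataset, answer_fn):
    """Run every case through answer_fn and apply the deterministic checks."""
    results = []
    for case in dataset:
        answer = answer_fn(case["input"])
        text = answer.lower()  # case-insensitive substring matching
        deterministic = (
            all(s.lower() in text for s in case.get("must_include", []))
            and not any(s.lower() in text for s in case.get("must_not_include", []))
        )
        results.append({"case": case, "passed": deterministic})
    if not results:
        # Avoid a ZeroDivisionError and make an empty dataset a loud failure.
        raise ValueError("eval dataset is empty")
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results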

Then gate it in CI so a regression can’t merge:

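# .github/workflows/eval.yml (sketch)
name: eval
on:
  pull_request:
    paths: ['prompts/**', 'src/**', 'evals/**']
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      # run_evals.py must exit non-zero if pass_rate drops below threshold
      # or any case tagged "critical" regresses; that is what fails the build
      - run: python run_evals.py --data evals/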

This is the whole point: evaluation gives an LLM app the same safety net unit tests give normal code.
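
For the CI job to block a merge, the eval script has to turn its results into a non-zero exit code. The sketch below shows one way to do that. The `gate` function, the 0.9 default threshold and the `tags` field on each case are all assumptions; it also treats any failing critical case as blocking, which is stricter than checking only for regressions against a baseline.

```python
def gate(pass_rate: float, results: list[dict], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    critical_failures = [
        r["case"]["input"]
        for r in results
        # "tags" is a hypothetical field; cases without it are never critical.
        if "critical" in r["case"].get("tags", []) and not r["passed"]
    ]
    if pass_rate < threshold:
        print(f"FAIL: pass rate {pass_rate:.1%} is below threshold {threshold:.1%}")
        return 1
    if critical_failures:
        print(f"FAIL: {len(critical_failures)} critical case(s) failed: {critical_failures}")
        return 1
    print(f"OK: pass rate {pass_rate:.1%}")
    return 0


# Synthetic run: 19 routine cases pass, the single critical case fails.
results = [{"case": {"input": f"routine question {i}"}, "passed": True} for i in range(19)]
results.append(
    {"case": {"input": "Can I get a refund after 45 days?", "tags": ["critical"]}, "passed": False}
)
pass_rate = sum(r["passed"] for r in results) / len(results)
exit_code = gate(pass_rate, results)
```

In a real script you would end with `sys.exit(gate(pass_rate, results))`. Note that the synthetic run clears the overall threshold and is still blocked, because the one failing case is tagged critical.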

Minimal vs mature

Ship the Minimal column to get a signal today; the Production-grade column is what makes it trustworthy.

Aspect	Minimal	Production-grade
Dataset	20–50 real cases	100s, growing from every incident
Scoring	`must_include` / `must_not_include` checks	The same checks plus a calibrated LLM-as-judge
Coverage	Top intents	Sliced by `risk_category`
Cadence	Run before a big change	Gated in CI on every change
Regressions	Spotted by hand	Block the deploy automatically

Evaluating RAG and agents

RAG: separate retrieval quality from generation quality - most “wrong answer” bugs are retrieval bugs. Measure whether the right context was fetched before you judge the answer. See RAGOps.
Agents: evaluate at the step level, not just the final answer. Use the trace to check the agent chose the right tool with the right arguments - a correct final answer can still hide a wrong, expensive path.
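
Both checks can start out as plain assertions over logged data. The sketch below assumes each RAG case records the document IDs it expected and the IDs actually retrieved, and that an agent trace is a list of steps with a `tool` name and an `args` dict. Those field names, the tool names and both functions are hypothetical; real traces will depend on your framework.

```python
def retrieval_hit_rate(cases: list[dict], k: int = 5) -> float:
    """Share of cases where at least one expected doc ID is in the top-k retrieved IDs."""
    if not cases:
        raise ValueError("no cases to score")
    hits = sum(
        bool(set(c["expected_doc_ids"]) & set(c["retrieved_doc_ids"][:k]))
        for c in cases
    )
    return hits / len(cases)


def check_tool_calls(trace: list[dict], expected: list[dict]) -> list[str]:
    """Compare an agent trace to the expected tool calls, step by step (0-indexed)."""
    problems = []
    for step, want in enumerate(expected):
        if step >= len(trace):
            problems.append(f"step {step}: missing call to {want['tool']}")
            continue
        got = trace[step]
        if got["tool"] != want["tool"]:
            problems.append(f"step {step}: expected {want['tool']}, got {got['tool']}")
        elif got["args"] != want["args"]:
            problems.append(f"step {step}: wrong arguments for {want['tool']}")
    if len(trace) > len(expected):
        problems.append(f"{len(trace) - len(expected)} unexpected extra step(s)")
    return problems


rag_cases = [
    {"expected_doc_ids": ["refund-policy"], "retrieved_doc_ids": ["refund-policy", "shipping-faq"]},
    {"expected_doc_ids": ["warranty-terms"], "retrieved_doc_ids": ["shipping-faq", "returns-form"]},
]
hit_rate = retrieval_hit_rate(rag_cases)

expected_calls = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "issue_refund", "args": {"order_id": "A1", "amount": 50}},
]
trace = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "issue_refund", "args": {"order_id": "A1", "amount": 500}},
]
problems = check_tool_calls(trace, expected_calls)
print(hit_rate, problems)
```

Exact step-by-step matching is the strictest possible check and will flag harmless reorderings; it is a reasonable starting point for high-stakes tools and can be relaxed for the rest.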

Tools and where to go next

You don’t have to build the harness yourself - the evals category of the directory covers Braintrust, LangSmith, OpenAI Evals and Giskard, filterable by open-source, self-hostable and more.

Then turn it into a habit: the evaluation items in the Production Checklist are the bar to clear, and the Maturity Score tells you how your eval practice compares across the rest of the stack. For a guided route in, the learning page lists a free short course on evaluating and debugging generative AI.

