How to build your first eval dataset

A practical, step-by-step guide to building an LLM eval dataset from real traffic - what a row looks like, how to score it, how many cases you need, and how to wire it into CI.

An eval dataset is the single highest-leverage thing you can build in LLMOps, and the one teams most often put off. The good news: your first useful version takes an afternoon, not a sprint. This is how to build it.

Start from real failures, not imagination

The best eval cases are the failures you’ve already seen. Don’t sit down and invent test questions - mine them:

Pull recent production traces and filter for thumbs-down, escalations, retries or low-confidence answers.
Grab the handful of bug reports where someone said “the bot got this wrong.”
Ask support or sales for the questions that trip it up.

Twenty real, painful cases beat two hundred imagined ones.
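
As a rough sketch of the mining step, assuming your traces can be exported as records with feedback, escalation, retry and confidence fields (the `Trace` shape, its field names and the 0.5 confidence cut-off are all hypothetical, not taken from any particular tracing tool):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    # Hypothetical trace record; real field names depend on your logging setup.
    question: str
    answer: str
    feedback: Optional[str] = None      # "up", "down", or None if no rating
    escalated: bool = False
    retries: int = 0
    confidence: Optional[float] = None  # None if the system reports no score


def is_candidate(trace: Trace, min_confidence: float = 0.5) -> bool:
    """Return True if the trace shows any of the failure signals."""
    low_conf = trace.confidence is not None and trace.confidence < min_confidence
    return (
        trace.feedback == "down"
        or trace.escalated
        or trace.retries > 0
        or low_conf
    )


def mine_candidates(traces, limit=30):
    """Keep failing traces, skipping repeated questions, up to `limit`."""
    seen = set()
    picked = []
    for trace in traces:
        key = trace.question.strip().lower()
        if is_candidate(trace) and key not in seen:
            seen.add(key)
            picked.append(trace)
        if len(picked) >= limit:
            break
    return picked
```

The de-duplication on the normalised question is a deliberate choice: it keeps one copy of each failure rather than thirty copies of the most common one.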

What a row looks like

Keep each row declarative - the input, the expected behaviour, and machine-checkable assertions. A format that works well:

{
  "input": "Can I get a refund after 45 days?",
  "expected_behavior": "Answer using refund policy only",
  "must_include": ["policy window", "support escalation"],
  "must_not_include": ["invented exceptions"],
  "risk_category": "customer_support_policy"
}

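The must_include / must_not_include lists are the trick: they turn a fuzzy “is this answer good?” into concrete, automatable checks you can run on every change. Entries written as literal phrases or patterns can be checked with plain string matching; conceptual entries such as “invented exceptions” need an LLM judge or a human to verify. risk_category lets you slice results later (“how are we doing on policy questions?”).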
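
A minimal sketch of how such a row could be checked, assuming the lists hold literal phrases and that case-insensitive substring matching is good enough (`check_row` is a hypothetical helper, and the phrases in the usage example are illustrative rather than taken from the row above):

```python
def check_row(row: dict, answer: str) -> dict:
    """Check one answer against a row's include/exclude phrase lists."""
    text = answer.lower()
    # Required phrases that do not appear anywhere in the answer.
    missing = [p for p in row.get("must_include", []) if p.lower() not in text]
    # Banned phrases that do appear in the answer.
    forbidden = [p for p in row.get("must_not_include", []) if p.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


# Illustrative row with literal phrases (assumed values).
example_row = {
    "input": "Can I get a refund after 45 days?",
    "must_include": ["30 days", "contact support"],
    "must_not_include": ["exception"],
}
result = check_row(
    example_row,
    "Refunds are available within 30 days. After that, please contact support.",
)
```

Returning the missing and forbidden phrases, not just a boolean, makes a failing case much quicker to debug.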

How many cases?

Enough to be representative, not exhaustive. A useful progression:

20–50 cases to start - covering your top intents and known failure modes.
~100–200 once you’re gating releases on it.
Grow it every time production surfaces a new failure. The eval set is a living artifact, not a one-time deliverable.

Bias toward diversity over volume: one example each of ten different failure shapes is worth more than fifty variations of the same one.
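
One way to keep an eye on diversity is to count cases per category and flag any category that dominates. The sketch below uses risk_category as a stand-in for “failure shape”, and the 40% cap is an arbitrary assumption rather than a recommended value:

```python
from collections import Counter


def coverage_report(rows, max_share=0.4):
    """Count rows per risk_category and list categories above `max_share`."""
    counts = Counter(row.get("risk_category", "uncategorised") for row in rows)
    total = sum(counts.values())
    # Categories that make up more than max_share of the whole dataset.
    dominant = sorted(
        category
        for category, n in counts.items()
        if total and n / total > max_share
    )
    return counts, dominant
```

If one category shows up as dominant, that is a hint to look for new failure shapes before adding more variations of the same one.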

Three ways to score

1. Deterministic checks - must_include / must_not_include, regex, JSON schema validation, exact match for structured outputs. Cheap, fast, no model needed. Use these wherever you can.
2. LLM-as-judge - a second model scores the answer against a rubric (“Is it grounded in the provided context? Yes/No”). Use for subjective qualities like helpfulness, tone and faithfulness.
3. Human review - for a small, high-value slice, or to calibrate your LLM-judge. Don’t scale this; use it to validate the cheaper methods.
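
The judge and calibration steps might look something like this. It is a sketch only: `call_model` is a hypothetical function you supply that sends a prompt to whichever model you use and returns its text reply, and the prompt wording is an assumption:

```python
from typing import Callable, Sequence

# Assumed rubric prompt; adapt the wording to your own rubric.
JUDGE_PROMPT = (
    "You are grading an answer.\n"
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Is the answer grounded in the provided context? Reply with Yes or No."
)


def judge_grounded(call_model: Callable[[str], str], context: str, answer: str) -> bool:
    """Ask the judge model a Yes/No rubric question and parse its reply."""
    reply = call_model(JUDGE_PROMPT.format(context=context, answer=answer))
    verdict = reply.strip().lower()
    if verdict.startswith("yes"):
        return True
    if verdict.startswith("no"):
        return False
    # Refuse to guess when the judge does not follow the format.
    raise ValueError(f"Unparseable judge reply: {reply!r}")


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of cases where the judge and the human reviewer agree."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("Need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Running `agreement_rate` on the small human-reviewed slice gives you a number to watch: if the judge disagrees with your reviewers often, fix the rubric before trusting it at scale.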

Wire it into CI

An eval set that runs manually gets skipped. Make it a gate:

on:  prompt change | model change | retrieval change
run: eval suite over the dataset
fail the build if:
  - pass rate drops below threshold, or
  - any "critical" case regresses

Treat it exactly like a unit-test suite - because that’s what it is, for a non-deterministic system. See LLM Evaluation for the broader practice, and the Tools directory for platforms (Braintrust, LangSmith, Giskard, OpenAI Evals) that run the suite for you.
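
The gate logic itself is small. Here is one possible sketch, assuming you store per-case pass/fail results keyed by a case ID and keep a separate set of IDs you consider critical (the `gate` function, the ID scheme and the 0.9 default threshold are all assumptions):

```python
def gate(results, baseline, critical_ids, threshold=0.9):
    """Return a list of reasons to fail the build (empty list means pass).

    results / baseline: dict mapping case ID -> True if the case passed.
    critical_ids: IDs of cases that must never regress.
    """
    if not results:
        return ["no eval results"]

    reasons = []
    pass_rate = sum(results.values()) / len(results)
    if pass_rate < threshold:
        reasons.append(
            f"pass rate {pass_rate:.0%} is below threshold {threshold:.0%}"
        )

    # A regression is a case that passed in the baseline and fails now.
    for case_id in sorted(critical_ids):
        if baseline.get(case_id, False) and not results.get(case_id, False):
            reasons.append(f"critical case {case_id} regressed")
    return reasons
```

In a CI job you would print the reasons and exit with a non-zero status whenever the list is non-empty, which is what makes the build fail.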

Common pitfalls

Over-fitting to the eval set. If you only ever tune against it, you’ll game it. Keep adding fresh production cases.
All happy-path cases. Your eval set should be mostly the hard and adversarial cases - that’s where regressions hide.
One global pass rate. Slice by risk_category and intent; an 85% average can hide a 40% pass rate on the category that matters most.
No baseline. Record the current score before you change anything, or you can’t tell whether a change helped.
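
To see how a global number can mislead, here is a small worked example with made-up data: 100 cases, of which 10 belong to a hypothetical "refund_policy" category (the category names and counts are illustrative only):

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (risk_category, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: p / n for category, (p, n) in totals.items()}


# Made-up results: 4 of 10 refund cases pass, 81 of 90 general cases pass.
results = [("refund_policy", i < 4) for i in range(10)]
results += [("general", i < 81) for i in range(90)]

overall = sum(passed for _, passed in results) / len(results)
by_category = pass_rate_by_category(results)

print(f"overall: {overall:.0%}")  # overall: 85%
for category, rate in by_category.items():
    print(f"{category}: {rate:.0%}")  # refund_policy: 40%, general: 90%
```

The overall figure looks healthy while the refund category fails more often than it passes, which is the pattern a single pass rate hides.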

Your afternoon plan

1. Export 30 recent production cases, weighted toward failures.
2. For each, write expected_behavior and the include/exclude assertions.
3. Run them against your current prompt - that’s your baseline.
4. Add the suite to your change process so it runs on the next edit.
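
The baseline step could be as simple as the sketch below. It assumes `generate` is a function you provide that takes an input string and returns your system's answer with the current prompt, and that literal phrase matching is enough for your assertions. The file layout and case IDs are arbitrary choices:

```python
import json
from pathlib import Path


def run_baseline(rows, generate, path):
    """Run every row through `generate` and save per-case results as JSON."""
    outcomes = {}
    for i, row in enumerate(rows):
        answer = generate(row["input"]).lower()
        has_required = all(
            p.lower() in answer for p in row.get("must_include", [])
        )
        has_banned = any(
            p.lower() in answer for p in row.get("must_not_include", [])
        )
        outcomes[f"case-{i:03d}"] = has_required and not has_banned

    pass_rate = sum(outcomes.values()) / len(outcomes) if outcomes else 0.0
    # Keep the file in version control so later runs have something to compare to.
    Path(path).write_text(
        json.dumps({"pass_rate": pass_rate, "cases": outcomes}, indent=2)
    )
    return pass_rate
```

Saving per-case outcomes, not just the score, is what lets a later run tell you which specific cases regressed.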

That’s a real eval set. From here, every prompt and model change becomes a measured decision instead of a guess. When you’re ready to see where else you stand, take the Maturity Score or work through the Production Checklist.
