
How to build your first eval dataset

A practical, step-by-step guide to building an LLM eval dataset from real traffic - what a row looks like, how to score it, how many cases you need, and how to wire it into CI.

June 20, 2026 · 8 min read · evaluation · how-to · evals

An eval dataset is one of the highest-leverage things you can build in LLMOps, and the one teams most often put off. The good news: your first useful version takes an afternoon, not a sprint. Here is how to build it.

Start from real failures, not imagination

The best eval cases are the failures you’ve already seen. Don’t sit down and invent test questions - mine them:

  • Pull recent production traces and filter for thumbs-down, escalations, retries or low-confidence answers.
  • Grab the handful of bug reports where someone said “the bot got this wrong.”
  • Ask support or sales for the questions that trip it up.

Twenty real, painful cases beat two hundred imagined ones.
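The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not a real logging schema: the Trace fields (feedback, escalated, retries, confidence) and the 0.5 confidence cut-off are assumptions you would replace with whatever your tracing tool actually exports.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical trace shape: these field names are assumptions for illustration.
@dataclass
class Trace:
    trace_id: str
    user_input: str
    feedback: Optional[str] = None      # "up", "down", or None if no rating
    escalated: bool = False             # handed off to a human
    retries: int = 0                    # times the user re-asked
    confidence: Optional[float] = None  # None if the system reports no score


def is_candidate(trace: Trace, min_confidence: float = 0.5) -> bool:
    """Return True if the trace shows any of the failure signals listed above."""
    low_confidence = trace.confidence is not None and trace.confidence < min_confidence
    return (
        trace.feedback == "down"
        or trace.escalated
        or trace.retries > 0
        or low_confidence
    )


def mine_candidates(traces: list[Trace], min_confidence: float = 0.5) -> list[Trace]:
    """Keep only the traces worth turning into eval cases."""
    return [t for t in traces if is_candidate(t, min_confidence)]


traces = [
    Trace("t1", "Can I get a refund after 45 days?", feedback="down"),
    Trace("t2", "What are your opening hours?", feedback="up", confidence=0.93),
    Trace("t3", "Cancel my order", escalated=True),
    Trace("t4", "Where is my invoice?", confidence=0.31),
]
candidates = mine_candidates(traces)
print([t.trace_id for t in candidates])  # ['t1', 't3', 't4']
```

The output is a shortlist for a person to review, not a finished dataset: each candidate still needs an expected behaviour and assertions written by someone who knows the right answer.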

What a row looks like

Keep each row declarative - the input, the expected behaviour, and machine-checkable assertions. A format that works well:

{
  "input": "Can I get a refund after 45 days?",
  "expected_behavior": "Answer using refund policy only",
  "must_include": ["policy window", "support escalation"],
  "must_not_include": ["invented exceptions"],
  "risk_category": "customer_support_policy"
}

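The must_include / must_not_include lists are the trick: they turn a fuzzy “is this answer good?” into concrete, automatable checks you can run on every change. For that to work, each entry has to be something a program can test, such as a literal phrase or a regex; the entries above are shorthand for concepts, so replace them with the exact wording your policy uses, or leave concept-level checks to an LLM judge. risk_category lets you slice results later (“how are we doing on policy questions?”).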
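As a rough sketch of how such a row can be checked automatically, the function below treats every list entry as a literal phrase and matches it case-insensitively. That matching rule is an assumption, and the phrases in the example row ("30 days", "contact support", "as an exception") are hypothetical stand-ins for your real policy wording.

```python
def check_row(row: dict, answer: str) -> dict:
    """Check an answer against a row's include/exclude lists.

    Assumes each list entry is a literal phrase; matching is a
    case-insensitive substring test.
    """
    text = answer.lower()
    missing = [s for s in row.get("must_include", []) if s.lower() not in text]
    forbidden = [s for s in row.get("must_not_include", []) if s.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,      # required phrases that did not appear
        "forbidden": forbidden,  # banned phrases that did appear
    }


# Hypothetical row using literal phrases instead of concept labels.
row = {
    "input": "Can I get a refund after 45 days?",
    "expected_behavior": "Answer using refund policy only",
    "must_include": ["30 days", "contact support"],
    "must_not_include": ["as an exception"],
    "risk_category": "customer_support_policy",
}

good = "Refunds are available within 30 days of purchase. For older orders, please contact support."
bad = "Sure, as an exception we can refund you after 45 days."

good_result = check_row(row, good)
bad_result = check_row(row, bad)
print(good_result["passed"])  # True
print(bad_result)
```

Returning the missing and forbidden phrases, rather than a bare pass/fail, makes a failing case much quicker to debug.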

How many cases?

Enough to be representative, not exhaustive. A useful progression:

  • 20–50 cases to start - covering your top intents and known failure modes.
  • ~100–200 once you’re gating releases on it.
  • Grow it every time production surfaces a new failure. The eval set is a living artifact, not a one-time deliverable.

Bias toward diversity over volume: one example each of ten different failure shapes is worth more than fifty variations of the same one.

Three ways to score

  1. Deterministic checks - must_include / must_not_include, regex, JSON schema validation, exact match for structured outputs. Cheap, fast, no model needed. Use these wherever you can.
  2. LLM-as-judge - a second model scores the answer against a rubric (“Is it grounded in the provided context? Yes/No”). Use for subjective qualities like helpfulness, tone and faithfulness.
  3. Human review - for a small, high-value slice, or to calibrate your LLM-judge. Don’t scale this; use it to validate the cheaper methods.
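Methods 2 and 3 can work together, as in this sketch: a judge prompt built from the rubric, a strict Yes/No parser, and an agreement rate against human labels for calibration. The call_model parameter is a placeholder for whichever model client you use, and fake_model is only a keyword-matching stub so the example runs offline; neither is a real library API.

```python
from typing import Callable, Optional

RUBRIC = "Is the answer grounded in the provided context? Reply with exactly Yes or No."


def build_judge_prompt(context: str, answer: str) -> str:
    return f"{RUBRIC}\n\nContext:\n{context}\n\nAnswer:\n{answer}\n\nVerdict:"


def parse_verdict(raw: str) -> Optional[bool]:
    """Map the judge's reply to True/False; None means the reply was unparseable."""
    words = raw.strip().lower().split()
    if not words:
        return None
    token = words[0].strip(".,!:;\"'")
    if token == "yes":
        return True
    if token == "no":
        return False
    return None


def judge(context: str, answer: str, call_model: Callable[[str], str]) -> Optional[bool]:
    """Score one answer; call_model is any function mapping a prompt to a reply."""
    return parse_verdict(call_model(build_judge_prompt(context, answer)))


def agreement_rate(judge_labels: list[Optional[bool]], human_labels: list[bool]) -> float:
    """Share of cases where the judge matches the human label.

    Unparseable verdicts (None) count as disagreements.
    """
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def fake_model(prompt: str) -> str:
    # Stub standing in for a real model call: says "Yes" only if the
    # answer section mentions the 30-day window from the context.
    answer_part = prompt.split("Answer:")[1]
    return "Yes." if "30 days" in answer_part else "No"


context = "Refunds are available within 30 days of purchase."
answers = [
    "You can get a refund within 30 days.",
    "Refunds are available for a full year.",
]
human_labels = [True, False]

judge_labels = [judge(context, a, fake_model) for a in answers]
print(judge_labels, agreement_rate(judge_labels, human_labels))
```

If agreement with your human-reviewed slice is low, revise the rubric before trusting the judge's scores on the rest of the dataset.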

Wire it into CI

An eval set that runs manually gets skipped. Make it a gate:

on:  prompt change | model change | retrieval change
run: eval suite over the dataset
fail the build if:
  - pass rate drops below threshold, or
  - any "critical" case regresses

Treat it exactly like a unit-test suite - because that’s what it is, for a non-deterministic system. See LLM Evaluation for the broader practice, and the Tools directory for platforms (Braintrust, LangSmith, Giskard, OpenAI Evals) that run the suite for you.
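If you are not using one of those platforms, the gate itself is small. The sketch below assumes results are stored as a mapping from case ID to pass/fail, and that a case "regresses" when it passed in the baseline run and fails now; the case IDs and the 75% threshold are made up for the example.

```python
def gate(
    results: dict[str, bool],
    baseline: dict[str, bool],
    critical: set[str],
    min_pass_rate: float = 0.9,
) -> list[str]:
    """Return the reasons to fail the build; an empty list means the gate passes."""
    if not results:
        return ["no eval results were produced"]

    reasons = []
    pass_rate = sum(results.values()) / len(results)
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.0%} is below threshold {min_pass_rate:.0%}")

    # A critical case regresses if it passed before and does not pass now.
    for case_id in sorted(critical):
        if baseline.get(case_id, False) and not results.get(case_id, False):
            reasons.append(f"critical case {case_id} regressed")
    return reasons


# Hypothetical case IDs and outcomes.
baseline = {"refund-45d": True, "cancel-order": True, "hours": True, "invoice": False}
results = {"refund-45d": False, "cancel-order": True, "hours": True, "invoice": True}

reasons = gate(results, baseline, critical={"refund-45d"}, min_pass_rate=0.75)
for reason in reasons:
    print("FAIL:", reason)

# In CI, pass this to sys.exit() so a non-zero code fails the build.
exit_code = 1 if reasons else 0
```

Here the overall pass rate is unchanged at 75%, yet the build still fails because a critical case regressed, which is exactly why the second condition exists.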

Common pitfalls

  • Over-fitting to the eval set. If you only ever tune against it, you’ll game it. Keep adding fresh production cases.
  • All happy-path cases. Your eval set should be mostly the hard and adversarial cases - that’s where regressions hide.
  • One global pass rate. Slice by risk_category and intent; an 85% average can hide a 40% pass rate on the category that matters most.
  • No baseline. Record the current score before you change anything, or you can’t tell whether a change helped.
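The slicing pitfall is easy to see with numbers. This sketch groups results by risk_category and uses made-up counts chosen to reproduce the 85% versus 40% situation described above; the result shape (a dict with risk_category and passed) is an assumption.

```python
from collections import defaultdict


def pass_rates_by(results: list[dict], key: str = "risk_category") -> dict[str, float]:
    """Compute the pass rate for each value of the given key."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        passed[r[key]] += int(r["passed"])
    return {k: passed[k] / totals[k] for k in totals}


# Illustrative data: 90 general cases (81 pass) and 10 policy cases (4 pass).
results = (
    [{"risk_category": "general", "passed": True}] * 81
    + [{"risk_category": "general", "passed": False}] * 9
    + [{"risk_category": "customer_support_policy", "passed": True}] * 4
    + [{"risk_category": "customer_support_policy", "passed": False}] * 6
)

overall = sum(r["passed"] for r in results) / len(results)
by_category = pass_rates_by(results)

print(f"overall: {overall:.0%}")  # overall: 85%
for category, rate in sorted(by_category.items()):
    print(f"{category}: {rate:.0%}")
```

A practical consequence is to set thresholds per category rather than one global number, so the category that matters most cannot be averaged away.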

Your afternoon plan

  1. Export 30 recent production cases, weighted toward failures.
  2. For each, write expected_behavior and the include/exclude assertions.
  3. Run them against your current prompt - that’s your baseline.
  4. Add the suite to your change process so it runs on the next edit.
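Steps 3 and 4 come down to saving one run and comparing the next run against it. Here is a minimal sketch, assuming results are a mapping from case ID to pass/fail and the baseline lives in a JSON file; the file name and case IDs are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(results: dict[str, bool], path: Path) -> None:
    """Write the current pass/fail results to disk as the baseline."""
    path.write_text(json.dumps(results, indent=2, sort_keys=True))


def diff_against_baseline(results: dict[str, bool], path: Path) -> dict[str, list[str]]:
    """Compare a new run with the saved baseline, case by case."""
    baseline = json.loads(path.read_text())
    return {
        "regressed": sorted(c for c, ok in results.items() if baseline.get(c) is True and not ok),
        "fixed": sorted(c for c, ok in results.items() if baseline.get(c) is False and ok),
        "new": sorted(c for c in results if c not in baseline),
    }


# A temporary directory keeps the example self-contained; in practice the
# baseline file would be committed alongside the dataset.
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"

    # Step 3: run against the current prompt and record the outcome.
    save_baseline({"refund-45d": False, "hours": True}, baseline_path)

    # Step 4: after the next edit, run again and compare.
    diff = diff_against_baseline(
        {"refund-45d": True, "hours": False, "invoice": True}, baseline_path
    )

print(diff)
```

Reporting regressed and fixed cases separately matters because a change that fixes one case while breaking another leaves the pass rate unchanged.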

That’s a real eval set. From here, every prompt and model change becomes a measured decision instead of a guess. When you’re ready to see where else you stand, take the Maturity Score or work through the Production Checklist.
