How to calculate hallucination rate

A practical method for measuring LLM hallucination (faithfulness) rate in production - how to define it, sample it, judge it, and track it over time.

“The model hallucinates” is an anecdote. “Our hallucination rate is 3.1% and rose to 5.4% after the last prompt change” is something you can act on. Here’s how to turn the anecdote into a number.

Define it precisely first

Hallucination rate only means something once you pin down what counts. The most useful, measurable definition in production is faithfulness:

An output is unfaithful if it asserts something not supported by the source material it was given (retrieved context, tools, or the prompt).

This is sharper than “is it true?” - you’re measuring grounding, not omniscience. For a RAG system that’s exactly the right question. Decide up front whether you’re scoring per response or per claim (claim-level is stricter and more informative).
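
To make the per-response versus per-claim choice concrete, here is a minimal sketch. The data layout (one list of booleans per response, True meaning the claim is supported) and both function names are assumptions for illustration, not a fixed schema.

```python
def response_level_rate(judged: list[list[bool]]) -> float:
    """Share of responses that contain at least one unsupported claim."""
    if not judged:
        raise ValueError("no judged responses")
    unfaithful = sum(1 for claims in judged if not all(claims))
    return unfaithful / len(judged)


def claim_level_rate(judged: list[list[bool]]) -> float:
    """Share of individual claims that are unsupported."""
    total = sum(len(claims) for claims in judged)
    if total == 0:
        raise ValueError("no claims to score")
    unsupported = sum(1 for claims in judged for ok in claims if not ok)
    return unsupported / total


# Hypothetical judged sample: each inner list is one response's claims.
judged = [
    [True, True, True],
    [True, False],
    [True],
    [False, False, True, True],
]
print(response_level_rate(judged))  # 0.5 (2 of 4 responses have a bad claim)
print(claim_level_rate(judged))     # 0.3 (3 of 10 claims are unsupported)
```

The two numbers answer different questions and are not interchangeable, so pick one, name it in your dashboards, and report it consistently.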

The measurement loop

1. Sample N responses from production (with their retrieved context)
2. For each, judge: is every claim supported by the context?
3. hallucination_rate = unsupported_responses / N
4. Track it over time and slice by feature / risk_category

1. Sample representatively

Pull a random sample from real traces - say 100–200 responses - including the context the model actually saw. Random matters: cherry-picked samples flatter you.
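
A seeded uniform draw keeps the sample both random and reproducible. This is a sketch only: the trace fields (id, response, context) and the sample_traces helper are hypothetical stand-ins for whatever your tracing store returns.

```python
import random


def sample_traces(traces: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Uniform random sample without replacement, reproducible via the seed."""
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))


# Hypothetical production traces, each carrying the context the model saw.
traces = [
    {"id": i, "response": f"answer {i}", "context": f"docs {i}"}
    for i in range(1000)
]
batch = sample_traces(traces, n=150, seed=42)
print(len(batch))  # 150
```

Recording the seed next to the results lets someone else re-draw the same sample later.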

2. Judge with a rubric

For each response, the judge (human or LLM-as-judge) answers a narrow question:

{
  "response": "...",
  "context": "...the documents the model was given...",
  "verdict": "supported | partially_supported | unsupported",
  "unsupported_claims": ["specific claim not in context"]
}

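An LLM-judge prompt that works: “Given ONLY the context below, is every factual claim in the response supported by it? List any claim that is not.” Keep the judge’s job narrow and grounded - don’t ask it to assess truth, only support. If you record partially_supported as a verdict, count it as unfaithful in the headline rate, because the question is whether every claim is supported.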

In code, that judge is small - return structured JSON so you can aggregate it:

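import json

from openai import OpenAI  # pip install openai

# Reads OPENAI_API_KEY from the environment.
client = OpenAI()

def faithfulness_judge(answer: str, context: str) -> dict:
    """Ask an LLM whether every claim in `answer` is supported by `context`."""
    prompt = (
        "Using ONLY the context, is every claim in the answer supported?\n"
        'Reply JSON: {"verdict": "supported|unsupported", "unsupported": [...]}\n\n'
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        # JSON mode: the reply is a single JSON object we can parse directly.
        response_format={"type": "json_object"},
    )
    return json.loads(out.choices[0].message.content)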

# hallucination_rate = unsupported_count / judged_count
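
Aggregating the verdicts is a one-liner, but with only 100–200 samples the rate carries real uncertainty. The sketch below adds a Wilson score interval, a standard choice for proportions that is not part of the original recipe. It also assumes that any verdict other than "supported" counts as unfaithful; the function names are illustrative.

```python
import math


def hallucination_rate(verdicts: list[str]) -> float:
    """unsupported_count / judged_count, treating non-"supported" as unfaithful."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    unsupported = sum(1 for v in verdicts if v != "supported")
    return unsupported / len(verdicts)


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k failures in n trials (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical sample of 200 judged responses, 6 of them not fully supported.
verdicts = ["supported"] * 194 + ["partially_supported"] * 2 + ["unsupported"] * 4
rate = hallucination_rate(verdicts)
low, high = wilson_interval(k=6, n=len(verdicts))
print(rate)                          # 0.03
print(f"{low:.3f} to {high:.3f}")    # 0.014 to 0.064
```

A 3% point estimate from 200 samples is compatible with anything from about 1.4% to 6.4%, so treat small movements between runs as noise until the sample is large enough to separate them.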

3. Calibrate the judge

Before you trust the LLM-judge, have a human label ~30 of the same cases and compare verdicts. If they agree on most cases, automate; if not, read the disagreements and tighten the rubric. Recalibrate whenever you change the judge model or the judge prompt.
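
One way to put a number on "agree" is raw agreement plus Cohen's kappa, which corrects for the agreement you would get by chance when most answers are supported. Kappa is an addition here rather than something the recipe requires, and the labels below are made-up calibration data.

```python
def agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where the human and the judge gave the same verdict."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement(human, judge)
    labels = set(human) | set(judge)
    p_e = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical calibration set of 30 cases.
human = ["supported"] * 24 + ["unsupported"] * 6
judge = (
    ["supported"] * 22 + ["unsupported"] * 2   # judge flags 2 the human accepted
    + ["unsupported"] * 5 + ["supported"] * 1  # judge misses 1 the human flagged
)
print(agreement(human, judge))               # 0.9
print(round(cohens_kappa(human, judge), 3))  # 0.706
```

Here 90% raw agreement shrinks to a kappa of about 0.71, because two raters who both say "supported" most of the time would agree often by chance alone.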

4. Compute and slice

Headline number aside, slice by feature and risk_category. A 3% overall rate can hide a 15% rate on the one workflow where a wrong answer is expensive.
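
Slicing is a group-by over the judged records. In this sketch the record fields and the two feature names (search, refunds) are hypothetical; the counts are chosen to reproduce the 3% overall versus 15% in one slice pattern described above.

```python
from collections import defaultdict


def rate_by(records: list[dict], key: str) -> dict[str, float]:
    """Unfaithful share per slice, where a slice is each distinct value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [unsupported, judged]
    for r in records:
        slot = counts[r[key]]
        slot[0] += int(r["verdict"] != "supported")
        slot[1] += 1
    return {name: bad / total for name, (bad, total) in counts.items()}


# Hypothetical judged sample: 80 search answers, 20 refund answers.
records = (
    [{"feature": "search", "verdict": "supported"}] * 80
    + [{"feature": "refunds", "verdict": "supported"}] * 17
    + [{"feature": "refunds", "verdict": "unsupported"}] * 3
)
overall = sum(r["verdict"] != "supported" for r in records) / len(records)
print(overall)                      # 0.03
print(rate_by(records, "feature"))  # {'search': 0.0, 'refunds': 0.15}
```

Small slices have few samples, so a high rate in one of them is a prompt to sample more from that slice before acting on it.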

Turn it into a guardrail

Once you can measure it, you can defend it:

Gate releases on it - block a prompt or model change that pushes the rate up, the same way you’d block on a failing test.
Add a runtime grounding check - a lightweight output check that flags unsupported answers before they reach the user, logged as grounded_answer: false.
Alert when the production rate drifts, which often signals a retrieval or freshness problem upstream rather than the model itself.
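
The first two guardrails can be sketched in a few lines. Everything here is an assumption made for illustration: the tolerance value, the function names, and the stub judge, which stands in for whatever callable returns a verdict dict in your system.

```python
from typing import Callable


def release_gate(baseline_rate: float, candidate_rate: float,
                 tolerance: float = 0.01) -> bool:
    """True if the candidate's rate is within `tolerance` of the baseline."""
    return candidate_rate <= baseline_rate + tolerance


def grounding_check(answer: str, context: str,
                    judge: Callable[[str, str], dict]) -> dict:
    """Run the judge at request time and build a log record for the answer."""
    result = judge(answer, context)
    return {
        "grounded_answer": result.get("verdict") == "supported",
        "unsupported": result.get("unsupported", []),
    }


def stub_judge(answer: str, context: str) -> dict:
    """Toy judge for demonstration only: supported iff the answer appears in the context."""
    ok = answer in context
    return {"verdict": "supported" if ok else "unsupported",
            "unsupported": [] if ok else [answer]}


print(release_gate(baseline_rate=0.031, candidate_rate=0.054))  # False, block it
print(grounding_check("Refunds take 5 days", "Refunds take 5 days.", stub_judge))
print(grounding_check("Refunds are instant", "Refunds take 5 days.", stub_judge))
```

In CI, a False from release_gate would fail the job; at runtime, a record with grounded_answer set to false is what you would log and optionally use to withhold or caveat the answer.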

Don’t chase zero

Some unfaithfulness is unavoidable, and the cost of driving it to zero (refusing more, retrieving more, slower, pricier) may not be worth it. Pick a target appropriate to the risk: near-zero for regulated or medical use, more relaxed for low-stakes drafting. The point isn’t a perfect score - it’s a number you watch, gate on, and improve deliberately. Build the sampling into your eval dataset and it becomes part of every release.
