How to calculate hallucination rate
A practical method for measuring LLM hallucination (faithfulness) rate in production - how to define it, sample it, judge it, and track it over time.
“The model hallucinates” is an anecdote. “Our hallucination rate is 3.1% and rose to 5.4% after the last prompt change” is something you can act on. Here’s how to turn the anecdote into a number.
Define it precisely first
Hallucination rate only means something once you pin down what counts. The most useful, measurable definition in production is faithfulness:
An output is unfaithful if it asserts something not supported by the source material it was given (retrieved context, tools, or the prompt).
This is sharper than “is it true?” - you’re measuring grounding, not omniscience. For a RAG system that’s exactly the right question. Decide up front whether you’re scoring per response or per claim (claim-level is stricter and more informative).
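To make the per-response versus per-claim choice concrete, here is a minimal sketch. The data layout (one list of per-claim booleans for each judged response) is an assumption made for illustration, not a format this method requires.

```python
# Hypothetical judged data: one inner list per response, one bool per claim
# (True = the claim is supported by the context the model was given).
judged = [
    [True, True, True],        # fully supported
    [True, False],             # one unsupported claim
    [True, True, True, True],  # fully supported
    [False, False, True],      # two unsupported claims
]


def response_level_rate(judged):
    """Share of responses that contain at least one unsupported claim."""
    bad = sum(1 for claims in judged if not all(claims))
    return bad / len(judged)


def claim_level_rate(judged):
    """Share of individual claims that are unsupported."""
    total = sum(len(claims) for claims in judged)
    unsupported = sum(1 for claims in judged for ok in claims if not ok)
    return unsupported / total


print(f"per response: {response_level_rate(judged):.1%}")
print(f"per claim:    {claim_level_rate(judged):.1%}")
```

The two rates use different denominators (responses versus claims), so they are not interchangeable. Whichever you pick, state it next to the number.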
The measurement loop
1. Sample N responses from production (with their retrieved context)
2. For each, judge: is every claim supported by the context?
3. hallucination_rate = unsupported_responses / N
4. Track it over time and slice by feature / risk_category
1. Sample representatively
Pull a random sample from real traces - say 100–200 responses - including the context the model actually saw. Random matters: cherry-picked samples flatter you.
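A reproducible random draw could look like the sketch below. The trace fields (id, response, context) and the default sample size are assumptions; adapt them to whatever your tracing store actually returns.

```python
import random


def sample_traces(traces, n=150, seed=42):
    """Draw a uniform random sample of traces, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    # Only traces that kept their retrieved context can be judged for support.
    usable = [t for t in traces if t.get("context")]
    return rng.sample(usable, min(n, len(usable)))


# Hypothetical trace records standing in for real production logs.
traces = [
    {"id": i, "response": f"answer {i}", "context": f"docs for {i}"}
    for i in range(1000)
]
sample = sample_traces(traces, n=150)
print(len(sample))
```

If many traces are dropped for missing context, that is worth logging on its own, since the remaining sample may no longer represent production traffic.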
2. Judge with a rubric
For each response, the judge (human or LLM-as-judge) answers a narrow question:
{
  "response": "...",
  "context": "...the documents the model was given...",
  "verdict": "supported | partially_supported | unsupported",
  "unsupported_claims": ["specific claim not in context"]
}
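An LLM-judge prompt that works: “Given ONLY the context below, is every factual claim in the response supported by it? List any claim that is not.” Keep the judge’s job narrow and grounded - don’t ask it to assess truth, only support. A partially_supported response still contains at least one unsupported claim, so under the definition above it counts as unfaithful in a per-response rate.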
In code, that judge is small - return structured JSON so you can aggregate it:
import json

from openai import OpenAI  # assumption: the official openai Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def faithfulness_judge(answer: str, context: str) -> dict:
    """Ask the judge model whether every claim in `answer` is supported by `context`."""
    prompt = (
        "Using ONLY the context, is every claim in the answer supported?\n"
        'Reply JSON: {"verdict": "supported|unsupported", "unsupported": [...]}\n\n'
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # JSON mode, so the reply parses
    )
    return json.loads(out.choices[0].message.content)


# hallucination_rate = unsupported_count / judged_count
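With only 100 to 200 judged responses, the rate carries real sampling noise, so it can help to report an interval alongside it. The sketch below aggregates verdict strings and adds a Wilson score interval. Counting every verdict other than "supported" as unsupported is an assumption, and the verdict list is made-up data.

```python
import math


def hallucination_rate(verdicts):
    """Return (rate, unsupported_count, judged_count) from judge verdict strings."""
    judged = len(verdicts)
    # Assumption: anything that is not fully supported counts against the rate.
    unsupported = sum(1 for v in verdicts if v != "supported")
    return unsupported / judged, unsupported, judged


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k events in n trials (z=1.96 is about 95%)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Made-up verdicts: 5 unsupported responses out of 150 judged.
verdicts = ["supported"] * 145 + ["unsupported"] * 5
rate, k, n = hallucination_rate(verdicts)
low, high = wilson_interval(k, n)
print(f"rate={rate:.1%}, 95% interval about {low:.1%} to {high:.1%}")
```

At this sample size the interval spans several percentage points, so a one-point move between two samples is more likely noise than a regression. A larger sample narrows it.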
3. Calibrate the judge
Before you trust the LLM-judge, have a human label ~30 of the same cases and compare. If they agree most of the time, automate; if not, tighten the rubric. Recalibrate when you change the judge model.
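One way to put a number on "agree most of the time" is sketched below: raw agreement plus Cohen's kappa, which corrects for the agreement you would get by chance when most responses are supported anyway. The labels are made-up data, and which threshold counts as good enough is left to you.

```python
def agreement(human, judge):
    """Fraction of cases where the LLM judge matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Chance-corrected agreement: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    p_observed = agreement(human, judge)
    p_chance = sum(
        (human.count(label) / n) * (judge.count(label) / n)
        for label in set(human) | set(judge)
    )
    if p_chance == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up labels for 30 calibration cases (S = supported, U = unsupported).
human = ["S"] * 24 + ["U"] * 6
judge = ["S"] * 22 + ["U"] * 2 + ["U"] * 5 + ["S"] * 1

print(f"agreement={agreement(human, judge):.2f}")
print(f"kappa={cohens_kappa(human, judge):.2f}")

# The disagreements are the cases worth reading when tightening the rubric.
disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
print(disagreements)
```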
4. Compute and slice
Headline number aside, slice by feature and risk_category. A 3% overall rate can hide a 15% rate on the one workflow where a wrong answer is expensive.
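A minimal slicing sketch, assuming each judged record carries a verdict plus the field you want to slice on. The feature names and counts are invented to mirror the 3% versus 15% example.

```python
from collections import defaultdict


def rate_by_slice(records, key):
    """Hallucination rate per value of `key`, with each slice's sample size."""
    counts = defaultdict(lambda: [0, 0])  # slice value -> [unsupported, total]
    for record in records:
        bucket = counts[record[key]]
        bucket[1] += 1
        if record["verdict"] != "supported":
            bucket[0] += 1
    return {
        name: {"rate": bad / total, "n": total}
        for name, (bad, total) in counts.items()
    }


# Invented data: 100 judged responses across two hypothetical features.
records = (
    [{"feature": "refunds", "verdict": "unsupported"}] * 3
    + [{"feature": "refunds", "verdict": "supported"}] * 17
    + [{"feature": "general_qa", "verdict": "supported"}] * 80
)

overall = sum(r["verdict"] != "supported" for r in records) / len(records)
by_feature = rate_by_slice(records, "feature")
print(f"overall: {overall:.0%}")
for name, stats in by_feature.items():
    print(f"{name}: {stats['rate']:.0%} (n={stats['n']})")
```

Keep n next to each slice: a slice with only a handful of samples produces a rate too noisy to act on.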
Turn it into a guardrail
Once you can measure it, you can defend it:
- Gate releases on it - block a prompt or model change that pushes the rate up, the same way you’d block on a failing test.
- Add a runtime grounding check - a lightweight output check that flags unsupported answers before they reach the user, logged as grounded_answer: false.
- Alert when the production rate drifts, which often signals a retrieval or freshness problem upstream rather than the model itself.
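The release gate can be as small as a comparison against a baseline. In the sketch below, the function name and the one-point tolerance are hypothetical; the rates reuse the 3.1% and 5.4% figures from the introduction.

```python
def release_gate(baseline_rate, candidate_rate, max_increase=0.01):
    """Return True if the candidate may ship.

    The candidate's hallucination rate must not exceed the baseline by more
    than `max_increase` (absolute, so 0.01 means one percentage point).
    """
    return candidate_rate <= baseline_rate + max_increase


baseline = 0.031   # rate measured on the current production prompt
candidate = 0.054  # rate measured on the proposed prompt change

if release_gate(baseline, candidate):
    print("gate passed")
else:
    # In CI, this branch would fail the build, as a failing test would.
    print(f"gate failed: {candidate:.1%} vs baseline {baseline:.1%}")
```

The same comparison, run on a schedule against recent production samples, can drive the drift alert. Because small samples are noisy, a tolerance of zero would block releases on noise alone.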
Don’t chase zero
Some unfaithfulness is unavoidable, and the cost of driving it to zero (refusing more, retrieving more, slower, pricier) may not be worth it. Pick a target appropriate to the risk: near-zero for regulated or medical use, more relaxed for low-stakes drafting. The point isn’t a perfect score - it’s a number you watch, gate on, and improve deliberately. Build the sampling into your eval dataset and it becomes part of every release.