RAG operations

RAGOps: How to monitor retrieval quality

Most "the model is wrong" bugs are retrieval bugs. A production guide to measuring retrieval quality, tuning chunking and embeddings, keeping the index fresh, and failing safe when nothing matches.

May 30, 2026 · updated June 8, 2026 · 10 min read · rag · retrieval · embeddings

Retrieval-augmented generation grounds a model’s answer in documents you retrieve at query time - the pattern introduced by Lewis et al. (cited below). In production it fails in a specific, recurring way: the model gets blamed for what is really a retrieval problem. RAGOps is the practice of measuring and maintaining the retrieval half of the system.

Everything before Generate is retrieval - and that is where most RAG bugs live. The rest of this guide measures and hardens each stage.

The two questions

Prototype: “Does RAG work on the query I just tried?”
Production: “Is the right context retrieved, as measured across the whole query set, and is the index still true?”

A demo retrieves for one question. Production has to retrieve the right context across every query, and keep the corpus current as reality changes underneath it.

What breaks in production

  • Retrieval bugs blamed on the model. The answer is wrong because the wrong chunks were fetched (or the right ones were missing) - no model swap fixes that.
  • Stale index, confident answers. The corpus moved on; the system serves outdated policy or pricing with full confidence.
  • Bad chunking. A chunk boundary splits a key fact across two chunks, so no single retrieved chunk contains the whole fact.
  • Embedding drift. A new embedding model or shifting data changes what’s “similar,” silently degrading recall.
  • No fallback. Nothing relevant matches, so the model hallucinates instead of saying “I don’t have that.”
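For the bad-chunking failure above, one common mitigation is to let neighbouring chunks overlap, so a fact that straddles a boundary still appears whole in at least one chunk. The sketch below is a minimal word-window version; the function name, the word-based splitting and the default sizes are illustrative assumptions, not a recommendation for any particular corpus.

```python
def chunk_with_overlap(text, size=200, overlap=50):
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # this window already reached the end
            break
    return chunks
```

Whether overlap helps, and how much, is something to check against the labelled eval set rather than assume.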

Measure retrieval, not just answers

Separate two questions and measure them independently:

  1. Did we retrieve the right context? - retrieval quality.
  2. Did the model use it well? - generation quality (faithfulness).

For retrieval, score against a small labelled set (query → the doc ids that should be retrieved):

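def recall_at_k(eval_set, retrieve, k=5):
    """Share of queries whose top-k results contain at least one relevant doc."""
    if not eval_set:
        raise ValueError("eval_set is empty")  # avoid dividing by zero below
    hits = 0
    for case in eval_set:                     # case = {"query", "relevant_ids"}
        got = {d.id for d in retrieve(case["query"], k=k)}
        if got & set(case["relevant_ids"]):   # any overlap counts as a hit
            hits += 1
    return hits / len(eval_set)                # share of queries that found a relevant doc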

Now chunking strategy, embedding model and k become evaluated choices, not guesses - compare them by their effect on recall_at_k before you touch the prompt.
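Note that recall_at_k counts a query as a hit as soon as one relevant id appears in the top k, which makes it closer to a hit rate. If you also want to know how many of the relevant docs came back (recall) and how much of the top k was relevant (precision), a per-query sketch might look like this. The function names and the `retrieve_ids(query, k)` callable are hypothetical, introduced only for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Return (precision, recall) for one query over the top-k retrieved ids."""
    top_k = list(retrieved_ids)[:k]
    relevant = set(relevant_ids)
    found = {doc_id for doc_id in top_k if doc_id in relevant}
    precision = len(found) / len(top_k) if top_k else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall(eval_set, retrieve_ids, k=5):
    """Average both metrics over the eval set.

    retrieve_ids(query, k) is a hypothetical callable returning ranked doc ids.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    pairs = [
        precision_recall_at_k(retrieve_ids(case["query"], k), case["relevant_ids"], k)
        for case in eval_set
    ]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

Running this once per candidate configuration (chunk size, embedding model, k) gives comparable numbers for each choice.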

Fail safe when nothing matches

Most “confident hallucination” in RAG is an empty- or weak-retrieval problem: nothing relevant came back, and the model answered anyway. This is the grounding gate in the diagram above - gate on the retrieval score:

docs = retrieve(query, k=5)                  # assumed sorted best-first
if not docs or docs[0].score < MIN_SCORE:    # assumes higher score = more similar; flip for distances
    return fallback()        # ask to rephrase, or hand off - don't invent
context = format_context(docs)

Log top_score, retrieved ids and an index_lag signal on every request so you can see retrieval health, not just final answers.
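As a rough sketch of what that per-request log line could contain, the example below builds one JSON record with the standard library. The `Doc` class stands in for whatever your retriever returns, the field names are illustrative, and treating index_lag as "seconds since the index last synced" is an assumption; use whatever definition your pipeline can actually measure.

```python
import json
import logging
import time
from dataclasses import dataclass

logger = logging.getLogger("rag.retrieval")  # logger name is an arbitrary choice


@dataclass
class Doc:  # stand-in for the retriever's result type
    id: str
    score: float


def retrieval_log_record(query_id, docs, index_synced_at, now=None):
    """Build one structured record per request. Timestamps are epoch seconds."""
    now = time.time() if now is None else now
    return {
        "query_id": query_id,
        "top_score": max((d.score for d in docs), default=None),  # None if nothing matched
        "retrieved_ids": [d.id for d in docs],
        "index_lag_s": now - index_synced_at,  # assumed definition of index_lag
    }


def log_retrieval(query_id, docs, index_synced_at):
    """Emit the record as a single JSON line and return it."""
    record = retrieval_log_record(query_id, docs, index_synced_at)
    logger.info(json.dumps(record))
    return record
```

Logging ids rather than chunk text keeps the records small and lets you join them against the eval set later.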

Keep the index fresh

Retrieval quality asks “did we fetch the right context?”; freshness asks “is it still true?” Both need monitoring - a refresh schedule, index-lag and chunk-age signals, deletion propagation, and drift detection. The RAG freshness monitoring checklist covers this field by field.
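To make those signals concrete, here is a minimal sketch of three of them: index lag, chunk age, and deletion propagation. All function names, and the idea that you can get a last-modified time for the source and an indexed-at time per chunk, are assumptions for illustration; real pipelines expose these differently.

```python
from datetime import datetime, timedelta, timezone


def index_lag(source_last_modified, index_last_synced):
    """How far the index trails the source; zero when the index is up to date."""
    return max(source_last_modified - index_last_synced, timedelta(0))


def stale_chunks(chunk_indexed_at, max_age, now=None):
    """Ids of chunks indexed longer ago than max_age.

    chunk_indexed_at maps chunk id -> timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(cid for cid, ts in chunk_indexed_at.items() if now - ts > max_age)


def orphaned_ids(index_ids, source_ids):
    """Ids still in the index whose source document no longer exists."""
    return sorted(set(index_ids) - set(source_ids))
```

A non-empty result from `orphaned_ids` means a deletion in the source has not reached the index yet, which is the case where the system keeps answering from a document that no longer exists.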

Minimal vs mature

Aspect | Minimal | Production-grade
Retrieval | Top-k vector search | Hybrid search + reranker
Quality | Judge final answers | Measure recall/precision separately
Chunking | Fixed guess | Evaluated against the eval set
Freshness | Manual re-index | Scheduled + lag/drift alerts
No match | Model guesses | Safe fallback on low score
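The "hybrid search" row needs some way to merge a keyword ranking with a vector ranking. One widely used option, not necessarily the one this guide has in mind, is reciprocal rank fusion: each doc scores the sum of 1 / (k + rank) over the lists it appears in. The sketch below assumes each input is a list of doc ids ordered best-first; the function name is illustrative and k=60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first id lists into one ranking.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because it uses only ranks, this avoids having to put keyword scores and vector similarities on a common scale. A reranker would then reorder the fused top results.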

Where this lives in a real system

See the RAG chatbot reference architecture for where the retriever, reranker and grounding check sit, the RAG / vector tools for the infrastructure, and the RAG items in the Production Checklist for the bar to clear.
