RAGOps: How to monitor retrieval quality
Most "the model is wrong" bugs are retrieval bugs. A production guide to measuring retrieval quality, tuning chunking and embeddings, keeping the index fresh, and failing safe when nothing matches.
Retrieval-augmented generation grounds a model’s answer in documents you retrieve at query time - the pattern introduced by Lewis et al. (cited below). In production it fails in a specific, recurring way: the model gets blamed for what is really a retrieval problem. RAGOps is the practice of measuring and maintaining the retrieval half of the system.
Everything before Generate is retrieval - and that is where most RAG bugs live. The rest of this guide measures and hardens each stage.
The two questions
Prototype: “Does RAG work on the query I just tried?” Production: “Is the right context retrieved, as measured rather than assumed, and is the index still true?”
A demo retrieves for one question. Production has to retrieve the right context across every query, and keep the corpus current as reality changes underneath it.
What breaks in production
- Retrieval bugs blamed on the model. The answer is wrong because the wrong chunks were fetched (or the right ones were missing) - no model swap fixes that.
- Stale index, confident answers. The corpus moved on; the system serves outdated policy or pricing with full confidence.
- Bad chunking. Chunks split a key fact across two pieces, so neither is retrieved whole.
- Embedding drift. A new embedding model or shifting data changes what’s “similar,” silently degrading recall.
- No fallback. Nothing relevant matches, so the model hallucinates instead of saying “I don’t have that.”
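One common mitigation for the bad-chunking failure above is to let neighbouring chunks overlap, so a fact that straddles a boundary still appears whole in at least one chunk. The sketch below is a minimal illustration, not a recommended setting: the function name and the `size` and `overlap` defaults are assumptions, and it splits on words rather than tokens for simplicity.

```python
def chunk_with_overlap(words, size=200, overlap=50):
    """Split a list of words into windows of `size` words sharing `overlap` words.

    `size` and `overlap` are illustrative defaults, not tuned values.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Usage: chunks = chunk_with_overlap(document_text.split())
```

Whether overlap actually helps on your corpus is an empirical question; treat the parameters as candidates to score against a labelled set rather than as defaults to trust.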
Measure retrieval, not just answers
Separate two questions and measure them independently:
- Did we retrieve the right context? - retrieval quality.
- Did the model use it well? - generation quality (faithfulness).
For retrieval, score against a small labelled set (query → the doc ids that should be retrieved):
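def recall_at_k(eval_set, retrieve, k=5):
    """Share of queries where at least one relevant doc appears in the top k."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for case in eval_set:  # case = {"query": ..., "relevant_ids": [...]}
        got = {d.id for d in retrieve(case["query"], k=k)}
        if got & set(case["relevant_ids"]):  # any overlap counts as a hit
            hits += 1
    return hits / len(eval_set)  # share of queries that found a relevant doc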
Now chunking strategy, embedding model and k become evaluated choices, not guesses - compare them by their effect on recall_at_k before you touch the prompt.
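To make that comparison concrete, the sketch below scores several candidate retrievers on the same labelled set and reports mean precision@k and mean recall@k for each. Everything here is an assumption for illustration: the `Doc` tuple stands in for whatever your retriever returns, `compare` and `precision_recall_at_k` are hypothetical helper names, and precision is computed as found / k (some teams divide by the number of docs actually returned instead).

```python
from collections import namedtuple

# Hypothetical stand-in for a retrieved document; only `.id` is used here.
Doc = namedtuple("Doc", ["id", "score"])


def precision_recall_at_k(eval_set, retrieve, k=5):
    """Return (mean precision@k, mean recall@k) over a labelled eval set."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    precisions, recalls = [], []
    for case in eval_set:  # case = {"query": ..., "relevant_ids": [...]}
        relevant = set(case["relevant_ids"])
        got = {d.id for d in retrieve(case["query"], k=k)}
        found = len(relevant & got)
        precisions.append(found / k)
        recalls.append(found / len(relevant) if relevant else 0.0)
    n = len(eval_set)
    return sum(precisions) / n, sum(recalls) / n


def compare(eval_set, retrievers, k=5):
    """Score each named retriever on the same eval set."""
    return {
        name: precision_recall_at_k(eval_set, retrieve, k=k)
        for name, retrieve in retrievers.items()
    }


# Usage (retriever names are placeholders):
# compare(eval_set, {"chunks_200": retrieve_a, "chunks_500": retrieve_b}, k=5)
```

Holding the eval set fixed while varying one thing at a time (chunk size, embedding model, k) keeps the comparison interpretable.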
Fail safe when nothing matches
Most “confident hallucination” in RAG is an empty-retrieval problem. This is the grounding gate in the diagram above - gate on the retrieval score:
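# Inside the request handler, before the prompt is built
docs = retrieve(query, k=5)  # assumed sorted by score, best match first
if not docs or docs[0].score < MIN_SCORE:
    return fallback()  # ask to rephrase, or hand off - don't invent
context = format_context(docs)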
Log top_score, retrieved ids and an index_lag signal on every request so you can see retrieval health, not just final answers.
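A minimal sketch of that per-request log line, using only the standard library. The field names, the `log_retrieval` helper, the `Doc` stand-in and the logger name are all assumptions; `index_lag_s` is defined here as seconds since the index was last synced, which is one possible reading of an index_lag signal.

```python
import json
import logging
import time
from collections import namedtuple

# Hypothetical stand-in for a retrieved document.
Doc = namedtuple("Doc", ["id", "score"])

logger = logging.getLogger("rag.retrieval")  # logger name is an assumption


def log_retrieval(query_id, docs, index_updated_at, now=None):
    """Emit one structured log record describing a retrieval call.

    `index_updated_at` and `now` are Unix timestamps in seconds.
    """
    now = time.time() if now is None else now
    record = {
        "query_id": query_id,
        "top_score": docs[0].score if docs else None,  # None = empty retrieval
        "retrieved_ids": [d.id for d in docs],
        "index_lag_s": round(now - index_updated_at, 3),
    }
    logger.info(json.dumps(record))
    return record
```

Logging empty retrievals explicitly (a null top_score) makes it possible to count how often the fallback path fires.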
Keep the index fresh
Retrieval quality asks “did we fetch the right context?”; freshness asks “is it still true?” Both need monitoring - a refresh schedule, index-lag and chunk-age signals, deletion propagation, and drift detection. The RAG freshness monitoring checklist covers this field by field.
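As a rough illustration of the index-lag and chunk-age signals, the sketch below derives both from timestamps you would need to store alongside each chunk. The function name, the input shapes and the 30-day `max_age` threshold are assumptions, not values from any particular vector store.

```python
from datetime import datetime, timedelta, timezone


def freshness_report(indexed_at, source_updated_at, now, max_age=timedelta(days=30)):
    """Summarise index freshness.

    indexed_at:        {chunk_id: datetime when the chunk was last indexed}
    source_updated_at: datetime of the most recent change in the source corpus
    now:               current datetime (passed in so the check is testable)
    """
    if not indexed_at:
        return {"index_lag": None, "oldest_chunk_age": None, "stale_ids": []}
    newest_index = max(indexed_at.values())
    # Lag: how far the source has moved past the last indexing run (never negative).
    lag = max(source_updated_at - newest_index, timedelta(0))
    ages = {chunk_id: now - ts for chunk_id, ts in indexed_at.items()}
    stale = sorted(cid for cid, age in ages.items() if age > max_age)
    return {
        "index_lag": lag,
        "oldest_chunk_age": max(ages.values()),
        "stale_ids": stale,
    }
```

Alerting on `index_lag` catches a stalled refresh job, while `stale_ids` points at individual chunks that may deserve re-indexing or deletion.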
Minimal vs mature
| Aspect | Minimal | Production-grade |
|---|---|---|
| Retrieval | Top-k vector search | Hybrid search + reranker |
| Quality | Judge final answers | Measure recall/precision separately |
| Chunking | Fixed guess | Evaluated against the eval set |
| Freshness | Manual re-index | Scheduled + lag/drift alerts |
| No match | Model guesses | Safe fallback on low score |
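The "Hybrid search" cell can mean several things. One widely used way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), sketched below as an illustration rather than as this guide's prescribed method. The function name is hypothetical, and `rank_constant=60` is the value commonly used with RRF, not something tuned here.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each doc scores sum(1 / (rank_constant + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by doc id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: fused = reciprocal_rank_fusion([vector_ids, keyword_ids])[:5]
```

Because RRF only uses ranks, it sidesteps the problem that vector similarity and keyword scores live on different scales; the fused list can then be passed to a reranker and scored on the same eval set as any other configuration.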
Where this lives in a real system
See the RAG chatbot reference architecture for where the retriever, reranker and grounding check sit, the RAG / vector tools for the infrastructure, and the RAG items in the Production Checklist for the bar to clear.