RAG chatbot in production
A retrieval-augmented chatbot that answers from your own knowledge base - with the retrieval, guardrails and observability that keep it honest at scale.
This is the most common production LLM system, and the one most often shipped without its operational half. The model is the easy part; the failure modes live in retrieval, freshness and prompt injection. This blueprint shows where each control belongs.
Use this when
- Users ask questions answerable from a known body of documents
- You can keep that corpus reasonably fresh
- Answers must stay grounded in - or cite - your sources
Reach for something else when
- The knowledge changes faster than you can re-index
- The task needs multi-step reasoning or tool use (consider an agent)
- There is no source of truth to ground answers on
What's in the box.
API gateway
Auth, rate limiting and request shaping at the edge.
Input guardrails
Scan for prompt injection and strip/redact PII before retrieval.
Embeddings + vector DB
Stores document embeddings; refreshed on a known schedule.
Retriever + reranker
Fetches and re-orders the most relevant context for the query.
Prompt template
Versioned, tested and rollbackable - not edited in production.
LLM + cache
Generates the answer; caches system prompt and repeated context.
Output guardrails
Check the answer is grounded in retrieved context before sending.
Tracing + evals
Every request logged and traced; eval set runs on changes.
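One way to read the component list is as a request path. The sketch below wires the middle stages together in the order listed, with each stage passed in as a plain callable. Every name here (`Stages`, `answer`, `FALLBACK`) is hypothetical and stands in for whatever your stack provides; the gateway, cache and tracing are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical fallback text; use whatever wording suits your product.
FALLBACK = "I couldn't find that in the knowledge base."


@dataclass
class Stages:
    """Placeholders for real components; each is just a callable here."""
    check_input: Callable[[str], str]              # input guardrails: returns a sanitized query
    retrieve: Callable[[str], List[str]]           # retriever + reranker: returns context passages
    generate: Callable[[str, List[str]], str]      # prompt template + LLM: returns a draft answer
    is_grounded: Callable[[str, List[str]], bool]  # output guardrail: is the draft supported?


def answer(query: str, stages: Stages) -> str:
    # 1. Input guardrails run before anything touches the index.
    clean = stages.check_input(query)
    # 2. Retrieval; an empty result short-circuits to the fallback.
    docs = stages.retrieve(clean)
    if not docs:
        return FALLBACK
    # 3. Generation from the retrieved context only.
    draft = stages.generate(clean, docs)
    # 4. Output guardrail: refuse rather than send an unsupported answer.
    if not stages.is_grounded(draft, docs):
        return FALLBACK
    return draft
```

Passing the stages in as callables is a design choice for the sketch: it makes each control swappable and lets you test the ordering with stubs before any real model or index is attached.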
Where it breaks - and the fix.
What good looks like, measured.
- Retrieval precision / recall: Did you fetch the right context at all?
- Groundedness / faithfulness: Is the answer supported by retrieved docs?
- Index freshness lag: How stale is the content being served?
- p95 latency: Retrieval + rerank + generation under real load.
- Cost per query: Context size is the main token-spend driver.
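Two of these metrics are simple enough to compute by hand. The sketch below assumes you have, per query, a ranked list of retrieved document IDs and a labelled set of relevant IDs, plus a list of end-to-end latencies in milliseconds. The function names and the nearest-rank percentile method are choices made for illustration, not part of the blueprint.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision and recall over the top-k retrieved document IDs."""
    top = list(retrieved[:k])
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Measuring retrieval on its own like this tells you whether a bad answer came from the retriever or from generation.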
Don't build everything on day one.
Ship the MVP column to get to users; the production column is what makes it durable. Choose deliberately which gaps you're leaving.
| Aspect | MVP | Production-grade |
|---|---|---|
| Retrieval | Top-k vector search | Hybrid search + reranker, evaluated |
| Freshness | Manual re-index | Scheduled re-index + drift alerts |
| Grounding | Trust the model | Groundedness check + no-context fallback |
| Evaluation | Spot checks | Labelled retrieval + answer eval in CI |
| Cost | Whole corpus in context | Capped context + caching |
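The Grounding and Cost rows can share one small piece of logic: keep only passages above a relevance threshold, stop at a token budget, and treat an empty result as the signal to fall back. The following is a minimal sketch under stated assumptions: `build_context`, the `0.5` threshold and the `1500` budget are hypothetical, and whitespace word counts are a crude proxy for real tokenizer counts.

```python
from typing import List, Tuple


def build_context(
    scored_docs: List[Tuple[float, str]],
    min_score: float = 0.5,   # assumed relevance cutoff; tune against your eval set
    max_tokens: int = 1500,   # assumed context budget
) -> List[str]:
    """Pick the best-scoring passages that clear the threshold and fit the budget."""
    selected: List[str] = []
    used = 0
    for score, text in sorted(scored_docs, key=lambda pair: pair[0], reverse=True):
        if score < min_score:
            break  # everything after this is less relevant still
        cost = len(text.split())  # word count as a rough stand-in for tokens
        if used + cost > max_tokens:
            continue  # too big for what's left; a shorter passage may still fit
        selected.append(text)
        used += cost
    return selected
```

An empty list from `build_context` means nothing relevant was retrieved, which is the point at which a no-context fallback answer is safer than asking the model to improvise.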
Instrument it in minutes.
A starting point you can paste into your tracing and eval setup - then adapt to your stack.
{
  "request_id": "req_123",
  "architecture": "rag-chatbot",
  "prompt_version": "rag-answer-v4",
  "retrieval_query": "refund window for damaged goods",
  "retrieved_documents": 5,
  "top_score": 0.83,
  "grounded_answer": true,
  "input_tokens": 1240,
  "output_tokens": 310,
  "latency_ms": 1840,
  "cost_usd": 0.0042
}

{
  "input": "What's our refund window for damaged goods?",
  "expected_behavior": "Answer only from the retrieved policy documents",
  "must_include": [
    "30-day window",
    "damaged-goods exception"
  ],
  "must_not_include": [
    "invented policy",
    "unsupported timeframes"
  ],
  "risk_category": "grounding"
}

Ship-ready when…
- Retrieval quality is measured separately from final answers
- The index is refreshed on a known schedule with freshness monitoring
- There is a fallback when retrieval returns nothing relevant
- Retrieved content is treated as untrusted (injection tested)
- Context size and token spend are capped and cached
- Every request is traced and replayable
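To make the eval half of this checklist concrete, an eval case with `must_include` and `must_not_include` lists (as in the sample above) can be checked against a generated answer with a few lines of code. This is a rough sketch: `check_case` is a hypothetical helper, and case-insensitive substring matching is only a first pass. Entries such as "invented policy" describe a behavior, not a literal string, so they would need a model-graded or human review instead.

```python
from typing import Dict, List


def check_case(case: Dict, answer: str) -> List[str]:
    """Return a list of failures; an empty list means the case passed."""
    text = answer.lower()
    failures: List[str] = []
    # Required phrases must appear somewhere in the answer.
    for phrase in case.get("must_include", []):
        if phrase.lower() not in text:
            failures.append(f"missing: {phrase}")
    # Forbidden phrases must not appear at all.
    for phrase in case.get("must_not_include", []):
        if phrase.lower() in text:
            failures.append(f"forbidden: {phrase}")
    return failures
```

Run over a labelled set on every prompt, retriever or index change, a check like this gives a cheap regression signal before the slower graded evals finish.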