Reference architecture

RAG chatbot in production

A retrieval-augmented chatbot that answers from your own knowledge base - with the retrieval, guardrails and observability that keep it honest at scale.

01 Architecture

This is the most common production LLM system, and the one most often shipped without the operational half. The model is the easy part; the failure modes live in retrieval, freshness and injection. This blueprint shows where each control belongs.

02 When to use it

Use this when

  • Users ask questions answerable from a known body of documents
  • You can keep that corpus reasonably fresh
  • Answers must stay grounded in - or cite - your sources

Reach for something else when

  • The knowledge changes faster than you can re-index
  • The task needs multi-step reasoning or tool use (consider an agent)
  • There is no source of truth to ground answers on

03 Components

What's in the box.

API gateway

Auth, rate limiting and request shaping at the edge.

Input guardrails

Scan for prompt injection and strip/redact PII before retrieval.
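
A minimal sketch of the redaction half of this step, assuming regex-based detection of emails and phone-like digit runs. The pattern names and placeholders are illustrative; real PII coverage (names, addresses, account numbers) needs a dedicated detector, and injection scanning is not shown here.

```python
import re

# Illustrative patterns only; they catch common email and phone shapes, nothing more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


query = "Mail jane.doe@example.com or call +1 415-555-0100 about my refund"
redacted = redact_pii(query)
```

Running redaction before retrieval means the raw values never reach the embedding call, the vector store query or the trace log.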

Embeddings + vector DB

Stores document embeddings; refreshed on a known schedule.

Retriever + reranker

Fetches and re-orders the most relevant context for the query.

Prompt template

Versioned, tested and rollbackable - not edited in production.

LLM + cache

Generates the answer; caches system prompt and repeated context.

Output guardrails

Check the answer is grounded in retrieved context before sending.
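
One way to approximate this check, sketched under a strong assumption: that word overlap between each answer sentence and the retrieved chunks is a usable proxy for support. Production systems usually replace this with an entailment model or an LLM judge; the function name and the 0.6 threshold are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer: str, chunks: list[str], min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose token overlap with the context is below min_overlap."""
    context: set[str] = set()
    for chunk in chunks:
        context |= _tokens(chunk)

    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue  # skip empty fragments
        overlap = len(words & context) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


chunks = ["Damaged goods can be returned within 30 days of delivery."]
answer = "Damaged goods can be returned within 30 days. Shipping is always free worldwide."
flagged = unsupported_sentences(answer, chunks)
```

If the returned list is non-empty, the caller can block the answer, strip the flagged sentences, or fall back to a safe response.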

Tracing + evals

Every request logged and traced; eval set runs on changes.

04 Failure modes

Where it breaks - and the fix.

  • Stale index
    Scheduled re-indexing + freshness monitoring; alert on drift.
  • Wrong context retrieved
    Measure retrieval precision/recall against a labelled set, not just final answers.
  • Hallucination when no context matches
    Detect empty/low-score retrieval and return a safe fallback instead of guessing.
  • Prompt injection via documents
    Treat retrieved content as untrusted; apply input guardrails and validate the output.
  • Cost spike from large context
    Cap context size, cache repeated chunks, rerank to fewer high-value passages.
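
Two of the fixes above, the no-context fallback and the context cap, can share one selection step. The sketch below assumes passages arrive with a relevance score where higher is better; `Passage`, `select_context`, `respond`, the thresholds and the whitespace token estimate are all hypothetical, and `generate` stands in for whatever function calls your model.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    score: float  # reranker or similarity score; higher is better (assumption)


FALLBACK = "I couldn't find this in the knowledge base."


def select_context(passages, min_score=0.5, max_tokens=1500):
    """Keep the best-scoring passages that clear min_score and fit the token budget."""
    kept, used = [], 0
    for p in sorted(passages, key=lambda p: p.score, reverse=True):
        if p.score < min_score:
            break  # everything after this is weaker still
        cost = len(p.text.split())  # crude token estimate; swap in a real tokenizer
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; try smaller passages
        kept.append(p)
        used += cost
    return kept


def respond(question, passages, generate):
    """Return the fallback when nothing relevant was retrieved, else call the model."""
    context = select_context(passages)
    if not context:
        return FALLBACK
    return generate(question, [p.text for p in context])


passages = [Passage("a b c", 0.9), Passage("d e f g", 0.7), Passage("h i", 0.2)]
capped = select_context(passages, min_score=0.5, max_tokens=5)
```

An empty selection is the signal to skip generation entirely, which also avoids paying for a call that would most likely guess.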

05 Metrics to monitor

What good looks like, measured.

  • Retrieval precision / recall
    Did you fetch the right context at all?
  • Groundedness / faithfulness
    Is the answer supported by retrieved docs?
  • Index freshness lag
    How stale is the content being served?
  • p95 latency
    Retrieval + rerank + generation under real load.
  • Cost per query
    Context size is the main token-spend driver.
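
Two of these metrics are simple enough to compute by hand. The sketch below assumes a labelled set that maps each query to the IDs of its relevant documents, and uses the nearest-rank definition of a percentile; the function names are illustrative.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall of the top-k retrieved IDs against a labelled relevant set."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


precision, recall = precision_recall_at_k(["d1", "d2", "d3", "d4"], {"d2", "d4", "d9"}, k=3)
tail_latency = p95(list(range(1, 101)))
```

Tracking these per release makes it possible to tell a retrieval regression from a generation regression.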

06 MVP vs production-grade

Don't build everything on day one.

Ship the MVP column to get to users; the production column is what makes it durable. Choose deliberately which gaps you're leaving.

Aspect     | MVP                     | Production-grade
Retrieval  | Top-k vector search     | Hybrid search + reranker, evaluated
Freshness  | Manual re-index         | Scheduled re-index + drift alerts
Grounding  | Trust the model         | Groundedness check + no-context fallback
Evaluation | Spot checks             | Labelled retrieval + answer eval in CI
Cost       | Whole corpus in context | Capped context + caching

07 Copy-paste schemas

Instrument it in minutes.

A starting point you can paste into your tracing and eval setup - then adapt to your stack.

Example trace schema
{
  "request_id": "req_123",
  "architecture": "rag-chatbot",
  "prompt_version": "rag-answer-v4",
  "retrieval_query": "refund window for damaged goods",
  "retrieved_documents": 5,
  "top_score": 0.83,
  "grounded_answer": true,
  "input_tokens": 1240,
  "output_tokens": 310,
  "latency_ms": 1840,
  "cost_usd": 0.0042
}
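
If your service is written in Python, the same record can be typed so missing fields fail at construction rather than at query time. This is a sketch, not a prescribed format: the `RagTrace` class and `to_json_line` method are hypothetical names, and the fields simply mirror the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RagTrace:
    request_id: str
    architecture: str
    prompt_version: str
    retrieval_query: str
    retrieved_documents: int
    top_score: float
    grounded_answer: bool
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float

    def to_json_line(self) -> str:
        """Serialise to a single JSON line suitable for a log stream."""
        return json.dumps(asdict(self), sort_keys=True)


trace = RagTrace(
    request_id="req_123",
    architecture="rag-chatbot",
    prompt_version="rag-answer-v4",
    retrieval_query="refund window for damaged goods",
    retrieved_documents=5,
    top_score=0.83,
    grounded_answer=True,
    input_tokens=1240,
    output_tokens=310,
    latency_ms=1840,
    cost_usd=0.0042,
)
line = trace.to_json_line()
```

One JSON line per request keeps the log greppable and easy to load into whatever tracing backend you use.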

Example eval dataset row
{
  "input": "What's our refund window for damaged goods?",
  "expected_behavior": "Answer only from the retrieved policy documents",
  "must_include": [
    "30-day window",
    "damaged-goods exception"
  ],
  "must_not_include": [
    "invented policy",
    "unsupported timeframes"
  ],
  "risk_category": "grounding"
}
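
A row like this can be checked mechanically if the phrases are treated as literal, case-insensitive substrings. That is an assumption of this sketch: the `must_not_include` entries in the example above read as descriptions rather than literal strings, so they would need a judge model or human review instead. The `check_answer` name and the sample row are illustrative.

```python
def check_answer(row: dict, answer: str) -> list[str]:
    """Return failure messages for one eval row; an empty list means it passed."""
    text = answer.lower()
    failures = []
    for phrase in row.get("must_include", []):
        if phrase.lower() not in text:
            failures.append(f"missing: {phrase}")
    for phrase in row.get("must_not_include", []):
        if phrase.lower() in text:
            failures.append(f"forbidden: {phrase}")
    return failures


# Hypothetical row with literal phrases, for illustration only.
row = {"must_include": ["30-day window"], "must_not_include": ["60-day"]}
passed = check_answer(row, "We offer a 30-day window for damaged goods.")
failed = check_answer(row, "We offer a 60-day period.")
```

Run the check over the whole dataset on every prompt or retrieval change and fail the build when any row returns failures.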

08 Checklist

Ship-ready when…

  • Retrieval quality is measured separately from final answers
  • The index is refreshed on a known schedule with freshness monitoring
  • There is a fallback when retrieval returns nothing relevant
  • Retrieved content is treated as untrusted (injection tested)
  • Context size and token spend are capped and cached
  • Every request is traced and replayable