Reference architecture

RAG chatbot in production

A retrieval-augmented chatbot that answers from your own knowledge base - with the retrieval, guardrails and observability that keep it honest at scale.

01 Architecture

The most common production LLM system, and the one most often shipped without the operational half. The model is the easy part; the failure modes live in retrieval, freshness and injection. This blueprint shows where each control belongs.

User query channel / API

Input guardrail injection · PII

Retriever vector DB + rerank

Prompt assembly versioned template

LLM with prompt cache

Output guardrail grounding check

Response traced + logged

02 When to use it

Use this when

Users ask questions answerable from a known body of documents
You can keep that corpus reasonably fresh
Answers must stay grounded in - or cite - your sources

Reach for something else when

The knowledge changes faster than you can re-index
The task needs multi-step reasoning or tool use (consider an agent)
There is no source of truth to ground answers on

03 Components

What's in the box.

API gateway

Auth, rate limiting and request shaping at the edge.

Input guardrails

Scan for prompt injection and strip/redact PII before retrieval.

Embeddings + vector DB

Stores document embeddings; refreshed on a known schedule.

Retriever + reranker

Fetches and re-orders the most relevant context for the query.

Prompt template

Versioned, tested and rollbackable - not edited in production.

LLM + cache

Generates the answer; caches system prompt and repeated context.

Output guardrails

Check the answer is grounded in retrieved context before sending.

Tracing + evals

Every request logged and traced; eval set runs on changes.

04 Failure modes

Where it breaks - and the fix.

Stale index

Scheduled re-indexing + freshness monitoring; alert on drift.

Wrong context retrieved

Measure retrieval precision/recall against a labelled set, not just final answers.

Hallucination when no context matches

Detect empty/low-score retrieval and return a safe fallback instead of guessing.

Prompt injection via documents

Treat retrieved content as untrusted; input guardrails and output validation.

Cost spike from large context

Cap context size, cache repeated chunks, rerank to fewer high-value passages.

05 Metrics to monitor

What good looks like, measured.

Retrieval precision / recall

Did you fetch the right context at all?
Groundedness / faithfulness

Is the answer supported by retrieved docs?
Index freshness lag

How stale is the content being served?
p95 latency

Retrieval + rerank + generation under real load.
Cost per query

Context size is the main token-spend driver.

06 MVP vs production-grade

Don't build everything on day one.

Ship the MVP column to get to users; the production column is what makes it durable. Choose deliberately which gaps you're leaving.

Aspect	MVP	Production-grade
Retrieval	Top-k vector search	Hybrid search + reranker, evaluated
Freshness	Manual re-index	Scheduled re-index + drift alerts
Grounding	Trust the model	Groundedness check + no-context fallback
Evaluation	Spot checks	Labelled retrieval + answer eval in CI
Cost	Whole corpus in context	Capped context + caching

07 Copy-paste schemas

Instrument it in minutes.

A starting point you can paste into your tracing and eval setup - then adapt to your stack.

Example trace schema

{
  "request_id": "req_123",
  "architecture": "rag-chatbot",
  "prompt_version": "rag-answer-v4",
  "retrieval_query": "refund window for damaged goods",
  "retrieved_documents": 5,
  "top_score": 0.83,
  "grounded_answer": true,
  "input_tokens": 1240,
  "output_tokens": 310,
  "latency_ms": 1840,
  "cost_usd": 0.0042
}

Example eval dataset row

{
  "input": "What's our refund window for damaged goods?",
  "expected_behavior": "Answer only from the retrieved policy documents",
  "must_include": [
    "30-day window",
    "damaged-goods exception"
  ],
  "must_not_include": [
    "invented policy",
    "unsupported timeframes"
  ],
  "risk_category": "grounding"
}

08 Checklist

Ship-ready when…

Retrieval quality is measured separately from final answers
The index is refreshed on a known schedule with freshness monitoring
There is a fallback when retrieval returns nothing relevant
Retrieved content is treated as untrusted (injection tested)
Context size and token spend are capped and cached
Every request is traced and replayable

Full production checklist → Score your maturity →

09 Related

Stack layers

RAG operations Observability Security Cost control

Deep dives

Ragops How To Monitor Retrieval Quality →Llm Observability What To Monitor In Production →