The LLMOps Stack
Reliability in an LLM application is rarely a model problem. It is an operations problem spread across eight layers - each one a place where a production system can quietly fail.
Versions, change testing and rollback. Treat prompts like code.
Prompts are production code. Without versioning you cannot reproduce a regression, attribute a quality change, or roll back a bad edit. Track every prompt and config change, test it against an eval set before it ships, and keep the ability to revert in seconds.
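A minimal sketch of what versioning and rollback could look like, assuming a single prompt held in memory. `PromptRegistry` and its methods are hypothetical names used for illustration, not part of any library; a real setup would persist versions in git or a database.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """In-memory version history for one prompt (illustrative only)."""

    versions: list[str] = field(default_factory=list)
    active: int = -1  # index into versions; -1 means nothing is published yet

    def publish(self, template: str) -> str:
        """Store a new version, make it active, and return its version id."""
        self.versions.append(template)
        self.active = len(self.versions) - 1
        return self.version_id(self.active)

    def rollback(self) -> str:
        """Re-activate the previous version; history is kept, not deleted."""
        if self.active <= 0:
            raise ValueError("no earlier version to roll back to")
        self.active -= 1
        return self.version_id(self.active)

    def current(self) -> str:
        """Return the template that is live right now."""
        if self.active < 0:
            raise ValueError("no prompt has been published")
        return self.versions[self.active]

    def version_id(self, index: int) -> str:
        # Content hash: identical text always maps to the same id,
        # so a logged id is enough to reproduce the exact prompt later.
        digest = hashlib.sha256(self.versions[index].encode("utf-8"))
        return digest.hexdigest()[:12]
```

Logging the version id next to every model response is what makes a regression attributable: you can tell which prompt text produced which output, and `rollback()` restores the previous text in one call.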
Accuracy, hallucination, safety and relevance - measured, not guessed.
You cannot improve what you do not measure. An eval dataset turns "it feels better" into a number you can defend. Run it on every prompt, model and retrieval change, and gate releases on it the same way you gate code on tests.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def faithfulness_judge(answer: str, context: str) -> dict:
    # Ask a judge model whether every claim in the answer is grounded in the context.
    prompt = (
        "Using ONLY the context, is every claim in the answer supported?\n"
        'Reply JSON: {"verdict": "supported|unsupported", "unsupported": [...]}\n\n'
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # constrain the reply to valid JSON
    )
    return json.loads(out.choices[0].message.content)

Logs, traces, latency, token usage and failure modes in production.
Observability is the difference between knowing a request was slow and knowing which retrieval step, tool call or model hop caused it. Log every request and response, trace multi-step chains end to end, and watch latency, cost and failure modes as first-class metrics.
# pip install langfuse - drop-in replacement for the OpenAI client
from langfuse.openai import OpenAI

client = OpenAI()  # Langfuse keys are read from the LANGFUSE_* environment variables
question = "What is our refund policy?"  # placeholder input
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": question}],
)
# Latency, token usage and cost are now captured per call - no extra code.

Token spend, caching, routing and right-sized model selection.
Token spend compounds quietly until a finance review forces a panic. Account for cost per request, cache what repeats, route easy traffic to smaller models, and put alerts on the budget before it is breached - not after.
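The four habits above can be sketched in a few lines. Everything here is an assumption for illustration: the model names and prices are made up, the length-based router is deliberately crude, and `call_model` stands in for whatever client function you actually use.

```python
# Hypothetical prices in USD per 1M tokens; substitute your provider's real rates.
PRICES = {
    "small-model": {"input": 0.40, "output": 1.60},
    "large-model": {"input": 2.00, "output": 8.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD, from token counts and the price table."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def route(prompt: str, max_small_chars: int = 500) -> str:
    """Crude router: short prompts go to the small model, the rest to the large one."""
    return "small-model" if len(prompt) <= max_small_chars else "large-model"


_cache: dict[tuple[str, str], str] = {}


def cached_completion(model: str, prompt: str, call_model) -> str:
    """Exact-match cache: only call the model for (model, prompt) pairs not seen before."""
    key = (model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]


class BudgetTracker:
    """Accumulates spend and signals once a fraction of the budget is used."""

    def __init__(self, budget_usd: float, alert_fraction: float = 0.8):
        self.budget_usd = budget_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> bool:
        """Add a cost; return True when spend has reached the alert threshold."""
        self.spent_usd += cost_usd
        return self.spent_usd >= self.budget_usd * self.alert_fraction
```

Alerting at 80% of the budget rather than 100% is the point of the last class: the warning arrives while there is still room to act. In practice a router would use a better signal than prompt length, such as a classifier or task type.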
Retrieval quality, chunking, embeddings and index freshness.
Most "the model is wrong" bugs are really retrieval bugs. Measure whether the right context was fetched, tune chunking and embeddings against that signal, and keep the index fresh so answers do not quietly go stale.
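"Whether the right context was fetched" can be made concrete with two standard retrieval metrics. This is a sketch that assumes you have a labeled set of relevant document ids per query; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these over an eval set of queries gives a number to tune chunk size and embedding model against. A drop in recall with no prompt or model change points at the retriever or a stale index rather than the LLM.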
Prompt injection, data leakage and access control at every boundary.
An LLM with tools and data access is an attack surface. Treat every input as untrusted, scan for prompt injection and PII leakage, and enforce access control at each boundary - retrieval, tools and output alike.
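A rough sketch of checks at two of those boundaries, under stated assumptions: documents carry a set of allowed groups, PII is limited to email addresses, and injection detection is a single regular expression. `Document`, `authorized`, `redact_emails` and `flag_injection` are hypothetical names, and a pattern match like this catches only the most obvious attacks.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_groups: frozenset[str]


def authorized(docs: list[Document], user_groups: set[str]) -> list[Document]:
    """Retrieval boundary: keep only documents the caller is allowed to read."""
    return [d for d in docs if d.allowed_groups & user_groups]


EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text: str) -> str:
    """Output boundary: mask email addresses before text leaves the system."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


# Naive heuristic only: real injection attempts are far more varied than this.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE
)


def flag_injection(text: str) -> bool:
    """Input boundary: flag text containing a known injection phrase."""
    return bool(SUSPICIOUS.search(text))
```

The access filter matters most here because it runs before the model sees anything: a document that was never retrieved cannot be leaked by a successful injection.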
Audit trails, approvals and compliance you can show an auditor.
When a regulator or customer asks "who approved this and what did it do," you need an answer in records, not memory. Keep audit trails, define approval gates for high-risk use cases, and make human review a designed step rather than an afterthought.
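One way to get "records, not memory" is an append-only log in which each entry includes the hash of the one before it, so later edits are detectable. This is a minimal in-memory sketch; `AuditLog` and `run_high_risk` are hypothetical names, and the rule that an approver must differ from the requester is an assumed policy.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def _digest(body: dict) -> str:
    """Stable SHA-256 of an entry's fields (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only, hash-chained record of who did what."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, actor: str, action: str, detail: str) -> dict:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "detail": detail,
            "prev_hash": self.entries[-1]["hash"] if self.entries else "",
        }
        entry["hash"] = _digest(entry)  # hash covers every field above
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was altered or removed."""
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


def run_high_risk(log: AuditLog, task: str, requested_by: str,
                  approved_by: Optional[str] = None) -> bool:
    """Approval gate: proceed only with a sign-off from someone else."""
    if approved_by is None or approved_by == requested_by:
        log.record(requested_by, "blocked", task)
        return False
    log.record(approved_by, "approved", task)
    return True
```

Both outcomes are logged, so the trail shows blocked attempts as well as approvals. A production version would write to durable, access-controlled storage rather than a Python list.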
CI/CD for LLM apps, staging and production environments.
Shipping an LLM change is a deploy like any other - it deserves CI, a staging environment and a rollback plan. Wire evals into the pipeline, roll out progressively, and make reverting a one-step operation.
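Wiring evals into the pipeline can be as simple as turning the eval pass rate into a process exit code. The sketch below assumes each eval case yields a pass/fail boolean and that a baseline pass rate from the last release is stored somewhere; the function names and the 2-point tolerance are illustrative choices.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of eval cases that passed; an empty run counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def gate(results: list[bool], baseline: float, max_regression: float = 0.02) -> bool:
    """Pass only if the pass rate is within max_regression of the baseline."""
    return pass_rate(results) >= baseline - max_regression


def exit_code(results: list[bool], baseline: float) -> int:
    """0 lets the pipeline continue; 1 fails the build, like a failing test suite."""
    rate = pass_rate(results)
    ok = gate(results, baseline)
    print(f"eval pass rate {rate:.2%} vs baseline {baseline:.2%}: {'PASS' if ok else 'FAIL'}")
    return 0 if ok else 1
```

A CI job would end with `sys.exit(exit_code(results, baseline))`, so a regression blocks the deploy without anyone having to read a dashboard. Treating an empty run as a failure is deliberate: a broken eval harness should not wave changes through.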