Pillar guide

The LLMOps Stack

A reliable LLM application is rarely a model problem. It is an operations problem spread across eight layers - each one a place a production system can quietly fail.

Prototype → production: 08 Deployment · 07 Governance · 06 Security · 05 RAG operations · 04 Cost control · 03 Observability · 02 Evaluation · 01 Prompt management
01 Prompt management

Versions, change testing and rollback. Treat prompts like code.

Prompt versioning · Change testing · Rollback · A/B comparison

Prompts are production code. Without versioning you cannot reproduce a regression, attribute a quality change, or roll back a bad edit. Track every prompt and config change, test it against an eval set before it ships, and keep the ability to revert in seconds.
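
As a rough illustration of versioning and rollback, the sketch below keeps every published prompt text as a version identified by a content hash. `PromptRegistry` and its methods are hypothetical names invented for this example, and the in-memory store stands in for whatever database or Git repository a real team would use.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory prompt store; not a real library API."""

    # prompt name -> ordered list of (version_id, prompt_text)
    versions: dict[str, list[tuple[str, str]]] = field(default_factory=dict)
    # prompt name -> index of the version currently served
    active: dict[str, int] = field(default_factory=dict)

    def publish(self, name: str, text: str) -> str:
        """Store a new version and make it active. The id is a content hash."""
        version_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self.versions.setdefault(name, []).append((version_id, text))
        self.active[name] = len(self.versions[name]) - 1
        return version_id

    def get(self, name: str) -> tuple[str, str]:
        """Return (version_id, prompt_text) for the active version."""
        return self.versions[name][self.active[name]]

    def rollback(self, name: str) -> str:
        """Re-activate the previous version and return its id."""
        if self.active[name] == 0:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        self.active[name] -= 1
        return self.versions[name][self.active[name]][0]
```

Logging the version id alongside each model call is what makes a regression attributable to a specific prompt edit; rollback here only moves a pointer, which is why it can be fast.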

02 Evaluation

Accuracy, hallucination, safety and relevance - measured, not guessed.

Eval datasets · LLM-as-judge · Regression testing · Safety checks

You cannot improve what you do not measure. An eval dataset turns "it feels better" into a number you can defend. Run it on every prompt, model and retrieval change, and gate releases on it the same way you gate code on tests.
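
One way to picture a release gate is sketched below. The eval cases, the substring-match scoring and the names `pass_rate` and `release_gate` are all assumptions made for illustration; a real eval set would be larger and would usually be scored with something stricter than substring matching.

```python
from typing import Callable

# Hypothetical eval cases; a real set would come from labelled, reviewed examples.
EVAL_SET = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "How many days are in a week?", "expected": "7"},
    {"question": "What colour is a clear daytime sky?", "expected": "blue"},
]


def pass_rate(generate: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of cases whose expected string appears in the generated answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1
        for case in cases
        if case["expected"].lower() in generate(case["question"]).lower()
    )
    return passed / len(cases)


def release_gate(candidate: float, baseline: float, max_drop: float = 0.0) -> bool:
    """Allow the release only if the candidate score has not regressed
    by more than max_drop against the current baseline."""
    return candidate >= baseline - max_drop
```

In a pipeline, `generate` would wrap the candidate prompt, model and retrieval setup, and a `False` from `release_gate` would fail the build the same way a failing unit test does.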

LLM-as-judge faithfulness check (Python)
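import json

from openai import OpenAI  # pip install openai

# Reads OPENAI_API_KEY from the environment.
client = OpenAI()


def faithfulness_judge(answer: str, context: str) -> dict:
    """Ask a judge model whether every claim in `answer` is supported by `context`."""
    prompt = (
        "Using ONLY the context, is every claim in the answer supported?\n"
        'Reply JSON: {"verdict": "supported|unsupported", "unsupported": [...]}\n\n'
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        # JSON mode requires the word "JSON" in the prompt, which it has above.
        response_format={"type": "json_object"},
    )
    # Parse the judge's JSON reply into a dict with "verdict" and "unsupported".
    return json.loads(out.choices[0].message.content)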
03 Observability

Logs, traces, latency, token usage and failure modes in production.

Request/response logs · Tracing · Latency & token metrics · Failure analysis

Observability is the difference between knowing a request was slow and knowing which retrieval step, tool call or model hop caused it. Log every request and response, trace multi-step chains end to end, and watch latency, cost and failure modes as first-class metrics.

Instrument a call with Langfuse (Python)
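# pip install langfuse openai
# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and OPENAI_API_KEY are set
# in the environment.
from langfuse.openai import OpenAI  # drop-in replacement for the OpenAI client

client = OpenAI()
question = "What does the refund policy say about digital goods?"  # example input
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": question}],
)
# Latency, token usage and cost are now captured per call - no extra tracing code.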
04 Cost control

Token spend, caching, routing and right-sized model selection.

Token accounting · Prompt caching · Model routing · Budget alerts

Token spend compounds quietly until a finance review forces a panic. Account for cost per request, cache what repeats, route easy traffic to smaller models, and put alerts on the budget before it is breached - not after.

05 RAG operations

Retrieval quality, chunking, embeddings and index freshness.

Retrieval quality · Chunking strategy · Embedding drift · Index freshness

Most "the model is wrong" bugs are really retrieval bugs. Measure whether the right context was fetched, tune chunking and embeddings against that signal, and keep the index fresh so answers do not quietly go stale.
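
To make "was the right context fetched" measurable, a common starting point is recall@k over a labelled set of queries, paired with a simple staleness check on the index. The sketch below assumes document ids are strings and that timestamps for indexing and source updates are available; the function names are illustrative only.

```python
from datetime import datetime


def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(cases: list[dict], k: int) -> float:
    """Average recall@k over cases shaped like {"retrieved": [...], "relevant": [...]}."""
    if not cases:
        raise ValueError("no eval cases supplied")
    scores = [recall_at_k(c["retrieved"], c["relevant"], k) for c in cases]
    return sum(scores) / len(scores)


def stale_documents(
    indexed_at: dict[str, datetime], source_updated_at: dict[str, datetime]
) -> list[str]:
    """Ids whose source changed after they were last indexed, or that were never indexed."""
    return sorted(
        doc_id
        for doc_id, updated in source_updated_at.items()
        if doc_id not in indexed_at or indexed_at[doc_id] < updated
    )
```

Tracking mean recall@k before and after a chunking or embedding change gives a direct signal for tuning, independent of how the generation step behaves.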

06 Security

Prompt injection, data leakage and access control at every boundary.

Prompt injection · Data leakage / PII · Access control · Guardrails

An LLM with tools and data access is an attack surface. Treat every input as untrusted, scan for prompt injection and PII leakage, and enforce access control at each boundary - retrieval, tools and output alike.
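
The sketch below shows where such checks might sit, not how strong they need to be. The phrase list, the email-only redaction and the role model are assumptions for illustration; pattern matching alone is not an adequate defence against prompt injection, and real PII detection covers far more than email addresses.

```python
import re

# Illustrative phrases only; attackers will not limit themselves to these.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
    "reveal your system prompt",
)

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def flag_injection(text: str) -> list[str]:
    """Input boundary: return the suspicious phrases found in untrusted text."""
    lowered = text.lower()
    return [phrase for phrase in SUSPICIOUS_PHRASES if phrase in lowered]


def redact_emails(text: str) -> str:
    """Output boundary: mask email addresses before text leaves the system."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def authorized_docs(docs: list[dict], user_roles: set[str]) -> list[dict]:
    """Retrieval boundary: keep only documents the caller's roles may read.

    Each doc is assumed to carry an "allowed_roles" set.
    """
    return [doc for doc in docs if doc["allowed_roles"] & user_roles]
```

Filtering at retrieval time matters because a document the model never sees cannot be leaked by a successful injection later in the chain.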

07 Governance

Audit trails, approvals and compliance you can show an auditor.

Audit trails · Approvals · Compliance · Human-in-the-loop

When a regulator or customer asks "who approved this and what did it do," you need an answer in records, not memory. Keep audit trails, define approval gates for high-risk use cases, and make human review a designed step rather than an afterthought.
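
One possible shape for "an answer in records" is an append-only log in which each entry includes a hash of the previous one, so later edits are detectable. This is a minimal sketch under stated assumptions: the field names, the `requires_approval` rule and the in-memory list are hypothetical, and a production system would persist entries to write-once storage.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(record: dict) -> str:
    """Stable SHA-256 over a record's JSON form (keys sorted)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()


def append_audit(log: list[dict], actor: str, action: str, detail: dict) -> dict:
    """Append an entry chained to the previous entry's hash."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "detail": detail,  # must be JSON-serialisable
        "prev_hash": log[-1]["hash"] if log else "",
    }
    record["hash"] = _digest(record)  # hash computed before the "hash" key exists
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """True if no entry has been altered, removed or reordered."""
    prev_hash = ""
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or _digest(body) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


def requires_approval(risk_level: str) -> bool:
    """Hypothetical gate: high-risk use cases need a recorded human approval."""
    return risk_level == "high"
```

Recording the approval itself as a log entry, with the approver as `actor`, answers both halves of the auditor's question from the same trail.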

08 Deployment

CI/CD for LLM apps, staging and production environments.

CI/CD for prompts · Staging environments · Progressive rollout · Rollback

Shipping an LLM change is a deploy like any other - it deserves CI, a staging environment and a rollback plan. Wire evals into the pipeline, roll out progressively, and make reverting a one-step operation.
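
A progressive rollout can be as simple as deterministic bucketing: hash each user id into one of 100 buckets and serve the candidate version to buckets below the current percentage. The sketch below assumes string user ids; `in_rollout` and `pick_prompt_version` are illustrative names, not part of any deployment tool.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash to a stable bucket in [0, 100); the same user always gets the same bucket.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent


def pick_prompt_version(user_id: str, stable: str, candidate: str, percent: int) -> str:
    """Serve the candidate to the rollout cohort and the stable version to everyone else."""
    return candidate if in_rollout(user_id, percent) else stable
```

Because buckets are stable, raising the percentage only ever adds users to the cohort, and setting it to 0 returns everyone to the stable version in a single configuration change.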
