LLMOps aka Large Language Model Operations

LLMOps (Large Language Model Operations) is the discipline of running LLM applications in production: evaluation, observability, prompt management, cost control, RAG operations, security, governance and deployment. It is the set of practices that turns a working prototype into a system you can run, measure and trust. See the LLMOps Stack.

Related: MLOps · Evals · LLM observability

LLM observability

LLM observability is the ability to see what an LLM application is doing in production - capturing every request and response, tracing multi-step chains, and tracking latency, token usage and failure modes. It is the data foundation the other LLMOps layers depend on. See what to monitor.
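
As a rough illustration of the data captured per call, the sketch below wraps a model call and records the request, response, token usage, latency and any failure. The `LLMCallRecord` fields and the `call_model` callable (assumed to return the response text plus input and output token counts) are hypothetical, not any particular vendor's API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LLMCallRecord:
    request_id: str
    model: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    error: Optional[str] = None


def record_call(
    request_id: str,
    model: str,
    prompt: str,
    call_model: Callable[[str], Tuple[str, int, int]],
) -> LLMCallRecord:
    """Run call_model(prompt) and capture what happened, including failures."""
    response, input_tokens, output_tokens, error = "", 0, 0, None
    start = time.perf_counter()
    try:
        response, input_tokens, output_tokens = call_model(prompt)
    except Exception as exc:  # keep the failure mode instead of losing it
        error = repr(exc)
    latency_ms = (time.perf_counter() - start) * 1000
    return LLMCallRecord(request_id, model, prompt, response,
                         input_tokens, output_tokens, latency_ms, error)
```

In a real system each record would be shipped to a log or trace store; here it is simply returned to the caller.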

Related: Agent tracing · Token cost · Hallucination rate

Prompt versioning aka Prompt management

Prompt versioning treats prompts like code: every change is tracked with an author and reason, tested against evals before release, and revertible in one step. It is what lets you reproduce a regression or attribute a quality change to a specific edit. See treat prompts like code.
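
A minimal in-memory sketch of the idea, assuming a hypothetical `PromptRegistry` class: each published version carries an author and a reason, and `revert` moves the active pointer back one version. A production setup would more likely persist this in git or a database and run evals before `publish`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str
    author: str
    reason: str


class PromptRegistry:
    def __init__(self) -> None:
        self._history: List[PromptVersion] = []
        self._active = 0  # 1-based version number; 0 means nothing published

    def publish(self, text: str, author: str, reason: str) -> PromptVersion:
        """Append a new immutable version and make it active."""
        entry = PromptVersion(len(self._history) + 1, text, author, reason)
        self._history.append(entry)
        self._active = entry.version
        return entry

    def active(self) -> PromptVersion:
        if self._active == 0:
            raise LookupError("no prompt published yet")
        return self._history[self._active - 1]

    def revert(self) -> PromptVersion:
        """Make the previous version active again in one step."""
        if self._active <= 1:
            raise ValueError("no earlier version to revert to")
        self._active -= 1
        return self.active()
```

Because history is never rewritten, a regression can be traced to the specific version, author and reason that introduced it.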

Related: Evals · LLMOps

Evals aka Evaluation, Eval dataset

Evals are the LLM equivalent of a test suite: a curated dataset plus scoring that measures accuracy, faithfulness, safety and relevance. Running them on every change - and blocking releases on regressions - turns “it feels better” into a defensible number. See how to evaluate.
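
The release gate described above can be sketched as follows. The exact-match scorer, the dataset shape (`input` / `expected` keys) and the `generate` callable are assumptions for illustration; real evals usually use richer scorers for faithfulness, safety and relevance.

```python
from typing import Callable, Dict, List


def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 when the answers match, ignoring case and outer whitespace."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_evals(
    dataset: List[Dict[str, str]],
    generate: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of generate() over the curated dataset."""
    if not dataset:
        raise ValueError("eval dataset is empty")
    scores = [scorer(case["expected"], generate(case["input"])) for case in dataset]
    return sum(scores) / len(scores)


def release_allowed(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Block the release when the candidate regresses past the tolerance."""
    return candidate >= baseline - tolerance
```

In CI, `release_allowed` would compare the score of the proposed change against the score of the currently deployed version and fail the build on a regression.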

Related: RAG evaluation · Hallucination rate · Prompt versioning

RAG evaluation aka RAGOps

RAG evaluation separates two questions: did the system retrieve the right context, and did the model use it well? Measuring retrieval precision and recall against a labelled set catches the most common RAG failure - wrong or missing context - before you blame the model. See RAGOps.
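
For the retrieval half of that question, precision and recall for a single query can be computed as below. This is a sketch assuming documents are identified by string IDs and that the labelled set lists the relevant IDs per query; the function name is illustrative.

```python
from typing import Iterable, Tuple


def retrieval_precision_recall(
    retrieved_ids: Iterable[str], relevant_ids: Iterable[str]
) -> Tuple[float, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Low recall points at missing context, low precision at wrong context; either can be diagnosed before looking at the model's answer.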

Related: Evals · Embeddings · Vector database

Model routing

Model routing sends easy traffic to smaller, cheaper models and reserves larger models for the requests that genuinely need them. Done well it cuts cost substantially with little quality loss; done blindly it quietly degrades output - which is why routing changes belong in your evals. See cost control.
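
A deliberately naive router, to make the idea concrete. The model names, the word-count threshold and the keyword list are all hypothetical; a real router might use a classifier or confidence signal instead, and any such rule should be checked against evals as noted above.

```python
SMALL_MODEL = "small-model"  # hypothetical model names
LARGE_MODEL = "large-model"

# Assumed signals that a request needs the larger model.
HARD_MARKERS = ("prove", "refactor", "step by step", "analyze")


def route(prompt: str, max_easy_words: int = 40) -> str:
    """Send short prompts without hard markers to the small model."""
    text = prompt.lower()
    looks_hard = len(text.split()) > max_easy_words or any(
        marker in text for marker in HARD_MARKERS
    )
    return LARGE_MODEL if looks_hard else SMALL_MODEL
```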

Related: Token cost · LLMOps

Token cost aka Token spend

Token cost is what you pay per request: the input token count times the model’s per-million-token input rate, plus the output token count times its per-million-token output rate. Because it scales with traffic and context length, it can compound quietly - model it with the Cost Calculator and put alerts on it before the budget breaks.
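
The arithmetic can be sketched as below. The rates in the usage line (3.00 and 15.00 per million tokens) are made-up numbers for illustration, not any provider's actual pricing.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Cost of one request, in the same currency as the rates."""
    total = (
        input_tokens * input_rate_per_million
        + output_tokens * output_rate_per_million
    )
    return total / 1_000_000


# Hypothetical rates: 3.00 per million input tokens, 15.00 per million output tokens.
cost = request_cost(2_000, 500, 3.00, 15.00)
print(cost)  # 0.0135
```

Multiplying the per-request figure by expected daily traffic gives a first estimate to set a budget alert against.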

Related: Model routing · Context window

Hallucination rate aka Faithfulness

Hallucination rate measures how often the model states something unsupported by the facts or the provided context. Its complement, faithfulness, is a core eval metric - especially for RAG, where the test is whether the answer is grounded in what was retrieved. Track it as a first-class number, not an anecdote.
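
One simple way to turn this into a number, assuming each answer in an eval run has already been labelled as supported or not (by a human reviewer or an automated judge). Treating faithfulness as one minus the hallucination rate is an assumption of this sketch.

```python
from typing import Sequence


def hallucination_rate(supported: Sequence[bool]) -> float:
    """Fraction of answers labelled as containing an unsupported claim."""
    if not supported:
        raise ValueError("no labelled answers")
    unsupported = sum(1 for ok in supported if not ok)
    return unsupported / len(supported)


def faithfulness(supported: Sequence[bool]) -> float:
    """Share of answers that were fully supported."""
    return 1.0 - hallucination_rate(supported)
```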

Related: Evals · RAG evaluation

Prompt injection

Prompt injection exploits the fact that an LLM follows instructions it finds in its input - including instructions an attacker plants in a document, web page or user message. It is the signature LLM security threat. Defend by treating all input as untrusted, scoping tool access, and validating outputs. See LLM security.
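
Scoping tool access can be as simple as an allowlist checked outside the model, so that an injected instruction cannot widen what the application will execute. The tool names and argument sets below are hypothetical, and this sketch covers only one of the defences listed above.

```python
from typing import Dict, Set

# Hypothetical allowlist: tool name -> argument names the tool accepts.
ALLOWED_TOOLS: Dict[str, Set[str]] = {
    "search_docs": {"query"},
    "get_order_status": {"order_id"},
}


def validate_tool_call(tool_name: str, args: dict) -> dict:
    """Reject tool calls the model requests that fall outside the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {tool_name}")
    unexpected = set(args) - ALLOWED_TOOLS[tool_name]
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return args
```

The check runs on the model's requested action, not on the prompt, so it holds even when the injected text itself goes undetected.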

Related: Guardrails · Human-in-the-loop

Human-in-the-loop aka HITL

Human-in-the-loop (HITL) makes human review a deliberate step in the system rather than an afterthought - an approval gate for high-risk actions, or a correction loop whose feedback improves future behaviour. It is a core governance control and a source of eval labels.
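
An approval gate might look like the sketch below. The set of high-risk actions and the `run` and `approve` callables are assumptions; `approve` stands in for whatever review UI or queue a team actually uses.

```python
from typing import Any, Callable, Dict, List

HIGH_RISK_ACTIONS = {"issue_refund", "delete_account"}  # hypothetical


def execute_with_gate(
    action: str,
    params: Dict[str, Any],
    run: Callable[[str, Dict[str, Any]], Any],
    approve: Callable[[str, Dict[str, Any]], bool],
    review_log: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Run low-risk actions directly; ask a human before high-risk ones."""
    if action in HIGH_RISK_ACTIONS:
        approved = approve(action, params)
        # Each decision is kept so it can later serve as an eval label.
        review_log.append({"action": action, "params": params, "approved": approved})
        if not approved:
            return {"status": "rejected"}
    return {"status": "done", "result": run(action, params)}
```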

Related: Guardrails · LLMOps

Context window

The context window is the token budget for a single request - everything the model can “see” at once: system prompt, history, retrieved context and the user message. Larger windows enable more context but raise token cost and can dilute attention, which is why RAG retrieves the relevant slice rather than dumping everything in.
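
The budgeting can be sketched as follows. Counting one token per whitespace-separated word is a crude stand-in for a real tokenizer, and reserving part of the window for the model's output is an assumption of this sketch; the function names are illustrative.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per word. Use the model's tokenizer in practice.
    return len(text.split())


def fit_chunks(
    system_prompt: str,
    history: List[str],
    user_message: str,
    chunks: List[str],
    window: int,
    reserve_for_output: int,
) -> List[str]:
    """Keep the highest-ranked retrieved chunks that still fit the window.

    `chunks` is assumed to be sorted from most to least relevant.
    """
    used = sum(estimate_tokens(t) for t in [system_prompt, user_message, *history])
    budget = window - reserve_for_output - used
    selected = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return selected
```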

Related: Token cost · Embeddings

Embeddings

Embeddings turn text into vectors so that semantically similar pieces sit near each other in vector space. They are the backbone of retrieval: a query is embedded and matched against stored embeddings to find relevant context. The choice of embedding model is an evaluable decision, not a default - see RAGOps.
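
The matching step can be illustrated with cosine similarity over a small in-memory store. The three-dimensional vectors in the usage example are hand-written toy values, not output from a real embedding model, and cosine similarity is one common choice of distance rather than the only one.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: Sequence[float], store: Dict[str, Sequence[float]], k: int = 1) -> List[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(
        store, key=lambda doc_id: cosine_similarity(query, store[doc_id]), reverse=True
    )
    return ranked[:k]


# Toy vectors for illustration only.
store = {
    "refunds": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "returns": [0.8, 0.3, 0.1],
}
print(nearest([1.0, 0.2, 0.0], store, k=2))  # ['refunds', 'returns']
```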

Related: Vector database · RAG evaluation

Vector database aka Vector store

A vector database stores embeddings and retrieves the nearest matches to a query vector quickly, often alongside keyword search (hybrid retrieval). It is the retrieval engine of a RAG system. Options including Pinecone, Weaviate, Qdrant and Redis are listed under RAG / vector.
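
Hybrid retrieval needs a way to merge the vector ranking with the keyword ranking. One common choice, shown here as an illustration rather than what any particular product does, is reciprocal rank fusion: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is a conventional default.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical nearest-neighbour results
keyword_hits = ["b", "d", "a"]  # hypothetical keyword search results
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists rise to the top, without having to put vector distances and keyword scores on the same scale.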

Related: Embeddings · RAG evaluation

Agent tracing

Agent tracing records the full path of a request through an agent: each model call, tool invocation, retrieval and decision, with inputs, outputs and timing. It is what makes a multi-step agent debuggable - without it, a wrong final answer gives no clue which step failed. It is a core part of LLM observability.
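
A minimal sketch of a trace, assuming a hypothetical `Trace` class: each step runs inside a span that records its kind, inputs, output, error and duration. Real tracing systems add nesting and export to a backend; this version only keeps a flat list in memory.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class Trace:
    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self.spans: List[Dict[str, Any]] = []

    @contextmanager
    def span(self, kind: str, name: str, inputs: Any) -> Iterator[Dict[str, Any]]:
        """Record one step (model call, tool call, retrieval, decision)."""
        record = {"kind": kind, "name": name, "inputs": inputs,
                  "output": None, "error": None}
        start = time.perf_counter()
        try:
            yield record  # the caller fills in record["output"]
        except Exception as exc:
            record["error"] = repr(exc)  # note which step failed, then re-raise
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)


trace = Trace("req-1")
with trace.span("retrieval", "search_docs", {"query": "refund policy"}) as step:
    step["output"] = ["doc-7"]
```

When the final answer is wrong, walking `trace.spans` in order shows which step received bad inputs or produced a bad output.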

Related: LLM observability · Evals