LLM Observability: What to monitor in production

A production guide to LLM observability - the signals that matter, how to instrument with OpenTelemetry or Langfuse, what breaks without it, and a minimal-vs-mature path.

Observability is the difference between knowing a request was slow and knowing which retrieval step, tool call or model hop caused it. It’s the data layer the rest of LLMOps depends on - you can’t evaluate quality, control cost or investigate an incident without it.

The two questions

Prototype: “Did it work when I tried it?” Production: “When a request misbehaves at 2am, can I see exactly why - and is p95 latency holding under real load?”

In the demo you watch one request succeed. In production you need to reconstruct any request after the fact, and watch aggregate health across all of them.

What breaks without it

Every issue is a guess. A user reports a bad answer; with no record of what the model saw, you’re debugging blind.
Multi-step systems are opaque. An agent returns the wrong result and you can’t tell which of six steps went wrong.
Slow creep goes unseen. p95 latency drifts up with traffic; averages hide it until users churn.
Cost is a quarterly surprise. Without per-request token accounting you learn about a runaway feature from the invoice.
Evals have no raw material. The best eval cases are real production failures - which you can only mine if you logged them.
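
To make the "averages hide it" point concrete, here is a small sketch with made-up latencies (illustrative numbers, not measurements). It assumes a nearest-rank p95 and a hypothetical week in which 6% of requests start hitting a slow path:

```python
import statistics


def p95(values):
    """Nearest-rank 95th percentile: the sample at rank ceil(0.95 * n)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[max(rank, 1) - 1]


# Illustrative latencies in milliseconds (assumed data).
before = [800] * 100              # every request takes 800 ms
after = [800] * 94 + [5000] * 6   # 6% of requests now hit a slow path

mean_before, mean_after = statistics.fmean(before), statistics.fmean(after)
p95_before, p95_after = p95(before), p95(after)

print(f"mean: {mean_before:.0f} ms -> {mean_after:.0f} ms")
print(f"p95:  {p95_before} ms -> {p95_after} ms")
```

The mean moves from 800 ms to 1052 ms, which is easy to dismiss as noise. The p95 moves from 800 ms to 5000 ms, which is what one user in twenty actually experiences.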

The signals that matter

Request / response logs - every input and output, redacted as needed. You cannot debug what you did not record.
Traces - for chains and agents, a tree of spans following one request across every retrieval, tool call and model hop. See what to log in an LLM trace for the field-by-field schema.
Latency - p50 and p95, ideally split by step (retrieval vs generation), not just an average.
Token usage - per request and per feature, feeding straight into cost control.
Failure modes - timeouts, refusals, malformed output, tool errors, empty/low-score retrieval.
Outcome signals - user feedback (thumbs, escalations) tied back to the trace.
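
One way to picture these signals together is a single record per request. The sketch below is an assumption about shape, not a standard schema: the class names, field names and error labels are all hypothetical, and real traces usually nest spans rather than keeping a flat list.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str                    # step name, e.g. "retrieval" or "generation"
    duration_ms: float
    error: Optional[str] = None  # e.g. "timeout", "refusal", "malformed_output"


@dataclass
class RequestRecord:
    trace_id: str
    feature: str                 # which product feature issued the request
    input_tokens: int
    output_tokens: int
    spans: List[Span] = field(default_factory=list)
    feedback: Optional[str] = None  # e.g. "thumbs_down", "escalated"


def tokens_by_feature(records):
    """Sum input and output tokens per feature, ready for cost accounting."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for r in records:
        totals[r.feature]["input"] += r.input_tokens
        totals[r.feature]["output"] += r.output_tokens
    return dict(totals)


def durations_by_step(records):
    """Group span durations by step name so percentiles can be split by step."""
    by_step = defaultdict(list)
    for r in records:
        for s in r.spans:
            by_step[s.name].append(s.duration_ms)
    return dict(by_step)


def failure_counts(records):
    """Count spans per failure mode, ignoring spans that succeeded."""
    return dict(Counter(s.error for r in records for s in r.spans if s.error))


# Illustrative records (assumed data).
records = [
    RequestRecord("t1", "search", 1240, 310,
                  [Span("retrieval", 120.0), Span("generation", 900.0)],
                  feedback="thumbs_up"),
    RequestRecord("t2", "search", 800, 0,
                  [Span("retrieval", 95.0),
                   Span("generation", 30000.0, error="timeout")],
                  feedback="escalated"),
    RequestRecord("t3", "summarize", 5000, 600,
                  [Span("generation", 2100.0, error="malformed_output")]),
]

print(tokens_by_feature(records))
print(durations_by_step(records))
print(failure_counts(records))
```

Keeping the feature name and feedback on the same record as the spans is the design choice that matters here: it lets a cost spike or a thumbs-down be traced back to one specific request.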

Instrument it

The cheapest path is a drop-in SDK. Langfuse, for example, captures latency, tokens and cost on every call once you swap one import and set its API keys, with no other code changes:

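# pip install langfuse openai  -  langfuse.openai is a drop-in replacement for the OpenAI client
# Assumes OPENAI_API_KEY, LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set in the environment.
from langfuse.openai import OpenAI

question = "How do I rotate my API key?"  # placeholder input

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": question}],
)
print(resp.choices[0].message.content)
# Latency, token usage and cost are recorded on the Langfuse trace for this call.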

If you want vendor-neutral telemetry that lives in your existing stack, follow the OpenTelemetry GenAI semantic conventions, a standard set of gen_ai.* span attributes:

span: chat gpt-4.1-mini
  gen_ai.system                  = "openai"
  gen_ai.request.model           = "gpt-4.1-mini"
  gen_ai.usage.input_tokens      = 1240
  gen_ai.usage.output_tokens     = 310
  gen_ai.response.finish_reasons = ["stop"]

Standardising on these attributes means your traces, metrics and dashboards aren’t locked to one vendor.
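
As a rough sketch of how those attributes might be filled in, the helper below maps a chat-completion response onto the gen_ai.* names shown above. The function names are hypothetical, and the response is assumed to be a plain dict shaped like the OpenAI chat completions JSON (usage.prompt_tokens, usage.completion_tokens, choices[].finish_reason). It uses only the standard library; with the OpenTelemetry Python API, the resulting dict could be passed to span.set_attributes(...).

```python
def genai_span_name(operation: str, request_model: str) -> str:
    """Span name in the '<operation> <model>' form, e.g. 'chat gpt-4.1-mini'."""
    return f"{operation} {request_model}"


def genai_span_attributes(provider: str, request_model: str, response: dict) -> dict:
    """Map an OpenAI-style response dict onto gen_ai.* span attributes."""
    usage = response.get("usage") or {}
    finish_reasons = [
        choice["finish_reason"]
        for choice in response.get("choices", [])
        if choice.get("finish_reason") is not None
    ]
    attributes = {
        "gen_ai.system": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": usage.get("prompt_tokens"),
        "gen_ai.usage.output_tokens": usage.get("completion_tokens"),
        "gen_ai.response.finish_reasons": finish_reasons or None,
    }
    # Drop missing values rather than recording None on the span.
    return {key: value for key, value in attributes.items() if value is not None}


# Illustrative response (assumed data, mirroring the span shown above).
response = {
    "usage": {"prompt_tokens": 1240, "completion_tokens": 310},
    "choices": [{"finish_reason": "stop"}],
}

span_name = genai_span_name("chat", "gpt-4.1-mini")
attributes = genai_span_attributes("openai", "gpt-4.1-mini", response)
print(span_name, attributes)
```

Keeping the mapping in one function means a provider change touches a single place, while dashboards keep querying the same attribute names.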

From signal to action

Signals you don’t act on are decoration:

Alert on the things that predict user pain first - error-rate spikes and p95 latency regressions.
Tie traces to evals - pipe real failures into your eval set so the same failure becomes a permanent test.
Make it replayable - being able to re-run a production request with its exact inputs turns a vague report into a five-minute fix.
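
A minimal sketch of the first two actions, under stated assumptions: the thresholds (2% error rate, 25% p95 regression) are placeholders to tune, and the WindowStats fields and trace dict keys are hypothetical names rather than any tool's schema. The error rate is checked against a fixed ceiling, and p95 is checked against a baseline window.

```python
import copy
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    requests: int
    errors: int
    p95_ms: float


def check_alerts(baseline: WindowStats, current: WindowStats,
                 max_error_rate: float = 0.02,
                 max_p95_ratio: float = 1.25) -> List[str]:
    """Return the names of the alerts that fire for the current window."""
    fired = []
    error_rate = current.errors / current.requests if current.requests else 0.0
    if error_rate > max_error_rate:
        fired.append("error_rate")
    if baseline.p95_ms > 0 and current.p95_ms > baseline.p95_ms * max_p95_ratio:
        fired.append("p95_latency")
    return fired


def to_eval_case(trace: dict) -> dict:
    """Freeze a failed production request into a replayable eval case."""
    return {
        "id": trace["trace_id"],
        "model": trace["model"],
        "messages": copy.deepcopy(trace["messages"]),  # exact inputs, for replay
        "bad_output": trace["output"],                 # what was actually returned
        "failure": trace["failure"],
    }


# Illustrative data (assumed).
baseline = WindowStats(requests=1000, errors=5, p95_ms=2000.0)
current = WindowStats(requests=1000, errors=40, p95_ms=2100.0)
print(check_alerts(baseline, current))

trace = {
    "trace_id": "t2",
    "model": "gpt-4.1-mini",
    "messages": [{"role": "user", "content": "Cancel my order"}],
    "output": "",
    "failure": "timeout",
}
print(to_eval_case(trace))
```

Because the eval case stores the exact messages and model, re-running it is the replay step: the same request can be sent again after a fix to confirm the failure is gone.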

Minimal vs mature

Aspect	Minimal	Production-grade
Capture	Request/response logs	Full span traces (OTel / SDK)
Latency	Average	p50/p95, split by step
Cost	Total spend	Per request, per feature
Alerts	None	Error-rate + latency regressions
Debugging	Read logs	Search + replay a single trace
Feedback loop	None	Failures flow into the eval set

Tools and where to go next

The observability category of the directory covers Langfuse, LangSmith, Helicone and Arize Phoenix - filterable by open-source, self-hostable and OpenTelemetry support. Once you can see your system, close the loop with evaluation and pressure-test the rest with the Production Checklist.
