LLM Observability: What to monitor in production
A production guide to LLM observability - the signals that matter, how to instrument with OpenTelemetry or Langfuse, what breaks without it, and a minimal-vs-mature path.
Observability is the difference between knowing a request was slow and knowing which retrieval step, tool call or model hop caused it. It’s the data layer the rest of LLMOps depends on - you can’t evaluate, control cost or run an incident without it.
The two questions
- Prototype: “Did it work when I tried it?”
- Production: “When a request misbehaves at 2am, can I see exactly why - and is p95 latency holding under real load?”
In the demo you watch one request succeed. In production you need to reconstruct any request after the fact, and watch aggregate health across all of them.
What breaks without it
- Every issue is a guess. A user reports a bad answer; with no record of what the model saw, you’re debugging blind.
- Multi-step systems are opaque. An agent returns the wrong result and you can’t tell which of six steps went wrong.
- Slow creep goes unseen. p95 latency drifts up with traffic; averages hide it until users churn.
- Cost is a quarterly surprise. Without per-request token accounting you learn about a runaway feature from the invoice.
- Evals have no raw material. The best eval cases are real production failures - which you can only mine if you logged them.
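To make the "slow creep" point concrete, here is a minimal sketch comparing the mean with p95 on two made-up latency samples. The numbers and the `p95` helper are illustrative assumptions, not measurements from a real system.

```python
import statistics


def p95(samples):
    """95th percentile of a list of latencies, in the same unit as the input."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


# Hypothetical latencies in seconds.
# Week 1: every request is fast. Week 2: one request in ten is slow.
week1 = [0.8] * 100
week2 = [0.8] * 90 + [6.0] * 10

for label, samples in (("week 1", week1), ("week 2", week2)):
    print(f"{label}: mean={statistics.mean(samples):.2f}s p95={p95(samples):.2f}s")
```

In this sketch the mean moves by about half a second while p95 jumps from the fast path to the slow one, which is why the tail is usually the better number to chart and alert on.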
The signals that matter
- Request / response logs - every input and output, redacted as needed. You cannot debug what you did not record.
- Traces - for chains and agents, a tree of spans following one request across every retrieval, tool call and model hop. See what to log in an LLM trace for the field-by-field schema.
- Latency - p50 and p95, ideally split by step (retrieval vs generation), not just an average.
- Token usage - per request and per feature, feeding straight into cost control.
- Failure modes - timeouts, refusals, malformed output, tool errors, empty/low-score retrieval.
- Outcome signals - user feedback (thumbs, escalations) tied back to the trace.
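The signals above can hang off a single tree of spans. The sketch below is one possible shape, not a standard schema: the `Span` class, its field names and the `summarize` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One step of a request; hypothetical fields, not a standard schema."""
    name: str                      # e.g. "retrieval", "generation"
    duration_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    error: Optional[str] = None    # e.g. "timeout", "malformed_output"
    children: List["Span"] = field(default_factory=list)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def summarize(root):
    """Answer the basic debugging questions for one request."""
    spans = list(walk(root))
    leaves = [s for s in spans if not s.children]
    slowest = max(leaves, key=lambda s: s.duration_ms)
    return {
        "slowest_step": slowest.name,
        "total_tokens": sum(s.input_tokens + s.output_tokens for s in spans),
        "failed_steps": [s.name for s in spans if s.error],
    }


# A made-up request: slow retrieval, one tool timeout, one model call.
request = Span("request", 2300, children=[
    Span("retrieval", 1800),
    Span("tool:search", 150, error="timeout"),
    Span("generation", 350, input_tokens=1240, output_tokens=310),
])
summary = summarize(request)
print(summary)
```

With a structure like this, "the request was slow" becomes "retrieval took most of the time and a tool call timed out", and the token totals are available for cost accounting.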
Instrument it
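The cheapest path is a drop-in SDK. Langfuse, for example, wraps the OpenAI client and captures latency, tokens and cost on every call; once the Langfuse API keys are set in the environment, the only code change is the import: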
    # pip install langfuse - drop-in replacement for the OpenAI client
    # Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and OPENAI_API_KEY are set.
    from langfuse.openai import OpenAI

    client = OpenAI()
    question = "What is LLM observability?"  # placeholder user input
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": question}],
    )
    # Latency, token usage and cost now land in your trace automatically.
If you want vendor-neutral telemetry that lives in your existing stack, follow the OpenTelemetry GenAI semantic conventions (linked below) - a standard set of gen_ai.* span attributes:
    span: chat gpt-4.1-mini
      gen_ai.system = "openai"
      gen_ai.request.model = "gpt-4.1-mini"
      gen_ai.usage.input_tokens = 1240
      gen_ai.usage.output_tokens = 310
      gen_ai.response.finish_reasons = ["stop"]
Standardising on these attributes means your traces, metrics and dashboards aren’t locked to one vendor.
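As a rough sketch of how those attributes might be attached in Python: the helper names `genai_span_attributes` and `span_name` and the tracer name `"llm-app"` are assumptions for illustration. The OpenTelemetry import is optional here, so the snippet still runs where `opentelemetry-api` is not installed.

```python
def genai_span_attributes(provider, model, input_tokens, output_tokens, finish_reasons):
    """Build the gen_ai.* attributes shown above as a plain dict."""
    return {
        "gen_ai.system": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


def span_name(operation, model):
    """Span name in the '<operation> <model>' form used above."""
    return f"{operation} {model}"


attrs = genai_span_attributes("openai", "gpt-4.1-mini", 1240, 310, ["stop"])

try:
    from opentelemetry import trace  # pip install opentelemetry-api
except ImportError:
    trace = None  # keep the sketch runnable without OpenTelemetry

if trace is not None:
    tracer = trace.get_tracer("llm-app")  # hypothetical instrumentation name
    with tracer.start_as_current_span(span_name("chat", "gpt-4.1-mini")) as span:
        # In real code the model call happens here and the token counts
        # come from the provider's response rather than literals.
        span.set_attributes(attrs)
```

Without an SDK and exporter configured, the OpenTelemetry API produces no-op spans, so this only emits data once your existing tracing setup is in place.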
From signal to action
Signals you don’t act on are decoration:
- Alert first on the things that predict user pain: error-rate spikes and p95 latency regressions.
- Tie traces to evals - pipe real failures into your eval set so the same failure becomes a permanent test.
- Make it replayable - being able to re-run a production request with its exact inputs turns a vague report into a five-minute fix.
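The alerting bullet could be sketched as a check over a recent window of requests. The function name, the `(latency, ok)` tuple shape and both thresholds below are assumptions to tune for your own traffic, not recommended values.

```python
import statistics


def check_alerts(window, baseline_p95_s, max_error_rate=0.05, max_p95_ratio=1.5):
    """Return alert messages for a window of (latency_seconds, ok) tuples.

    The thresholds are illustrative defaults, not recommendations.
    """
    if len(window) < 2:
        return []  # not enough data to compute a percentile

    error_rate = sum(1 for _, ok in window if not ok) / len(window)
    latencies = [latency for latency, _ in window]
    p95 = statistics.quantiles(latencies, n=100, method="inclusive")[94]

    alerts = []
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} above {max_error_rate:.0%}")
    if p95 > baseline_p95_s * max_p95_ratio:
        alerts.append(f"p95 {p95:.2f}s above {max_p95_ratio}x baseline {baseline_p95_s:.2f}s")
    return alerts


# Made-up windows: one healthy, one with slow failing requests.
healthy = [(0.8, True)] * 100
degraded = [(0.8, True)] * 85 + [(6.0, False)] * 15

print(check_alerts(healthy, baseline_p95_s=1.0))
print(check_alerts(degraded, baseline_p95_s=1.0))
```

Comparing p95 against a baseline rather than a fixed number keeps the check meaningful as normal latency shifts between features and models.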
Minimal vs mature
| Aspect | Minimal | Production-grade |
|---|---|---|
| Capture | Request/response logs | Full span traces (OTel / SDK) |
| Latency | Average | p50/p95, split by step |
| Cost | Total spend | Per request, per feature |
| Alerts | None | Error-rate + latency regressions |
| Debugging | Read logs | Search + replay a single trace |
| Feedback loop | - | Failures flow into the eval set |
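The last row, failures flowing into the eval set, can start as something very small. The sketch below appends a failed trace to a JSONL eval file; the trace keys (`trace_id`, `input`, `output`, `failure`) and the eval-case fields are hypothetical names, not a fixed format.

```python
import io
import json


def to_eval_case(trace):
    """Turn a logged failure into an eval case (hypothetical field names)."""
    return {
        "id": trace["trace_id"],
        "input": trace["input"],          # exact inputs, so the case is replayable
        "bad_output": trace["output"],    # what the system actually returned
        "failure": trace["failure"],      # e.g. "malformed_output", "user_thumbs_down"
        "expected": None,                 # filled in later by a reviewer
    }


def append_eval_case(trace, fileobj):
    """Append one eval case as a JSON line to an open text file."""
    fileobj.write(json.dumps(to_eval_case(trace)) + "\n")


# Usage with an in-memory buffer standing in for an eval file.
failed_trace = {
    "trace_id": "trace-001",
    "input": [{"role": "user", "content": "Cancel my order"}],
    "output": "Your order has shipped.",
    "failure": "user_thumbs_down",
}
buffer = io.StringIO()
append_eval_case(failed_trace, buffer)
cases = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because the case stores the exact inputs, the same record can serve both as a replay target and as a permanent regression test.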
Tools and where to go next
The observability category of the directory covers Langfuse, LangSmith, Helicone and Arize Phoenix - filterable by open-source, self-hostable and OpenTelemetry support. Once you can see your system, close the loop with evaluation and pressure-test the rest with the Production Checklist.