What to log in an LLM trace

A field-by-field guide to what belongs in a production LLM trace - request IDs, prompt versions, retrieval, tokens, latency, cost and outcome - plus what to redact.

When something goes wrong in production, your trace is the only witness. The difference between “the bot gave a bad answer” and a five-minute diagnosis is whether you logged the right fields. Here’s what belongs in an LLM trace.

The minimum viable trace

Even before you adopt full distributed tracing, log one structured record per request:

{
  "request_id": "req_123",
  "architecture": "rag-chatbot",
  "prompt_version": "rag-answer-v4",
  "retrieval_query": "refund window for damaged goods",
  "retrieved_documents": 5,
  "top_score": 0.83,
  "grounded_answer": true,
  "input_tokens": 1240,
  "output_tokens": 310,
  "latency_ms": 1840,
  "cost_usd": 0.0042
}

Each field earns its place by answering a question you’ll have during an incident.
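If no tracing SDK is in place yet, a record like the one above can be assembled by hand. The following is a minimal sketch, not a definitive implementation: `fake_model` and `traced_call` are hypothetical names, and the stub stands in for whatever model client you actually use.

```python
import json
import time
import uuid


def fake_model(prompt: str) -> dict:
    """Hypothetical stand-in for a real model call; returns text plus token counts."""
    return {
        "text": "Refunds are accepted within 30 days.",
        "input_tokens": 12,
        "output_tokens": 8,
    }


def traced_call(question: str, prompt_version: str, model_fn=fake_model) -> dict:
    """Call the model and return one structured trace record for the request."""
    request_id = f"req_{uuid.uuid4().hex[:8]}"
    started = time.perf_counter()
    result = model_fn(question)
    latency_ms = round((time.perf_counter() - started) * 1000)
    return {
        "request_id": request_id,
        "architecture": "rag-chatbot",
        "prompt_version": prompt_version,
        "input_tokens": result["input_tokens"],
        "output_tokens": result["output_tokens"],
        "latency_ms": latency_ms,
    }


record = traced_call("refund window for damaged goods", "rag-answer-v4")
# One JSON object per line is easy to ship to a log pipeline and search later.
print(json.dumps(record))
```

Generating the `request_id` before the call means it can also be attached to error logs if the call fails.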

Capture it in code

With a tracing SDK this is nearly free. Langfuse’s drop-in client, for instance, records latency, tokens and cost on every call automatically:

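# pip install langfuse openai
# Assumes credentials are set in the environment
# (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, OPENAI_API_KEY).
from langfuse.openai import OpenAI  # drop-in replacement for the OpenAI client

question = "What is the refund window for damaged goods?"

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": question}],
)
print(resp.choices[0].message.content)
# Latency, token usage and cost now land in your trace, with no extra code.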

Field by field - and why

request_id - correlate logs, traces and the user report. Non-negotiable.
prompt_version - which prompt produced this? Without it you can’t tie a quality change to a prompt edit.
model + parameters - model name, temperature, max tokens. Provider updates and config drift are real failure modes.
Inputs and outputs - the actual prompt sent and response received, redacted as needed (see below). You cannot debug what you didn’t capture.
Retrieval - the query, number and IDs of retrieved docs, and top similarity score. In a RAG system, many “wrong answer” bugs turn out to be retrieval bugs.
grounded_answer - did an output check confirm the answer was supported by context? Feeds your hallucination rate.
Tokens + cost - input_tokens, output_tokens, cost_usd. This is your cost control data, per request and per feature.
latency_ms - total, and ideally per step (retrieval vs generation).
Outcome - error codes, refusals, retries, and any user feedback signal.
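If your SDK does not fill in cost_usd for you, it can be derived from the token counts. A minimal sketch, assuming per-million-token pricing; the model name and the prices below are placeholders, not real provider rates.

```python
# Hypothetical prices in USD per one million tokens.
# Substitute your provider's actual rates; these numbers are placeholders.
PRICES_PER_MILLION = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return round(total / 1_000_000, 6)


print(cost_usd("example-model", 1240, 310))
```

Storing the model name alongside the token counts lets you recompute cost later if prices change.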

Trace multi-step chains, don’t just log

For agents and chains, a flat log isn’t enough - you need a trace: a tree of spans where each tool call, retrieval and model hop is its own timed span with inputs and outputs. When an agent gives a wrong final answer, the trace can show, for example, that it called the wrong tool at step 2, so you fix the tool-selection prompt instead of guessing.

A simple span shape:

{
  "span_id": "sp_07",
  "parent_id": "sp_03",
  "type": "tool_call",
  "name": "lookup_order",
  "input": { "order_id": "..." },
  "output": { "status": "shipped" },
  "latency_ms": 120,
  "error": null
}
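One way to produce spans of this shape is a small context manager that times a block and links it to its parent. This is an illustrative sketch only: `span`, `SPANS` and the module-level stack are hypothetical names, the order ID is made up, and the stack is not safe for threads or async code. A real tracing SDK handles context propagation for you.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []   # finished span records for one trace (in memory, for illustration)
_stack = []  # ids of currently open spans; the last one is the parent of a new span


@contextmanager
def span(span_type: str, name: str, input=None):
    """Time a block of work and record it as a span linked to its parent."""
    record = {
        "span_id": f"sp_{uuid.uuid4().hex[:6]}",
        "parent_id": _stack[-1] if _stack else None,
        "type": span_type,
        "name": name,
        "input": input,
        "output": None,
        "latency_ms": None,
        "error": None,
    }
    _stack.append(record["span_id"])
    started = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)  # keep the failure on the span, then re-raise
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - started) * 1000)
        _stack.pop()
        SPANS.append(record)


with span("agent", "answer_question") as root:
    with span("tool_call", "lookup_order", {"order_id": "A1"}) as child:
        child["output"] = {"status": "shipped"}
    root["output"] = "Your order has shipped."
```

Inner spans finish first, so they are appended to `SPANS` before their parents; the `parent_id` links are what let you rebuild the tree.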

What to redact

Logging everything is a security liability. Before storage:

Redact or tokenise PII in prompts, context and outputs per your policy.
Keep secrets and keys out of the trace entirely.
Scope who can read traces - they often contain sensitive customer data.
Set a retention policy; don’t keep raw inputs forever by default.
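The first two points above can be sketched as a redaction pass that runs before a record is stored. The patterns and key names here are assumptions for illustration. The phone pattern is deliberately loose and will also catch other long digit runs such as timestamps, and the function only handles top-level string fields. Treat it as a starting point and prefer a vetted PII library for production use.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Loose pattern: optional "+", then digits mixed with spaces, dots, dashes or brackets.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical field names that should never reach trace storage.
SECRET_KEYS = {"api_key", "authorization", "password"}


def redact_text(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_record(record: dict) -> dict:
    """Drop secret fields entirely and redact PII in top-level string fields."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SECRET_KEYS:
            continue  # secrets are removed, not masked
        clean[key] = redact_text(value) if isinstance(value, str) else value
    return clean


print(redact_text("Mail jane.doe@example.com or call +44 20 7946 0958"))
```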

Make it queryable

Logs you can’t search are decoration. Ship traces somewhere you can filter by fields such as prompt_version, latency and cost, plus any tags you add yourself (a risk_category, for example) - and alert on regressions. Observability tools such as Langfuse, LangSmith, Helicone and Arize Phoenix do this out of the box, and several speak OpenTelemetry if you want to keep it in your existing stack.
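Before adopting a dedicated tool, the same idea can be tried on plain records. A small sketch, assuming traces are available as a list of dicts; the sample rows are made up for illustration, and `summarize` and `latency_regressed` are hypothetical helpers.

```python
from statistics import mean

# Made-up sample records for illustration only.
records = [
    {"prompt_version": "rag-answer-v3", "latency_ms": 1500, "cost_usd": 0.0030},
    {"prompt_version": "rag-answer-v4", "latency_ms": 1840, "cost_usd": 0.0042},
    {"prompt_version": "rag-answer-v4", "latency_ms": 2600, "cost_usd": 0.0051},
]


def summarize(records, prompt_version):
    """Aggregate latency and cost for one prompt version."""
    rows = [r for r in records if r["prompt_version"] == prompt_version]
    if not rows:
        return None
    return {
        "count": len(rows),
        "max_latency_ms": max(r["latency_ms"] for r in rows),
        "mean_cost_usd": round(mean(r["cost_usd"] for r in rows), 6),
    }


def latency_regressed(summary, budget_ms):
    """Return True when the slowest request exceeded the latency budget."""
    return summary is not None and summary["max_latency_ms"] > budget_ms


v4 = summarize(records, "rag-answer-v4")
print(v4, latency_regressed(v4, budget_ms=2000))
```

In practice the threshold check would run on a schedule and page someone, rather than print.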

The test

Pick a recent production complaint. Can you pull up exactly what the model saw, which prompt version produced it, what was retrieved, and what it cost - in under a minute? If yes, your tracing is doing its job. If not, this list is your backlog. The RAG chatbot reference architecture shows where the trace tap sits in a full system.
