
LLM incident response template

A ready-to-adapt incident response template for LLM applications - severity levels, the first 15 minutes, mitigation levers unique to LLMs, and a post-mortem structure.

June 15, 2026 · 7 min read · governance · incident-response · template

LLM apps fail in ways classical services don’t: a prompt edit tanks quality, a provider ships a model update overnight, an injection makes the agent misbehave, or a traffic spike triples the bill. Your normal incident process mostly applies - but the levers are different. Here’s a template to adapt.

Severity levels

Define these before you need them:

| Sev | Example | Response |
|------|---------|----------|
| SEV1 | Harmful output, data leak, or agent taking wrong actions at scale | All-hands, immediate mitigation, comms |
| SEV2 | Quality regression affecting many users; cost runaway | On-call owns, mitigate within the hour |
| SEV3 | Degraded latency, elevated error/refusal rate | Triage in business hours |

The LLM-specific additions to your usual list: harmful/unsafe output, prompt-injection exploitation, hallucination spike, and cost runaway.
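If you want the severity definitions to be machine-checkable (for example, to route alerts), they can be encoded as a small function. The sketch below is only an illustration: the signal names in `IncidentSignals` are hypothetical, and it ignores qualifiers such as "at scale" and "many users" that a real policy would need to quantify.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class IncidentSignals:
    # Hypothetical signal names; map them to whatever your alerts report.
    harmful_output: bool = False
    data_leak: bool = False
    wrong_agent_actions: bool = False
    quality_regression: bool = False
    cost_runaway: bool = False


def classify(signals: IncidentSignals) -> Severity:
    """Map observed signals to the highest applicable severity level."""
    if signals.harmful_output or signals.data_leak or signals.wrong_agent_actions:
        return Severity.SEV1
    if signals.quality_regression or signals.cost_runaway:
        return Severity.SEV2
    # Everything else (latency, elevated error/refusal rate) defaults to SEV3.
    return Severity.SEV3
```

Checking the most severe conditions first means an incident that is both a cost runaway and a data leak is still treated as SEV1.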

The first 15 minutes

1. STOP THE BLEEDING
   - Roll back the last prompt / model / config change (one step - you versioned it)
   - If an agent is taking actions: disable the high-risk tools / force approval mode
   - If cost runaway: enable hard rate limits / budget cap

2. CONFIRM SCOPE
   - Which feature, which prompt_version, since when?
   - Pull traces for affected requests; check what changed (deploy log, provider status)

3. COMMUNICATE
   - Open the incident channel, assign a lead, post a one-line status

The first move is almost always to revert to the last known-good version. Being able to do that in one step is why versioning and one-step rollback are non-negotiable in the Production Checklist.
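To make "one step" concrete, here is a minimal sketch of a versioned prompt store with a single rollback call. `PromptRegistry` is a hypothetical in-memory stand-in, not any particular tool's API; a real setup would persist versions and would likely cover model and config settings as well.

```python
class PromptRegistry:
    """In-memory stand-in for a versioned prompt store (hypothetical)."""

    def __init__(self) -> None:
        self._versions = []  # every deployed prompt, oldest first
        self._active = -1    # index of the version currently served

    def deploy(self, prompt: str) -> int:
        """Store a new version and make it active; return its index."""
        self._versions.append(prompt)
        self._active = len(self._versions) - 1
        return self._active

    def rollback(self) -> int:
        """Step back to the previous version in a single call."""
        if self._active <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1
        return self._active

    @property
    def active(self) -> str:
        return self._versions[self._active]


registry = PromptRegistry()
registry.deploy("You are a helpful support agent.")
registry.deploy("You are a helpful support agent. Be brief.")
registry.rollback()  # the first prompt is served again
```

Old versions are never deleted, so the rollback is a pointer move rather than a redeploy.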

Mitigation levers unique to LLMs

  • Roll back the prompt/model/config - the fastest fix for a quality or behaviour regression.
  • Switch the model / provider - if a provider update or outage is the cause and you have a gateway with fallback.
  • Tighten guardrails - raise input/output filtering, force human-in-the-loop for the affected workflow.
  • Disable tools or enable approval mode - for agents taking wrong actions.
  • Cap cost - hard rate limits and budget alerts to stop a runaway bill.
  • Fail safe - degrade to a canned response or human handoff rather than serving bad output.
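Several of these levers work best as runtime flags that on-call can flip without a deploy. The following sketch assumes a hypothetical `MitigationFlags` object and shows three of them: disabling tools, forcing approval mode, and failing safe once a budget cap is reached. The names, the canned message, and the default cap are illustrative only.

```python
from dataclasses import dataclass, field

CANNED_RESPONSE = "This feature is temporarily unavailable. A human will follow up."


@dataclass
class MitigationFlags:
    # Hypothetical flags an on-call engineer could flip during an incident.
    disabled_tools: set[str] = field(default_factory=set)
    approval_required: bool = False
    fail_safe: bool = False
    budget_cap_usd: float = 100.0


def tool_decision(flags: MitigationFlags, tool: str) -> str:
    """Return 'deny', 'needs_approval' or 'allow' for an agent tool call."""
    if tool in flags.disabled_tools:
        return "deny"
    if flags.approval_required:
        return "needs_approval"
    return "allow"


def respond(flags: MitigationFlags, spent_usd: float, generate) -> str:
    """Serve the canned response when fail-safe is on or the cap is reached."""
    if flags.fail_safe or spent_usd >= flags.budget_cap_usd:
        return CANNED_RESPONSE
    return generate()  # `generate` stands in for the real model call
```

Because the checks run before the model call, a tripped flag also stops further spend on the affected path.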

What you need in place beforehand

An incident is the wrong time to discover you can’t see anything:

  • Traces with prompt_version, inputs and outputs, searchable.
  • A deploy/change log so you can correlate the incident with a change.
  • A subscription to your providers’ status pages.
  • A documented rollback path that on-call has actually practised.
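As an illustration of why traces and a change log belong together, the sketch below finds the most recent change that preceded the first failing trace. The `Trace` and `Change` records are hypothetical shapes, and the `ok` flag assumes you already have some per-request pass/fail signal (an eval score, a user report, a filter hit).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Trace:
    timestamp: datetime
    prompt_version: str
    ok: bool  # assumed per-request quality or safety signal


@dataclass
class Change:
    timestamp: datetime
    description: str


def first_failure(traces):
    """Return the earliest failing trace, or None if all traces passed."""
    failures = [t for t in traces if not t.ok]
    return min(failures, key=lambda t: t.timestamp) if failures else None


def suspect_change(traces, changes):
    """Return the latest change made at or before the first failing trace."""
    failure = first_failure(traces)
    if failure is None:
        return None
    earlier = [c for c in changes if c.timestamp <= failure.timestamp]
    return max(earlier, key=lambda c: c.timestamp) if earlier else None
```

The result is only a suspect, not a root cause: a provider-side model update will not appear in your own change log, which is why the status-page subscription is on the list too.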

Post-mortem structure

Blameless, and tuned for LLM causes:

- Summary & impact (who, how many, how long)
- Timeline (change → symptom → detection → mitigation → resolution)
- Root cause: prompt | model/provider | retrieval | data | infra | misuse
- Detection gap: why didn't an eval / alert catch this earlier?
- Action items:
    - Add a regression case to the eval set
    - Add / tune the alert that would have caught it
    - Fix the rollback or guardrail gap

The most valuable output of any LLM incident is a new eval case and a new alert, so the same failure cannot recur unnoticed. Tie this into your governance layer and the regulated workflow blueprint if you operate under audit requirements.
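A regression case can be as small as the offending input plus a check that the bad output does not come back. This is a minimal sketch under stated assumptions: `EvalCase` and `run_case` are hypothetical names, `model` is any callable from prompt text to output text, and a substring check stands in for whatever grader your eval set really uses.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    name: str
    input_text: str
    must_not_contain: str  # string that appeared in the bad output


def run_case(case: EvalCase, model) -> bool:
    """Pass when the model output does not contain the forbidden string."""
    output = model(case.input_text)
    return case.must_not_contain.lower() not in output.lower()


# Hypothetical case captured from an incident where the app promised refunds.
case = EvalCase(
    name="no-unauthorised-refund-promise",
    input_text="Can I get my money back?",
    must_not_contain="full refund guaranteed",
)
```

Running the case against every prompt or model change before release is what turns the post-mortem action item into an actual guard.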
