LLM incident response template
A ready-to-adapt incident response template for LLM applications - severity levels, the first 15 minutes, mitigation levers unique to LLMs, and a post-mortem structure.
LLM apps fail in ways classical services don’t: a prompt edit tanks quality, a provider ships a model update overnight, an injection makes the agent misbehave, or a traffic spike triples the bill. Your normal incident process mostly applies - but the levers are different. Here’s a template to adapt.
Severity levels
Define these before you need them:
| Sev | Example | Response |
|---|---|---|
| SEV1 | Harmful output, data leak, or agent taking wrong actions at scale | All-hands, immediate mitigation, comms |
| SEV2 | Quality regression affecting many users; cost runaway | On-call owns, mitigate within the hour |
| SEV3 | Degraded latency, elevated error/refusal rate | Triage in business hours |
The LLM-specific additions to your usual list: harmful/unsafe output, prompt-injection exploitation, hallucination spike, and cost runaway.
The first 15 minutes
1. STOP THE BLEEDING
- Roll back the last prompt / model / config change (one step - you versioned it)
- If an agent is taking actions: disable the high-risk tools / force approval mode
- If cost runaway: enable hard rate limits / budget cap
2. CONFIRM SCOPE
- Which feature, which prompt_version, since when?
- Pull traces for affected requests; check what changed (deploy log, provider status)
3. COMMUNICATE
- Open the incident channel, assign a lead, post a one-line status
The first move is almost always revert to the last known-good version. The ability to do that in one step is why versioning and one-step rollback are non-negotiable in the Production Checklist.
Mitigation levers unique to LLMs
- Roll back the prompt/model/config - the fastest fix for a quality or behaviour regression.
- Switch the model / provider - if a provider update or outage is the cause and you have a gateway with fallback.
- Tighten guardrails - raise input/output filtering, force human-in-the-loop for the affected workflow.
- Disable tools or enable approval mode - for agents taking wrong actions.
- Cap cost - hard rate limits and budget alerts to stop a runaway bill.
- Fail safe - degrade to a canned response or human handoff rather than serving bad output.
What you need in place beforehand
An incident is the wrong time to discover you can’t see anything:
- Traces with
prompt_version, inputs and outputs, searchable. - A deploy/change log so you can correlate the incident with a change.
- A subscription to your providers’ status pages.
- A documented rollback path that on-call has actually practised.
Post-mortem structure
Blameless, and tuned for LLM causes:
- Summary & impact (who, how many, how long)
- Timeline (change → symptom → detection → mitigation → resolution)
- Root cause: prompt | model/provider | retrieval | data | infra | misuse
- Detection gap: why didn't an eval / alert catch this earlier?
- Action items:
- Add a regression case to the eval set
- Add / tune the alert that would have caught it
- Fix the rollback or guardrail gap
The most valuable output of any LLM incident is a new eval case and a new alert - so the same failure can never silently recur. Tie this into your governance layer and the regulated workflow blueprint if you operate under audit requirements.