LLM incident response template

A ready-to-adapt incident response template for LLM applications - severity levels, the first 15 minutes, mitigation levers unique to LLMs, and a post-mortem structure.

LLM apps fail in ways classical services don’t: a prompt edit tanks quality, a provider ships a model update overnight, an injection makes the agent misbehave, or a traffic spike triples the bill. Your normal incident process mostly applies - but the levers are different. Here’s a template to adapt.

Severity levels

Define these before you need them:

Sev	Example	Response
SEV1	Harmful output, data leak, or agent taking wrong actions at scale	All-hands, immediate mitigation, comms
SEV2	Quality regression affecting many users; cost runaway	On-call owns, mitigate within the hour
SEV3	Degraded latency, elevated error/refusal rate	Triage in business hours

The LLM-specific additions to your usual list: harmful/unsafe output, prompt-injection exploitation, hallucination spike, and cost runaway.
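One way to make the table usable by on-call tooling is to encode it as a small classifier. The sketch below is illustrative only: the `IncidentSignal` field names and the `classify` function are hypothetical, and the thresholds behind each flag (what counts as "many users" or "at scale") are left for you to define.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Response expectations copied from the severity table above.
RESPONSE = {
    Severity.SEV1: "All-hands, immediate mitigation, comms",
    Severity.SEV2: "On-call owns, mitigate within the hour",
    Severity.SEV3: "Triage in business hours",
}


@dataclass
class IncidentSignal:
    # Hypothetical flags; map them to whatever your alerts and reports provide.
    harmful_output: bool = False
    data_leak: bool = False
    wrong_agent_actions_at_scale: bool = False
    quality_regression_many_users: bool = False
    cost_runaway: bool = False


def classify(signal: IncidentSignal) -> Severity:
    """Return the highest severity whose conditions match the signal."""
    if signal.harmful_output or signal.data_leak or signal.wrong_agent_actions_at_scale:
        return Severity.SEV1
    if signal.quality_regression_many_users or signal.cost_runaway:
        return Severity.SEV2
    # Everything else (degraded latency, elevated error/refusal rate) is SEV3.
    return Severity.SEV3
```

When several flags are set at once, the most severe one wins, so a cost runaway that also leaks data is treated as SEV1.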

The first 15 minutes

1. STOP THE BLEEDING
   - Roll back the last prompt / model / config change (a single step if you versioned it)
   - If an agent is taking actions: disable the high-risk tools / force approval mode
   - If cost runaway: enable hard rate limits / budget cap

2. CONFIRM SCOPE
   - Which feature, which prompt_version, since when?
   - Pull traces for affected requests; check what changed (deploy log, provider status)

3. COMMUNICATE
   - Open the incident channel, assign a lead, post a one-line status

The first move is almost always to revert to the last known-good version. That only works under pressure if the revert is a single step, which is why versioning and one-step rollback are non-negotiable in the Production Checklist.
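As a rough illustration of what "one-step rollback" can mean in code, here is a minimal in-memory sketch. `Release` and `ReleaseRegistry` are hypothetical names, not part of any particular library; a real setup would persist the history and trigger an actual deploy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Release:
    prompt_version: str
    model: str
    known_good: bool = False  # set once the release has passed evals in production


@dataclass
class ReleaseRegistry:
    history: list[Release] = field(default_factory=list)

    def deploy(self, release: Release) -> None:
        self.history.append(release)

    @property
    def active(self) -> Release:
        return self.history[-1]

    def rollback(self) -> Release:
        """Re-deploy the most recent known-good release before the active one."""
        for release in reversed(self.history[:-1]):
            if release.known_good:
                # Appending (rather than popping) keeps the rollback in the history.
                self.history.append(release)
                return release
        raise LookupError("no known-good release to roll back to")
```

Recording the rollback as a new entry means the change log still shows what was live during the incident.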

Mitigation levers unique to LLMs

Roll back the prompt/model/config - the fastest fix for a quality or behaviour regression.
Switch the model / provider - if a provider update or outage is the cause and you have a gateway with fallback.
Tighten guardrails - raise input/output filtering, force human-in-the-loop for the affected workflow.
Disable tools or enable approval mode - for agents taking wrong actions.
Cap cost - hard rate limits and budget alerts to stop a runaway bill.
Fail safe - degrade to a canned response or human handoff rather than serving bad output.
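Three of these levers (provider fallback, fail-safe degradation, and a hard cost cap) can be sketched in a few lines. Everything here is an assumption made for illustration: the providers are plain callables standing in for your gateway, `output_ok` stands in for whatever guardrail check you run, and `BudgetCap` tracks spend in memory only.

```python
from typing import Callable, Sequence

# Hypothetical canned reply; use wording appropriate to your product.
CANNED_RESPONSE = "Sorry, I can't help with that right now. A person will follow up."


def answer_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    output_ok: Callable[[str], bool],
) -> str:
    """Try each provider in order; fail safe to a canned response."""
    for call in providers:
        try:
            reply = call(prompt)
        except Exception:
            continue  # provider outage or error: try the next one
        if output_ok(reply):
            return reply
    # Every provider failed or produced output the guardrail rejected.
    return CANNED_RESPONSE


class BudgetCap:
    """Hard spend limit: reject requests once the cap would be exceeded."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record the spend if it fits under the cap; return False to reject."""
        if self.spent_usd + cost_usd > self.limit_usd:
            return False
        self.spent_usd += cost_usd
        return True
```

The order of `providers` encodes your preference: the primary model first, then the fallbacks you trust for the affected workflow.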

What you need in place beforehand

An incident is the wrong time to discover you can’t see anything:

Traces with prompt_version, inputs and outputs, searchable.
A deploy/change log so you can correlate the incident with a change.
A subscription to your providers’ status pages.
A documented rollback path that on-call has actually practised.
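To show why the trace fields and the change log matter together, here is a small sketch that filters traces by `prompt_version` and finds the most recent change before the first symptom. The `Trace` and `Change` record shapes and both helper functions are hypothetical; your tracing and deploy tooling will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trace:
    at: datetime
    prompt_version: str
    input: str
    output: str


@dataclass(frozen=True)
class Change:
    at: datetime
    kind: str  # e.g. "prompt", "model", "config"
    description: str


def traces_for_version(traces: Sequence[Trace], prompt_version: str) -> list[Trace]:
    """Pull the traces served by one prompt version."""
    return [t for t in traces if t.prompt_version == prompt_version]


def suspect_change(changes: Sequence[Change], first_symptom_at: datetime) -> Optional[Change]:
    """Return the latest change at or before the first symptom, if any."""
    earlier = [c for c in changes if c.at <= first_symptom_at]
    return max(earlier, key=lambda c: c.at, default=None)
```

The latest preceding change is only a first suspect, not a proven root cause; a provider-side update may not appear in your own change log at all, which is where the status-page subscription comes in.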

Post-mortem structure

Blameless, and tuned for LLM causes:

- Summary & impact (who, how many, how long)
- Timeline (change → symptom → detection → mitigation → resolution)
- Root cause: prompt | model/provider | retrieval | data | infra | misuse
- Detection gap: why didn't an eval / alert catch this earlier?
- Action items:
    - Add a regression case to the eval set
    - Add / tune the alert that would have caught it
    - Fix the rollback or guardrail gap

The most valuable output of any LLM incident is a new eval case and a new alert, so that the same failure cannot recur silently. Tie this into your governance layer and the regulated workflow blueprint if you operate under audit requirements.
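A regression case can be as simple as the offending prompt plus a check on the output. The sketch below assumes a substring check is enough to detect the failure, which will not hold for every incident; `RegressionCase`, `run_case`, and the `generate` callable are hypothetical names standing in for your eval harness and model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    incident_id: str  # links the case back to the post-mortem
    prompt: str  # the input that triggered the failure
    must_not_contain: tuple[str, ...]  # strings that marked the bad output


def run_case(case: RegressionCase, generate: Callable[[str], str]) -> bool:
    """Return True if the current system no longer reproduces the failure."""
    output = generate(case.prompt).lower()
    return not any(bad.lower() in output for bad in case.must_not_contain)
```

Running such cases on every prompt or model change turns the incident into a permanent check rather than a one-off fix.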
