LLM Deployment: CI/CD, staging and one-step rollback

How to ship LLM changes safely - CI with eval gates, a staging mirror, progressive rollout, config-driven rollback and provider fallback. The deployment layer of the LLMOps stack.

An LLM change - a new prompt, a model swap, a retrieval tweak - is a deploy like any other, and deserves the same safety net: tested in CI, rolled out progressively, reversible in one step. Skipping that is how a one-line prompt edit becomes a production incident.

The two questions

Prototype: “It runs on my machine.”
Production: “Can we ship this change safely - and undo it in seconds if it regresses?”

A demo is deployed once, by hand. Production needs every change to flow through a pipeline that catches regressions before users do and can revert without a firefight.

What breaks without it

Prompt edits straight to prod. Someone tweaks the live prompt to fix one complaint and silently regresses ten others. With no eval run and no version history, the resulting quality drop is hard to trace back to the edit.
No staging. Changes are validated in production, on real users.
No rollback. A bad release means a frantic redeploy of application code instead of flipping one value.
No fallback. Your sole provider has an outage and the whole feature is down.
Big-bang releases. A change ships to 100% of traffic at once; there’s no small blast radius to learn from.

The controls that matter

CI with an eval gate - every change runs the eval set and can’t merge on a regression.
A staging environment that mirrors production (same retrieval, same config).
Progressive rollout - canary or percentage-based, so a bad change hits few users.
One-step rollback - driven by config, not a code redeploy.
Provider / model fallback - automatic failover when a provider errors or times out.
Change management - for regulated contexts, a documented, auditable process.
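
The provider fallback control above can be sketched as an ordered list of providers tried in turn. This is a minimal illustration rather than any particular SDK: `call_primary`, `call_secondary` and `ProviderError` are hypothetical stand-ins for real provider clients and their error types.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response).

    Raises ProviderError only if every provider fails.
    """
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except (ProviderError, TimeoutError) as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def call_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def call_secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


used, answer = complete_with_fallback(
    "What is our refund policy?",
    [("primary", call_primary), ("secondary", call_secondary)],
)
print(used, "->", answer)
```

Returning the provider name alongside the response lets you log how often traffic is actually being served by the fallback.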

Instrument it: CI with an eval gate

Wire evals into the pipeline so quality is a merge condition, not a hope:

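# .github/workflows/deploy.yml (sketch)
name: eval-gate-and-canary
on:
  pull_request:
    paths: ['prompts/**', 'src/**', 'evals/**']
  push:
    branches: [main]   # the canary job only runs on pushes to main, after merge
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # run_evals.py must exit non-zero if pass_rate drops below threshold
      # or a "critical" case regresses; that is what fails the job
      - run: python run_evals.py --data evals/
  deploy-canary:
    needs: evaluate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh --canary 5   # 5% of traffic first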
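
The workflow only gates merges if `run_evals.py` exits non-zero on a regression. A minimal sketch of that gate logic follows. The `CaseResult` shape, the 0.90 threshold and the `critical` flag are assumptions for illustration, and "regresses" is simplified here to "a critical case fails" rather than a comparison against a stored baseline.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    critical: bool = False


def gate(results: list[CaseResult], min_pass_rate: float = 0.90) -> list[str]:
    """Return the reasons the gate fails; an empty list means it passes."""
    if not results:
        return ["no eval cases ran"]
    reasons: list[str] = []
    pass_rate = sum(r.passed for r in results) / len(results)
    if pass_rate < min_pass_rate:
        reasons.append(f"pass_rate {pass_rate:.2f} < {min_pass_rate:.2f}")
    for r in results:
        if r.critical and not r.passed:
            reasons.append(f"critical case failed: {r.case_id}")
    return reasons


def main(results: list[CaseResult]) -> int:
    """Print each failure reason and return a process exit code."""
    reasons = gate(results)
    for reason in reasons:
        print("GATE FAIL:", reason)
    return 1 if reasons else 0  # non-zero exit fails the CI job


# Illustrative results; a real run would score the cases in evals/.
demo = [
    CaseResult("refund-policy", passed=True, critical=True),
    CaseResult("tone-check", passed=True),
    CaseResult("citation-format", passed=False),
]
exit_code = main(demo)
print("exit code:", exit_code)  # in CI, pass this to sys.exit()
```

Treating an empty result set as a failure matters: otherwise a broken data path would let every change through.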

Instrument it: rollback by config, not redeploy

Load the active prompt and model from configuration so a revert is a value change, not a deploy:

# config (env var, feature flag, or small config service)
ACTIVE = {"prompt_version": "rag-answer-v4", "model": "sonnet-4.6"}

# rollback = point back at the previous known-good version, instantly
# ACTIVE = {"prompt_version": "rag-answer-v3", "model": "sonnet-4.6"}
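
One way to resolve that value at request time is to read it from the environment and validate it before use. The sketch below assumes a hypothetical `LLM_ACTIVE_CONFIG` environment variable holding JSON and an in-memory `PROMPTS` registry; neither name comes from a specific tool.

```python
import json
import os

# Hypothetical registry of versioned prompts; in practice these would be
# loaded from files or a prompt store.
PROMPTS = {
    "rag-answer-v3": "Answer using only the context below.\n{context}\nQ: {question}",
    "rag-answer-v4": "Answer concisely and cite sources.\n{context}\nQ: {question}",
}

# Known-good configuration used whenever the override is unusable.
DEFAULT_ACTIVE = {"prompt_version": "rag-answer-v3", "model": "sonnet-4.6"}


def load_active(env=os.environ) -> dict:
    """Read the active prompt/model from LLM_ACTIVE_CONFIG (an assumed env var).

    Falls back to the known-good default if the value is missing, malformed,
    or names a prompt version that does not exist.
    """
    raw = env.get("LLM_ACTIVE_CONFIG")
    if raw is None:
        return dict(DEFAULT_ACTIVE)
    try:
        active = json.loads(raw)
    except json.JSONDecodeError:
        return dict(DEFAULT_ACTIVE)
    if not isinstance(active, dict) or active.get("prompt_version") not in PROMPTS:
        return dict(DEFAULT_ACTIVE)
    # Keys missing from the override keep their default values.
    return {**DEFAULT_ACTIVE, **active}


active = load_active({"LLM_ACTIVE_CONFIG": '{"prompt_version": "rag-answer-v4"}'})
print(active["prompt_version"], active["model"])
```

Falling back to a known-good default means a typo in the config cannot take the feature down. A real system would also log or alert on the fallback so the bad value gets noticed.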

Pair this with a canary check: route a slice of traffic to the new version, watch latency, error rate and evals, and promote or roll back automatically.
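
A canary check of this kind has two parts: deciding which requests see the new version, and deciding whether to promote it. The sketch below is one possible shape. The metric names (`error_rate`, `p95_latency_ms`, `eval_pass_rate`) and all thresholds are illustrative assumptions, not recommendations.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary slice.

    Hashing the user id keeps each user on the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent


def decide(
    baseline: dict,
    canary: dict,
    max_error_delta: float = 0.01,
    max_latency_ratio: float = 1.2,
    max_eval_drop: float = 0.02,
) -> str:
    """Return "promote" or "rollback" by comparing canary metrics to baseline."""
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        return "rollback"
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        return "rollback"
    if baseline["eval_pass_rate"] - canary["eval_pass_rate"] > max_eval_drop:
        return "rollback"
    return "promote"


baseline = {"error_rate": 0.010, "p95_latency_ms": 1800, "eval_pass_rate": 0.94}
canary = {"error_rate": 0.012, "p95_latency_ms": 1900, "eval_pass_rate": 0.95}
print(decide(baseline, canary))
```

A "rollback" result would then point the active config back at the previous known-good version, which is what keeps the whole loop free of code redeploys.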

Minimal vs mature

Aspect	Minimal	Production-grade
Testing	Manual check	Eval gate in CI
Environments	Prod only	Staging mirrors prod
Rollout	All at once	Canary / percentage
Rollback	Code redeploy	One-step, config-driven
Resilience	Single provider	Cross-provider fallback

Where this lives in a real system

Provider failover and routing belong in a control point - see the multi-provider gateway reference architecture. Rollback depends on treating prompts as versioned artifacts, covered in prompt versioning with GitHub. And the deployment items in the Production Checklist are the bar to clear before you ship.
