LLM Deployment: CI/CD, staging and one-step rollback

How to ship LLM changes safely - CI with eval gates, a staging mirror, progressive rollout, config-driven rollback and provider fallback. The deployment layer of the LLMOps stack.

June 8, 2026 · 9 min read · deployment · ci-cd · rollback

An LLM change - a new prompt, a model swap, a retrieval tweak - is a deploy like any other, and deserves the same safety net: tested in CI, rolled out progressively, reversible in one step. Skipping that is how a one-line prompt edit becomes a production incident.

The two questions

Prototype: “It runs on my machine.” Production: “Can we ship this change safely - and undo it in seconds if it regresses?”

A demo is deployed once, by hand. Production needs every change to flow through a pipeline that catches regressions before users do and can revert without a firefight.

What breaks without it

  • Prompt edits straight to prod. Someone tweaks the live prompt to fix one complaint and silently regresses ten others, a common cause of unexplained quality drops.
  • No staging. Changes are validated in production, on real users.
  • No rollback. A bad release means a frantic redeploy of application code instead of flipping one value.
  • No fallback. Your sole provider has an outage and the whole feature is down.
  • Big-bang releases. A change ships to 100% of traffic at once; there’s no small blast radius to learn from.

The controls that matter

  • CI with an eval gate - every change runs the eval set and can’t merge on a regression.
  • A staging environment that mirrors production (same retrieval, same config).
  • Progressive rollout - canary or percentage-based, so a bad change hits few users.
  • One-step rollback - driven by config, not a code redeploy.
  • Provider / model fallback - automatic failover when a provider errors or times out.
  • Change management - for regulated contexts, a documented, auditable process.
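Provider fallback from the list above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `call_with_fallback`, `AllProvidersFailed` and the stub providers are hypothetical names, and each provider callable is assumed to enforce its own timeout and raise an exception on failure.

```python
import time
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, callable taking a prompt)


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has errored or timed out."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    retries_per_provider: int = 0,
    backoff_s: float = 0.0,
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # assumption: any exception means "try the next option"
                errors.append(f"{name} attempt {attempt + 1}: {exc!r}")
                if backoff_s:
                    time.sleep(backoff_s * (attempt + 1))
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def healthy_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = call_with_fallback(
    "hi", [("primary", flaky_primary), ("secondary", healthy_secondary)]
)
```

Returning the provider name alongside the response makes it possible to log how often the fallback path is taken, which is itself a useful health signal.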

Instrument it: CI with an eval gate

Wire evals into the pipeline so quality is a merge condition, not a hope:

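# .github/workflows/deploy.yml (sketch)
on:
  pull_request:
    paths: ['prompts/**', 'src/**', 'evals/**']
  push:
    branches: [main]   # lets deploy-canary run; on pull_request events github.ref is refs/pull/<n>/merge, never main
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # run_evals.py must exit non-zero if pass_rate drops below threshold or a "critical" case regresses
      - run: python run_evals.py --data evals/
  deploy-canary:
    needs: evaluate
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh --canary 5   # 5% of traffic first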
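The gate logic that `run_evals.py` applies could look roughly like the sketch below. The article does not show that script, so everything here is an assumption: `CaseResult`, `gate`, `exit_code`, the 0.9 threshold, and the idea of comparing against a stored set of case IDs that passed on the main branch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool
    critical: bool = False


def gate(
    results: Iterable[CaseResult],
    baseline_passed: Set[str],
    min_pass_rate: float = 0.9,
) -> Tuple[bool, List[str]]:
    """Return (ok, reasons). ok is False if the pass rate is too low
    or a critical case that passed on the baseline now fails."""
    results = list(results)
    if not results:
        return False, ["no eval cases ran"]
    reasons: List[str] = []
    pass_rate = sum(r.passed for r in results) / len(results)
    if pass_rate < min_pass_rate:
        reasons.append(f"pass_rate {pass_rate:.2f} < {min_pass_rate:.2f}")
    for r in results:
        if r.critical and not r.passed and r.case_id in baseline_passed:
            reasons.append(f"critical case regressed: {r.case_id}")
    return (not reasons), reasons


def exit_code(results: Iterable[CaseResult], baseline_passed: Set[str],
              min_pass_rate: float = 0.9) -> int:
    """0 lets the CI job pass, 1 fails it and blocks the merge."""
    ok, reasons = gate(results, baseline_passed, min_pass_rate)
    for reason in reasons:
        print(f"EVAL GATE FAILED: {reason}")
    return 0 if ok else 1


demo = [
    CaseResult("refund-policy", passed=True),
    CaseResult("pii-redaction", passed=False, critical=True),
    CaseResult("greeting", passed=True),
]
ok, reasons = gate(demo, baseline_passed={"pii-redaction"}, min_pass_rate=0.5)
```

A real script would end with `sys.exit(exit_code(...))` so that the CI step fails on a non-zero status. Checking critical cases separately from the aggregate pass rate means one important regression cannot hide behind an otherwise healthy average.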

Instrument it: rollback by config, not redeploy

Load the active prompt and model from configuration so a revert is a value change, not a deploy:

# config (env var, feature flag, or small config service)
ACTIVE = {"prompt_version": "rag-answer-v4", "model": "sonnet-4.6"}

# rollback = point back at the previous known-good version, instantly
# ACTIVE = {"prompt_version": "rag-answer-v3", "model": "sonnet-4.6"}
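One way to read that value at request time is sketched below. The environment variable name `LLM_ACTIVE_CONFIG`, the `load_active` helper, and the choice to fall back to a known-good default on missing or malformed input are all assumptions for illustration; a feature-flag client or config service could supply the same mapping.

```python
import json
import os
from typing import Dict, Mapping

# Assumed last known-good version, used when the config is missing or invalid.
DEFAULT_ACTIVE: Dict[str, str] = {"prompt_version": "rag-answer-v3", "model": "sonnet-4.6"}
REQUIRED_KEYS = ("prompt_version", "model")


def load_active(env: Mapping[str, str] = os.environ,
                var: str = "LLM_ACTIVE_CONFIG") -> Dict[str, str]:
    """Parse the active prompt/model from a JSON string held in `env[var]`."""
    raw = env.get(var)
    if raw is None:
        return dict(DEFAULT_ACTIVE)
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError:
        return dict(DEFAULT_ACTIVE)  # malformed config: serve the known-good version
    if not isinstance(cfg, dict) or not set(REQUIRED_KEYS).issubset(cfg):
        return dict(DEFAULT_ACTIVE)
    return {key: str(cfg[key]) for key in REQUIRED_KEYS}


# Rolling forward and rolling back are both just a change to the stored value.
v4 = load_active({"LLM_ACTIVE_CONFIG": '{"prompt_version": "rag-answer-v4", "model": "sonnet-4.6"}'})
rolled_back = load_active({"LLM_ACTIVE_CONFIG": '{"prompt_version": "rag-answer-v3", "model": "sonnet-4.6"}'})
```

Passing the mapping in as an argument keeps the loader testable and leaves the choice of backing store open.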

Pair this with a canary check: route a slice of traffic to the new version, compare its latency, error rate and eval scores against the current version, and promote or roll back automatically when agreed thresholds are crossed.
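A canary check has two parts: deciding which requests see the new version, and deciding whether to promote it. The sketch below shows one possible shape for both. The names `in_canary`, `Metrics` and `decide`, the hash-bucket scheme, and every threshold value are illustrative assumptions, not figures from a real deployment.

```python
import hashlib
from dataclasses import dataclass


def in_canary(user_id: str, percent: float, salt: str = "rag-answer-v4") -> bool:
    """Deterministically place a user in the canary slice.

    Hashing (salt, user_id) keeps each user on the same side across requests;
    changing the salt per rollout reshuffles who lands in the canary.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5 -> buckets 0..499


@dataclass(frozen=True)
class Metrics:
    p95_latency_ms: float
    error_rate: float       # fraction of failed requests, 0..1
    eval_pass_rate: float   # fraction of eval cases passing, 0..1


def decide(canary: Metrics, baseline: Metrics,
           max_latency_ratio: float = 1.2,
           max_error_delta: float = 0.01,
           max_eval_drop: float = 0.02) -> str:
    """Return "promote" or "rollback" by comparing canary to baseline."""
    if canary.error_rate > baseline.error_rate + max_error_delta:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    if canary.eval_pass_rate < baseline.eval_pass_rate - max_eval_drop:
        return "rollback"
    return "promote"


baseline = Metrics(p95_latency_ms=1000.0, error_rate=0.01, eval_pass_rate=0.95)
healthy = Metrics(p95_latency_ms=1050.0, error_rate=0.01, eval_pass_rate=0.96)
degraded = Metrics(p95_latency_ms=1050.0, error_rate=0.05, eval_pass_rate=0.96)
```

With the config-driven setup above, a "rollback" decision only has to write the previous `prompt_version` back to the config store; no application code is redeployed.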

Minimal vs mature

Aspect       | Minimal         | Production-grade
Testing      | Manual check    | Eval gate in CI
Environments | Prod only       | Staging mirrors prod
Rollout      | All at once     | Canary / percentage
Rollback     | Code redeploy   | One-step, config-driven
Resilience   | Single provider | Cross-provider fallback

Where this lives in a real system

Provider failover and routing belong in a control point - see the multi-provider gateway reference architecture. Rollback depends on treating prompts as versioned artifacts, covered in prompt versioning with GitHub. And the deployment items in the Production Checklist are the bar to clear before you ship.