Reference architecture

Multi-provider model gateway

A gateway in front of multiple model providers - routing on cost, quality and latency, with caching, failover and one place to observe it all.

01 Architecture

As soon as you depend on more than one model or provider, you need a control point. The gateway centralises routing, caching, rate limiting, key management and observability - and gives you provider failover instead of a single point of failure.

Application (unified API) → Gateway (cache · rate limit) → Router (cost · quality · latency) → Providers (A · B · C + fallback) → Response (unified tracing)
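The flow above can be sketched as a single request handler. This is a minimal illustration rather than a reference implementation: `Gateway`, `handle`, the `route` callback and the fixed one-minute rate-limit window are all hypothetical names and choices, and providers are modelled as plain callables.

```python
import time
from typing import Callable, Dict, List

# Hypothetical provider callable: takes a prompt, returns generated text.
Provider = Callable[[str], str]


class Gateway:
    def __init__(self, providers: Dict[str, Provider],
                 route: Callable[[str], List[str]],
                 max_requests_per_minute: int = 60) -> None:
        self.providers = providers
        self.route = route  # returns provider names, preferred first
        self.max_rpm = max_requests_per_minute
        # Keyed on the prompt alone for brevity; a real cache should key on
        # the full request context.
        self.cache: Dict[str, str] = {}
        self.window_start = time.monotonic()
        self.window_count = 0

    def _within_rate_limit(self) -> bool:
        # Fixed one-minute window shared by all callers (an assumption).
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.window_count = now, 0
        self.window_count += 1
        return self.window_count <= self.max_rpm

    def handle(self, prompt: str) -> dict:
        started = time.monotonic()
        if not self._within_rate_limit():
            return {"status": "rate_limited"}

        def done(text, provider, cache_hit, fallback_used):
            return {
                "status": "ok",
                "text": text,
                "provider": provider,
                "cache_hit": cache_hit,
                "fallback_used": fallback_used,
                "latency_ms": int((time.monotonic() - started) * 1000),
            }

        if prompt in self.cache:
            return done(self.cache[prompt], None, True, False)

        for attempt, name in enumerate(self.route(prompt)):
            try:
                text = self.providers[name](prompt)
            except Exception:
                continue  # this provider failed; try the next one
            self.cache[prompt] = text
            return done(text, name, False, attempt > 0)
        return {"status": "all_providers_failed"}
```

The returned dictionary carries the fields a unified trace would need (`provider`, `cache_hit`, `fallback_used`, `latency_ms`), so the application sees one response shape regardless of which provider answered.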

02 When to use it

Use this when

You depend on more than one model or provider
You need failover or cost/quality/latency routing
You want one place to observe and cap spend

Reach for something else when

A single provider and model fully cover your needs
The extra network hop’s latency is unacceptable
You have no routing or fallback policy yet

03 Components

What's in the box.

Gateway / proxy

Single entry point: caching, rate limiting and request normalisation.

Router

Chooses a provider/model per request on cost, quality or latency rules.

Provider adapters

Normalise each provider’s API behind one interface.

Fallback logic

Retries on another provider when one errors or times out.

Cache

Serves repeated requests without hitting a provider.

Key management

Centralises and rotates provider keys; never exposes them to clients.

Unified observability

One trace, cost and latency view across all providers.
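To make the adapter, router and fallback components concrete, here is a small sketch of how they might fit together. Everything named here (`ProviderAdapter`, `Route`, `rank`, `complete_with_fallback`, and the per-route cost, quality and latency figures) is an assumption for illustration; real routing rules would likely be richer than a single sort key.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


class ProviderError(Exception):
    """Raised by an adapter when its provider errors or times out."""


class ProviderAdapter(Protocol):
    """One interface that every provider-specific adapter implements."""
    name: str

    def complete(self, prompt: str, timeout_s: float) -> str: ...


@dataclass(frozen=True)
class Route:
    adapter: ProviderAdapter
    cost_per_1k_tokens: float  # illustrative figure, not a real price
    quality_score: float       # e.g. pass rate on your eval set, 0..1
    p50_latency_ms: float


def rank(routes: List[Route], policy: str) -> List[Route]:
    """Order candidate routes, best first, for a 'cost', 'quality' or 'latency' policy."""
    sort_keys = {
        "cost": lambda r: r.cost_per_1k_tokens,
        "quality": lambda r: -r.quality_score,  # higher quality sorts first
        "latency": lambda r: r.p50_latency_ms,
    }
    return sorted(routes, key=sort_keys[policy])


def complete_with_fallback(routes: List[Route], policy: str, prompt: str,
                           timeout_s: float = 10.0) -> Tuple[str, str, bool]:
    """Try routes in ranked order; return (text, provider_name, fallback_used)."""
    errors = []
    for attempt, route in enumerate(rank(routes, policy)):
        try:
            text = route.adapter.complete(prompt, timeout_s)
            return text, route.adapter.name, attempt > 0
        except ProviderError as exc:
            errors.append(f"{route.adapter.name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Because the router only produces an ordering, fallback is simply "try the next route in that order", which keeps the failover path and the routing policy from drifting apart.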

04 Failure modes

Where it breaks - and the fix.

Provider outage with no fallback

Configure cross-provider failover and health checks.

Inconsistent outputs across providers

Run the eval set per provider/model; pin defaults per use case.

Cache returning wrong/stale results

Key the cache on the full request context; set sensible TTLs; bypass the cache for volatile inputs.

Key leakage

Keep keys server-side in the gateway; rotate regularly.

Cost blowout from misrouting

Use budget-aware routing and per-team alerts; cap expensive models.
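The stale-cache fix above is the easiest one to get subtly wrong, so here is a hedged sketch of it. It assumes the "full request context" means the model, the messages and the sampling parameters, and that callers mark volatile requests with a `no_cache` flag; both are illustrative choices, as are the names `cache_key`, `TTLCache` and `should_bypass`.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(model: str, messages: list, params: dict) -> str:
    """Hash everything that can change the answer, not just the last prompt."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,            # same context in a different key order -> same key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_bypass(params: dict) -> bool:
    """Hypothetical opt-out flag for requests whose answers go stale quickly."""
    return bool(params.get("no_cache", False))


class TTLCache:
    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]   # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl_s, value)
```

Including the model and parameters in the key means a request routed to a different model never receives another model's cached answer.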

05 Metrics to monitor

What good looks like, measured.

Provider availability

Health that drives failover.

Cache hit rate

Requests served without a provider call.

Cross-provider eval parity

Quality consistency across providers.

Cost per route

Whether routing is actually saving money.

Fallback rate

How often the primary provider fails.
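Three of these metrics can be derived directly from per-request trace records. The sketch below assumes each record has the `route`, `cache_hit`, `fallback_used` and `cost_usd` fields used in this page's example trace schema; the function name `summarize` and the choice to compute fallback rate over provider calls only (excluding cache hits) are assumptions. Provider availability and eval parity need health-check and eval data, so they are left out.

```python
from collections import defaultdict
from typing import Dict, List


def summarize(traces: List[dict]) -> dict:
    """Aggregate cache hit rate, fallback rate and cost per route from traces."""
    total = len(traces)
    if total == 0:
        return {"cache_hit_rate": 0.0, "fallback_rate": 0.0, "cost_per_route": {}}

    cache_hits = sum(1 for t in traces if t["cache_hit"])
    provider_calls = total - cache_hits
    fallbacks = sum(1 for t in traces if t["fallback_used"])

    cost_per_route: Dict[str, float] = defaultdict(float)
    for t in traces:
        cost_per_route[t["route"]] += t["cost_usd"]

    return {
        "cache_hit_rate": cache_hits / total,
        # Only requests that reached a provider can fall back.
        "fallback_rate": fallbacks / provider_calls if provider_calls else 0.0,
        "cost_per_route": dict(cost_per_route),
    }
```

Comparing `cost_per_route` before and after a routing change is one simple way to check whether the change saved money.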

06 MVP vs production-grade

Don't build everything on day one.

Ship the MVP column to get to users; the production column is what makes it durable. Choose deliberately which gaps you're leaving.

Aspect	MVP	Production-grade
Routing	Static default	Cost/quality/latency-aware
Failover	None	Cross-provider with health checks
Cache	None	Context-keyed with TTLs
Keys	In the app	Server-side, rotated
Observability	Per provider	Unified traces + cost

07 Copy-paste schemas

Instrument it in minutes.

A starting point you can paste into your tracing and eval setup - then adapt to your stack.

Example trace schema

{
  "request_id": "req_5012",
  "architecture": "multi-provider-gateway",
  "route": "cost",
  "provider_chosen": "anthropic",
  "model": "haiku-4.5",
  "fallback_used": false,
  "cache_hit": true,
  "input_tokens": 0,
  "output_tokens": 0,
  "latency_ms": 38,
  "cost_usd": 0
}
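One way to emit records in this shape is a small dataclass plus a builder. This is a sketch under assumptions: `TraceRecord`, `build_trace` and the per-million-token price parameters are hypothetical, and the rule that a cache hit records zero tokens and zero cost is inferred from the example above rather than stated by it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TraceRecord:
    request_id: str
    architecture: str
    route: str
    provider_chosen: str
    model: str
    fallback_used: bool
    cache_hit: bool
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float


def build_trace(request_id: str, route: str, provider: str, model: str, *,
                fallback_used: bool, cache_hit: bool,
                input_tokens: int, output_tokens: int, latency_ms: int,
                usd_per_mtok_in: float, usd_per_mtok_out: float) -> TraceRecord:
    """Build one trace record; prices are caller-supplied, hypothetical values."""
    if cache_hit:
        # No provider call happened, so nothing was billed.
        input_tokens = output_tokens = 0
    cost = (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
    return TraceRecord(
        request_id=request_id,
        architecture="multi-provider-gateway",
        route=route,
        provider_chosen=provider,
        model=model,
        fallback_used=fallback_used,
        cache_hit=cache_hit,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_ms=latency_ms,
        cost_usd=cost,
    )


def to_json(record: TraceRecord) -> str:
    return json.dumps(asdict(record))
```

Building every record through one function keeps field names identical across providers, which is what makes a unified trace view possible.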

Example eval dataset row

{
  "input": "Summarize this ticket in one sentence",
  "expected_behavior": "Produce a consistent one-sentence summary regardless of provider",
  "must_include": [
    "single sentence"
  ],
  "must_not_include": [
    "provider-specific formatting artifacts"
  ],
  "risk_category": "cross_provider_consistency"
}
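A row like this can be scored once per provider to check consistency. In the sketch below, the rubric strings in `must_include` and `must_not_include` are mapped to hypothetical check functions; the sentence splitter and the formatting-artifact pattern are deliberately naive assumptions and would need tuning for real outputs.

```python
import re
from typing import Callable, Dict

ROW = {
    "input": "Summarize this ticket in one sentence",
    "must_include": ["single sentence"],
    "must_not_include": ["provider-specific formatting artifacts"],
    "risk_category": "cross_provider_consistency",
}


def is_single_sentence(text: str) -> bool:
    # Naive split on sentence-ending punctuation followed by whitespace.
    parts = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    return len(parts) == 1


def has_formatting_artifacts(text: str) -> bool:
    # Flags leading markdown headers or bullets, bold markers and backticks.
    return bool(re.search(r"(^\s*[#*-]\s|\*\*|`)", text, flags=re.MULTILINE))


# Hypothetical mapping from rubric labels to executable checks.
MUST_INCLUDE: Dict[str, Callable[[str], bool]] = {
    "single sentence": is_single_sentence,
}
MUST_NOT_INCLUDE: Dict[str, Callable[[str], bool]] = {
    "provider-specific formatting artifacts": has_formatting_artifacts,
}


def passes(row: dict, output: str) -> bool:
    included = all(MUST_INCLUDE[label](output) for label in row["must_include"])
    excluded = not any(MUST_NOT_INCLUDE[label](output) for label in row["must_not_include"])
    return included and excluded


def score_per_provider(row: dict, outputs: Dict[str, str]) -> Dict[str, bool]:
    """Return pass/fail for each provider's output on one eval row."""
    return {provider: passes(row, text) for provider, text in outputs.items()}
```

Running `score_per_provider` across the whole eval set and comparing pass rates by provider gives a rough parity signal.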

08 Checklist

Ship-ready when…

Cross-provider failover is configured and tested
The eval set runs per provider/model
The cache is keyed on the full request context, with sensible TTLs
Provider keys stay server-side and are rotated
Routing is budget-aware with per-team cost alerts
Tracing and cost are unified across providers


09 Related

Stack layers

Cost control · Deployment · Observability · Security

Deep dives

LLM Cost Control Tokens Caching Model Routing
LLM Observability What To Monitor In Production