LLM Cost Control: Tokens, caching and model routing
Token spend compounds quietly until a finance review forces a panic. A production guide to unit economics, caching, model routing, right-sizing and budget alerts - before the bill, not after.
LLM cost scales with traffic and context length, per token, and compounds quietly until a finance review forces a scramble. Cost control is keeping spend predictable and proportionate to value - before the invoice, not after.
The two questions
Prototype: “Can we afford the model?” Production: “What’s cost per user, per feature and per month - and what happens to the bill at 10× traffic?”
A demo runs a handful of calls. Production runs millions, and a token count that looked trivial per request becomes a five-figure line item.
What breaks in production
- The surprise bill. Spend triples after a launch and no one saw it coming.
- No attribution. You can’t say which feature, team or customer drove the cost, so you can’t fix it.
- Context bloat. Every request stuffs the full knowledge base into the prompt, paying for tokens the model doesn’t need.
- One model for everything. A frontier model answers trivial queries a model a tenth the price would handle fine.
- No alerts. You learn about a runaway from finance, not a notification.
Know your unit economics
Start with cost per request, then per active user. The math is simple - and it’s exactly what the Cost Calculator computes:
def request_cost(in_tokens, out_tokens, in_price, out_price, cache_hit=0.0):
# prices are USD per 1M tokens; cached input reads cost ~10% of input price
in_eff = in_tokens / 1e6 * in_price * ((1 - cache_hit) + cache_hit * 0.10)
out_cost = out_tokens / 1e6 * out_price
return in_eff + out_cost
# per active user / month ≈ request_cost × requests_per_user_per_day × 30
Run your real numbers through the calculator to get per-request, per-1,000-users and annual figures, and to compare a smaller model.
The three levers
- Caching. Cache what repeats - system prompts, retrieved context - and pay a fraction for cached reads. A 35% cache hit rate on a context-heavy app is real money.
- Routing. Send easy traffic to smaller, cheaper models; reserve the frontier model for the hard cases.
- Right-sizing. Cap max output tokens and context length per request - the cheapest token is the one you don’t send.
A minimal router:
def choose_model(query, context_len):
if context_len > 8000 or looks_complex(query):
return "sonnet-4.6" # frontier, for the hard cases
return "haiku-4.5" # cheaper, for the bulk of traffic
Any routing change affects quality, so gate it on your eval set - cheaper is only cheaper if it still passes.
Alert before the breach
Attribute spend to a team, feature or customer, and alert before a budget is hit - not after:
if month_to_date_spend(feature) > 0.8 * budget(feature):
notify(f"{feature} at 80% of monthly budget")
The token-usage signals from observability are what make this possible - you can only attribute what you measured.
Minimal vs mature
| Aspect | Minimal | Production-grade |
|---|---|---|
| Visibility | Total spend | Per request, per user, per feature |
| Caching | None | System prompt + context cached |
| Models | One for all | Routed by difficulty / length |
| Limits | None | Capped tokens + context |
| Budget | Reviewed quarterly | Alerted at a threshold |
Where this lives in a real system
Routing, caching and per-provider cost belong in a control point - see the multi-provider gateway reference architecture. Model the trade-offs in the Cost Calculator, pick tooling from cost tracking, and clear the cost items in the Production Checklist.