LLM Cost Control: Tokens, caching and model routing

Token spend compounds quietly until a finance review forces a panic. A production guide to unit economics, caching, model routing, right-sizing and budget alerts - before the bill, not after.

LLM cost scales with traffic and context length, per token, and compounds quietly until a finance review forces a scramble. Cost control is keeping spend predictable and proportionate to value - before the invoice, not after.

The two questions

Prototype: “Can we afford the model?” Production: “What’s cost per user, per feature and per month - and what happens to the bill at 10× traffic?”

A demo runs a handful of calls. Production runs millions, and a token count that looked trivial per request becomes a five-figure line item.

What breaks in production

The surprise bill. Spend triples after a launch and no one saw it coming.
No attribution. You can’t say which feature, team or customer drove the cost, so you can’t fix it.
Context bloat. Every request stuffs the full knowledge base into the prompt, paying for tokens the model doesn’t need.
One model for everything. A frontier model answers trivial queries a model a tenth the price would handle fine.
No alerts. You learn about a runaway from finance, not a notification.

Know your unit economics

Start with cost per request, then per active user. The math is simple - and it’s exactly what the Cost Calculator computes:

def request_cost(in_tokens, out_tokens, in_price, out_price, cache_hit=0.0):
    # prices are USD per 1M tokens; cached input reads cost ~10% of input price
    in_eff = in_tokens / 1e6 * in_price * ((1 - cache_hit) + cache_hit * 0.10)
    out_cost = out_tokens / 1e6 * out_price
    return in_eff + out_cost

# per active user / month ≈ request_cost × requests_per_user_per_day × 30

Run your real numbers through the calculator to get per-request, per-1,000-users and annual figures, and to compare a smaller model.

The three levers

Caching. Cache what repeats - system prompts, retrieved context - and pay a fraction for cached reads. A 35% cache hit rate on a context-heavy app is real money.
Routing. Send easy traffic to smaller, cheaper models; reserve the frontier model for the hard cases.
Right-sizing. Cap max output tokens and context length per request - the cheapest token is the one you don’t send.

A minimal router:

def choose_model(query, context_len):
    if context_len > 8000 or looks_complex(query):
        return "sonnet-4.6"      # frontier, for the hard cases
    return "haiku-4.5"           # cheaper, for the bulk of traffic

Any routing change affects quality, so gate it on your eval set - cheaper is only cheaper if it still passes.

Alert before the breach

Attribute spend to a team, feature or customer, and alert before a budget is hit - not after:

if month_to_date_spend(feature) > 0.8 * budget(feature):
    notify(f"{feature} at 80% of monthly budget")

The token-usage signals from observability are what make this possible - you can only attribute what you measured.

Minimal vs mature

Aspect	Minimal	Production-grade
Visibility	Total spend	Per request, per user, per feature
Caching	None	System prompt + context cached
Models	One for all	Routed by difficulty / length
Limits	None	Capped tokens + context
Budget	Reviewed quarterly	Alerted at a threshold

Where this lives in a real system

Routing, caching and per-provider cost belong in a control point - see the multi-provider gateway reference architecture. Model the trade-offs in the Cost Calculator, pick tooling from cost tracking, and clear the cost items in the Production Checklist.

LLM Cost Control: Tokens, caching and model routing

The two questions#

What breaks in production#

Know your unit economics#

The three levers#

Alert before the breach#

Minimal vs mature#

Where this lives in a real system#

How to build your first eval dataset

What to log in an LLM trace

How to calculate hallucination rate