← All articles
Cost control

LLM Cost Control: Tokens, caching and model routing

Token spend compounds quietly until a finance review forces a panic. A production guide to unit economics, caching, model routing, right-sizing and budget alerts - before the bill, not after.

May 29, 2026 · updated June 8, 2026 · 9 min read · cost · caching · routing

LLM cost scales with traffic and context length, per token, and compounds quietly until a finance review forces a scramble. Cost control is keeping spend predictable and proportionate to value - before the invoice, not after.

The two questions

Prototype: “Can we afford the model?” Production: “What’s cost per user, per feature and per month - and what happens to the bill at 10× traffic?”

A demo runs a handful of calls. Production runs millions, and a token count that looked trivial per request becomes a five-figure line item.

What breaks in production

  • The surprise bill. Spend triples after a launch and no one saw it coming.
  • No attribution. You can’t say which feature, team or customer drove the cost, so you can’t fix it.
  • Context bloat. Every request stuffs the full knowledge base into the prompt, paying for tokens the model doesn’t need.
  • One model for everything. A frontier model answers trivial queries a model a tenth the price would handle fine.
  • No alerts. You learn about a runaway from finance, not a notification.

Know your unit economics

Start with cost per request, then per active user. The math is simple - and it’s exactly what the Cost Calculator computes:

def request_cost(in_tokens, out_tokens, in_price, out_price, cache_hit=0.0):
    # prices are USD per 1M tokens; cached input reads cost ~10% of input price
    in_eff = in_tokens / 1e6 * in_price * ((1 - cache_hit) + cache_hit * 0.10)
    out_cost = out_tokens / 1e6 * out_price
    return in_eff + out_cost

# per active user / month ≈ request_cost × requests_per_user_per_day × 30

Run your real numbers through the calculator to get per-request, per-1,000-users and annual figures, and to compare a smaller model.

The three levers

  • Caching. Cache what repeats - system prompts, retrieved context - and pay a fraction for cached reads. A 35% cache hit rate on a context-heavy app is real money.
  • Routing. Send easy traffic to smaller, cheaper models; reserve the frontier model for the hard cases.
  • Right-sizing. Cap max output tokens and context length per request - the cheapest token is the one you don’t send.

A minimal router:

def choose_model(query, context_len):
    if context_len > 8000 or looks_complex(query):
        return "sonnet-4.6"      # frontier, for the hard cases
    return "haiku-4.5"           # cheaper, for the bulk of traffic

Any routing change affects quality, so gate it on your eval set - cheaper is only cheaper if it still passes.

Alert before the breach

Attribute spend to a team, feature or customer, and alert before a budget is hit - not after:

if month_to_date_spend(feature) > 0.8 * budget(feature):
    notify(f"{feature} at 80% of monthly budget")

The token-usage signals from observability are what make this possible - you can only attribute what you measured.

Minimal vs mature

AspectMinimalProduction-grade
VisibilityTotal spendPer request, per user, per feature
CachingNoneSystem prompt + context cached
ModelsOne for allRouted by difficulty / length
LimitsNoneCapped tokens + context
BudgetReviewed quarterlyAlerted at a threshold

Where this lives in a real system

Routing, caching and per-provider cost belong in a control point - see the multi-provider gateway reference architecture. Model the trade-offs in the Cost Calculator, pick tooling from cost tracking, and clear the cost items in the Production Checklist.

Get the Production Checklist Explore the Stack