Cut model spend without weakening quality.
Most of your traffic is easy. Route each call to the cheapest model that still clears your quality floor — and keep the strong model for the requests that actually need it.
One expensive model is answering all of your traffic.
A triage classification and a legal-risk review are not the same request, but a hardcoded gpt-5.5 bills them the same. The easy 80% — routine tickets, cheap JSON, short summaries — runs on a frontier model it never needed, and the invoice is the line item nobody can explain.
The fix isn't a cheaper model everywhere — that fails the hard requests. It's matching each call to the cheapest model that passes the rules for that call. See where else the model decision layer applies →
Hardcode a strong model, or hand-write routing that rots.
Both paths end in the same place: a model decision buried in application code, with no record of why any given call went where it did.
Pin one strong model
Set model: "gpt-5.5" once and move on. Correctness is fine; the bill is not. You're paying frontier prices for traffic a $0.018 model would have handled — and you can't see how much.
Hand-write routing conditionals
An if cheap_enough then gemini else gpt-5.5 ladder spreads across the codebase. Every new model, price change, or provider outage means editing branches and redeploying — and the logic drifts out of date the week you ship it.
Filter by the quality floor, then take the cheapest survivor.
Your backend sends a small policy with each call. The router evaluates it over the live catalog in four steps — and writes down every one. The quality floor is the catalog field bench_intelligence, a 0..1 score: ["cmp","bench_intelligence","ge",0.5]. It's an explicit field you can inspect, override, or replace with your own eval scores — not a black box.
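As a rough illustration of how a rule like ["cmp","bench_intelligence","ge",0.5] could be checked against catalog entries, here is a minimal Python sketch. The passes helper, the operator table, and the dictionary shape of a catalog entry are assumptions made for this example, not the router's actual API. Only the "ge" operator appears in the policy above; the other operator names are guesses.

```python
import operator

# Hypothetical operator table: only "ge" is confirmed by the policy shown above.
OPS = {
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
    "lt": operator.lt,
    "eq": operator.eq,
}

def passes(rule, model):
    """Evaluate one ["cmp", field, op, value] rule against a catalog entry."""
    kind, field, op, value = rule
    if kind != "cmp":
        raise ValueError(f"unsupported rule kind: {kind}")
    if field not in model:
        # Assumption: a missing field cannot satisfy a hard requirement.
        return False
    return OPS[op](model[field], value)

floor = ["cmp", "bench_intelligence", "ge", 0.5]

# Scores taken from the catalog example on this page.
catalog = [
    {"id": "deepseek-v4-flash", "bench_intelligence": 0.42},
    {"id": "gemini-3.5-flash", "bench_intelligence": 0.54},
]

passing = [m["id"] for m in catalog if passes(floor, m)]
print(passing)  # ['gemini-3.5-flash']
```

Because the floor is an ordinary field comparison, swapping bench_intelligence for your own eval score would only change the field name in the rule.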
Drop what can't qualify
Remove every model below the quality floor or missing a hard requirement — tools, context, region, price ceiling. deepseek-v4-flash (score 0.42) and a no-tools mistral-small-4 fall out here.
Order by cost
Sort the survivors by what you optimize for — usually cost per run, sometimes latency or score.
Take the cheapest
The top-ranked survivor wins. For easy traffic that's gemini-3.5-flash (score 0.54 · $0.018), not the $0.063 baseline.
Advance on failure
If the pick errors or times out, move to the next survivor — same rules, no redeploy. Fallback is a runtime failure path, not a retry hack.
No model string lives in your code, so a price change or a new model updates the decision without a deploy. Fallback is policy, not retry code — see handling provider failures.
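The four steps above can be sketched in a few lines of Python. This is a simplified model of the idea, not the router's implementation: the Model class, the route function, and the trace format are hypothetical, and the values marked as placeholders are invented because the page does not state them.

```python
from dataclasses import dataclass

@dataclass
class Model:
    id: str
    score: float  # bench_intelligence, 0..1
    cost: float   # cost per run in USD
    tools: bool   # whether the model supports tool calls

CATALOG = [
    Model("gpt-5.5", 0.90, 0.063, True),            # score is a placeholder
    Model("gemini-3.5-flash", 0.54, 0.018, True),
    Model("deepseek-v4-flash", 0.42, 0.010, True),  # cost is a placeholder
    Model("mistral-small-4", 0.60, 0.012, False),   # score and cost are placeholders
]

def route(catalog, floor, need_tools, call):
    """Filter by hard rules, sort by cost, try the cheapest, advance on failure."""
    trace = []
    survivors = []
    # Step 1: drop what can't qualify, recording why.
    for m in catalog:
        if m.score < floor:
            trace.append((m.id, "rejected: below floor"))
        elif need_tools and not m.tools:
            trace.append((m.id, "rejected: no tools"))
        else:
            survivors.append(m)
    # Step 2: order the survivors by cost per run.
    survivors.sort(key=lambda m: m.cost)
    # Steps 3 and 4: take the cheapest, move to the next survivor on error.
    for m in survivors:
        try:
            result = call(m.id)
        except Exception as exc:
            trace.append((m.id, f"failed: {exc}"))
            continue
        trace.append((m.id, "won"))
        return result, trace
    raise RuntimeError("no passing model succeeded")

def healthy(model_id):
    return f"answer from {model_id}"

def flaky(model_id):
    # Simulate a provider outage on the cheapest passing model.
    if model_id == "gemini-3.5-flash":
        raise TimeoutError("provider timeout")
    return f"answer from {model_id}"

result, trace = route(CATALOG, 0.5, True, healthy)
fallback_result, fallback_trace = route(CATALOG, 0.5, True, flaky)
print(result)           # answer from gemini-3.5-flash
print(fallback_result)  # answer from gpt-5.5
```

In the second call the trace records both the gemini-3.5-flash failure and the gpt-5.5 win, which is the kind of record that lets you explain afterwards why a call went where it did.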
Preview which models pass with a dry run: /x/fields and /x/rank
A routine ticket on $0.018, a hard one on a higher floor.
Same policy, three requests. A routine ticket clears the floor cheaply; a complex one raises the floor and pulls a stronger model; a provider failure rolls to the next passing candidate. The trace records every decision and the baseline it's measured against.
Illustrative: scores and costs match the published catalog example. The 71% saving is the per-run delta from the gpt-5.5 baseline on this request. See how the trace proves every decision →
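For reference, the 71% figure follows directly from the two per-run costs quoted on this page. The short Python check below uses only those two numbers; the variable names are chosen for this example.

```python
baseline_cost = 0.063  # gpt-5.5, cost per run in USD
routed_cost = 0.018    # gemini-3.5-flash, cost per run in USD

# Per-run saving relative to the baseline.
saving = (baseline_cost - routed_cost) / baseline_cost
print(f"{saving:.0%}")  # 71%
```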
The saving is a number you can put on a slide.
Because every run records its winner, its rejected candidates, and the baseline it's measured against, spend becomes auditable instead of a mystery line item. The figures below are illustrative.
gpt-5.5 on the same traffic.
gemini-3.5-flash, stronger where the floor rises.
Wondering where the model decision should live versus a gateway or eval tool? See the comparison →
Stop overpaying for easy traffic. Route by the rule.
Send a policy with each call and let the cheapest passing model answer — over your own keys, with a trace that proves the saving. Join the waitlist for early access.