use case · cut model spend

Cut model spend without weakening quality.

Most of your traffic is easy. Route each call to the cheapest model that still clears your quality floor — and keep the strong model for the requests that actually need it.

the problem

One expensive model is answering all of your traffic.

A triage classification and a legal-risk review are not the same request, but a hardcoded gpt-5.5 bills them the same. The easy 80% — routine tickets, cheap JSON, short summaries — runs on a frontier model it never needed, and the invoice is the line item nobody can explain.

The fix isn't a cheaper model everywhere — that fails the hard requests. It's matching each call to the cheapest model that passes the rules for that call. See where else the model decision layer applies →

the old way

Hardcode a strong model, or hand-write routing that rots.

Both paths end in the same place: a model decision buried in application code, with no record of why any given call went where it did.

Pin one strong model

Set model: "gpt-5.5" once and move on. Correctness is fine; the bill is not. You're paying frontier prices for traffic a $0.018 model would have handled — and you can't see how much.

Hand-write routing conditionals

An if cheap_enough then gemini else gpt-5.5 ladder spreads across the codebase. Every new model, price change, or provider outage means editing branches and redeploying — and the logic drifts out of date the week you ship it.

the unhardcoded way

Filter by the quality floor, then take the cheapest survivor.

Your backend sends a small policy with each call. The router evaluates it over the live catalog in four steps — and writes down every one. The quality floor is the catalog field bench_intelligence, a 0..1 score: ["cmp","bench_intelligence","ge",0.5]. It's an explicit field you can inspect, override, or replace with your own eval scores — not a black box.

filter

Drop what can't qualify

Remove every model below the quality floor or missing a hard requirement — tools, context, region, price ceiling. deepseek-v4-flash (score 0.42) and a no-tools mistral-small-4 fall out here.

rank

Order by cost

Sort the survivors by what you optimize for — usually cost per run, sometimes latency or score.

select

Take the cheapest

The top-ranked survivor wins. For easy traffic that's gemini-3.5-flash (score 0.54 · $0.018), not the $0.063 baseline.

fallback

Advance on failure

If the pick errors or times out, move to the next survivor — same rules, no redeploy. Fallback is a runtime failure path, not a retry hack.

No model string lives in your code, so a price change or a new model updates the decision without a deploy. Fallback is policy, not retry code — see handling provider failures.

Preview which models pass with a dry-run — /x/fields and /x/rank

example · support workload

A routine ticket on $0.018, a hard one on a higher floor.

Same policy, three requests. A routine ticket clears the floor cheaply; a complex one raises the floor and pulls a stronger model; a provider failure rolls to the next passing candidate. The trace records every decision and the baseline it's measured against.

trace · req_8f41c2 200 OK

routine ticketgemini-3.5-flash · score 0.54 · $0.018

reasonfloor ge 0.5 ✓, tools ✓, cheapest survivor

complex ticketclaude-sonnet-4-6 · cleared higher floor ge 0.55 · $0.041

on failureclaude-sonnet-4-6 → gemini-3.5-flash · 504 timed out · next passing candidate

cost$0.018 · ↓71% vs gpt-5.5 baseline ($0.063)

policysupport · fingerprint pol_8f41c2 · sigma-pol/v1

illustrative Scores and costs match the published catalog example. The 71% saving is the per-run delta from the gpt-5.5 baseline on this request. See how the trace proves every decision →

what your dashboard shows

The saving is a number you can put on a slide.

Because every run records its winner, its rejected candidates, and the baseline it's measured against, spend becomes auditable instead of a mystery line item. The figures below are illustrative.

$0.063Baseline cost per run — a hardcoded gpt-5.5 on the same traffic.

$0.021Blended selected-model cost — mostly gemini-3.5-flash, stronger where the floor rises.

↓67%Savings — the delta between baseline and what the policy actually selected.

82%Frontier calls avoided — requests that cleared the floor without a frontier model.

3,140Models rejected by rule — each with the rule that filtered it: floor, tools, region, price.

0No-pass requests — when nothing clears the floor it fails loud, never a silent downgrade.

Dry-run your first policy Join the waitlist

Wondering where the model decision should live versus a gateway or eval tool? See the comparison →

Stop overpaying for easy traffic. Route by the rule.

Send a policy with each call and let the cheapest passing model answer — over your own keys, with a trace that proves the saving. Join the waitlist for early access.

Read the docs

No SDK rewriteYour provider keysPriced per runEvery run traced