Fallback is policy, not retry code.
Providers fail, rate-limit, and spike latency. When yours does, unhardcoded routes to the next model that still passes your policy — cheapest-first, no redeploy, and the user never sees it.
One pinned model is one point of failure.
Providers go down, return 429s under load, degrade, or quietly get slow. If your app hardcodes a single model, every one of those events is a user-facing outage — and the only lever you have is a deploy.
Outages
A provider returns 500s or refuses the request entirely. Your one model is the whole answer path.
Rate limits
Under load you hit a 429 ceiling. The model is fine — you just can't reach it right now.
Degradation
Quality or availability dips for a region or a time window. Often nothing errors outright; responses just get worse.
Latency spikes
The call hangs until it times out with a 504. To the user, that is indistinguishable from a crash.
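Most of these failure modes show up as signals a router can act on. The sketch below assumes an HTTP-style provider response and decides which failures should send a request to another model. The function name and the exact grouping of status codes are illustrative assumptions, not unhardcoded's rules.

```python
from typing import Optional


def should_fall_back(status: Optional[int], timed_out: bool = False) -> bool:
    """Decide whether a failed call should advance to another model.

    Hypothetical rule set for illustration; it is not unhardcoded's logic.
    """
    if timed_out:
        # A hung call (e.g. one that ends in a 504) looks like a crash to the user.
        return True
    if status is None:
        # No response at all: connection refused, reset, or DNS failure.
        return True
    if status == 429:
        # Rate limited: the model is fine, it just can't be reached right now.
        return True
    # 5xx means the provider failed, so another model may still succeed.
    # Other 4xx codes usually mean the request itself is wrong, so moving
    # it to a different model is unlikely to help.
    return 500 <= status <= 599
```

Degradation is the one mode a check like this cannot catch, because nothing errors.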
Retry loops and fallback logic scattered across the app.
The usual fix is a try/except around the call, a hardcoded "if the primary fails, try this other model" branch, and a backoff loop bolted on next to it. It works until it doesn't.
It rots
The backup model name is hardcoded in a different file from the primary. When the catalog changes, the fallback branch is the one nobody updates.
It downgrades blindly
A naive fallback can drop you onto a cheaper model that no longer meets your quality bar — and nothing in the code knows that bar exists.
It's invisible
When a fallback fires, you find out from a customer. There's no record of what failed, what was tried next, or why.
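For contrast, this is roughly what the hand-wired version looks like. It is a hypothetical sketch: call_model, ProviderError, and both model names are stand-ins, not a real SDK.

```python
import time

PRIMARY = "model-a"
BACKUP = "model-b"  # hardcoded apart from any quality bar; nothing checks it


class ProviderError(Exception):
    pass


def call_model(model: str, prompt: str) -> str:
    # Stand-in for a provider call. The primary is "down" in this sketch.
    if model == PRIMARY:
        raise ProviderError(f"{model}: 503")
    return f"{model} answered: {prompt}"


def ask(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return call_model(PRIMARY, prompt)
        except ProviderError:
            time.sleep(0.01 * 2 ** attempt)  # exponential backoff bolted on
    # Silent downgrade: no quality check, no record of what failed or why.
    return call_model(BACKUP, prompt)
```

All three problems are visible here. BACKUP lives apart from the primary and any quality bar, nothing checks whether it is good enough, and the caller gets an answer with no record of the three failed attempts.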
Fall back to the next model that still passes the policy.
Your policy already defines a set of survivors — every model that clears the hard requirements, ranked by cost. Fallback isn't a separate code path; it's the next survivor in that same list. Same rules, no redeploy, decided at request time.
filter → rank → select
The router drops models that fail a hard requirement — context, tools, region, price ceiling, or the quality floor — then ranks the survivors by cost and selects the cheapest.
fallback advances in order
If the pick errors or times out, the router advances to the next-cheapest survivor — a model the same policy already approved. No new model is invented mid-request.
per step in workflows
In a workflow, fallback is per node. If one step's model fails, that step falls back without restarting the graph — and the whole run still writes one stitched trace.
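The filter, rank, and select steps fit in a few lines. This is a minimal illustration, not unhardcoded's implementation. The Model and Policy fields mirror the hard requirements named above (context, tools, region, price ceiling, quality floor). Every class and function name, the price unit, and the use of an exception for provider failures are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Model:
    name: str
    context: int               # context window, in tokens
    tools: bool                # supports tool calls
    region: str
    price: float               # cost per 1M tokens (hypothetical unit)
    bench_intelligence: float  # quality score, 0..1


@dataclass(frozen=True)
class Policy:
    min_context: int
    needs_tools: bool
    region: str
    max_price: float
    quality_floor: float


class ProviderError(Exception):
    """Raised by a (hypothetical) provider call on a 429, a 5xx, or a timeout."""


class NoEligibleModel(Exception):
    """Every survivor failed, or nothing passed the policy: fail loud."""


def survivors(catalog: List[Model], p: Policy) -> List[Model]:
    """Filter on hard requirements, then rank the survivors cheapest-first."""
    passing = [
        m for m in catalog
        if m.context >= p.min_context
        and (m.tools or not p.needs_tools)
        and m.region == p.region
        and m.price <= p.max_price
        and m.bench_intelligence >= p.quality_floor
    ]
    return sorted(passing, key=lambda m: m.price)


def route(catalog: List[Model], p: Policy, call: Callable[[Model], str]) -> str:
    """Select the cheapest survivor; on failure, advance to the next one."""
    errors = []
    for m in survivors(catalog, p):
        try:
            return call(m)
        except ProviderError as e:
            # Record the failure and move on; only models from the list are tried.
            errors.append(f"{m.name}: {e}")
    raise NoEligibleModel("; ".join(errors) or "no model passes the policy")
```

In a workflow, each node could call route with its own policy. A failing step then falls back on its own without re-running the steps before it.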
The primary timed out. The user never knew.
Every run writes a receipt. When a fallback fires, the trace shows exactly what failed, what was tried next, and why the replacement was eligible — so a 504 becomes a line in the record instead of a support ticket. Figures below are illustrative.
The replacement isn't a guess: it's the next survivor the same policy already approved.
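A receipt needs only a few fields to answer what failed, what was tried next, and why. The sketch below assumes one record per hop. The field names are hypothetical and are not unhardcoded's trace schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    model: str
    outcome: str           # "ok" or the error, e.g. "504 timeout"
    eligible_because: str  # why the policy allowed this model


@dataclass
class Receipt:
    request_id: str
    hops: List[Hop] = field(default_factory=list)

    def record(self, model: str, outcome: str, eligible_because: str) -> None:
        self.hops.append(Hop(model, outcome, eligible_because))

    @property
    def fell_back(self) -> bool:
        # More than one hop means at least one model failed first.
        return len(self.hops) > 1

    def summary(self) -> str:
        return " -> ".join(f"{h.model} ({h.outcome})" for h in self.hops)


# Illustrative only: made-up request ID, model names, and reasons.
r = Receipt("req-123")
r.record("model-a", "504 timeout", "cheapest survivor")
r.record("model-b", "ok", "next survivor; clears the quality floor")
print(r.summary())  # prints: model-a (504 timeout) -> model-b (ok)
```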
Fallback never silently drops below the floor.
Every fallback hop stays inside the same filter → rank → select → fallback pipeline as the original call, so a replacement must clear the quality floor (the catalog field bench_intelligence, a 0..1 score) just like the primary did. A cheaper-but-weaker model such as deepseek-v4-flash, if it scores below your floor, stays filtered out even in an outage.
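The difference between a blind fallback and a policy fallback is where the floor check sits. This minimal sketch uses an invented three-model catalog. The names, prices, and scores are made up; only the 0..1 bench_intelligence range comes from the text above. The failed set stands in for the models that have already failed this request.

```python
from typing import Dict, Set, Tuple

# Hypothetical catalog: name -> (price per 1M tokens, bench_intelligence).
CATALOG: Dict[str, Tuple[float, float]] = {
    "cheap-weak": (0.05, 0.40),
    "primary": (0.50, 0.72),
    "backup": (0.90, 0.75),
}


def naive_next(failed: Set[str]) -> str:
    """Blind fallback: the cheapest model that hasn't failed, floor ignored."""
    healthy = [n for n in CATALOG if n not in failed]
    return min(healthy, key=lambda n: CATALOG[n][0])


def policy_next(failed: Set[str], floor: float) -> str:
    """Policy fallback: the floor filter applies on every hop, not just the first."""
    ranked = sorted(
        (n for n, (_, quality) in CATALOG.items() if quality >= floor),
        key=lambda n: CATALOG[n][0],
    )
    for name in ranked:
        if name not in failed:
            return name
    raise RuntimeError("no survivor above the floor is left; failing loud")
```

With primary failed, naive_next lands on cheap-weak, while policy_next with a floor of 0.6 moves to backup. If backup fails too, the policy version raises an error instead of answering with cheap-weak.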
If every survivor fails, or no model clears the hard requirements in the first place, the request fails loudly rather than quietly answering with something worse. You can preview exactly which models would survive an incident with a dry run before any real traffic depends on it: GET /x/fields returns the live field vocabulary, and POST /x/rank returns the ranked survivors. Read the dry-run docs to try it.
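The dry run itself goes through GET /x/fields and POST /x/rank. Their request and response shapes aren't shown here, so rather than guess at them, this sketch runs the same kind of what-if locally. Given a survivor list the policy has already ranked, it shows which fallback order remains when a provider goes down. The list and every name in it are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical survivor list, already filtered by the policy and ranked
# cheapest-first, as (model, provider) pairs. All names are invented.
SURVIVORS: List[Tuple[str, str]] = [
    ("model-a", "provider-x"),
    ("model-b", "provider-y"),
    ("model-c", "provider-x"),
]


def preview_incident(down: Set[str]) -> List[str]:
    """Fallback order that would remain if every provider in `down` failed."""
    return [model for model, provider in SURVIVORS if provider not in down]


for incident in [{"provider-x"}, {"provider-x", "provider-y"}]:
    chain = preview_incident(incident)
    # An empty chain means the request would fail loud during this incident.
    print(sorted(incident), "->", chain or "fail loud")
```

Losing provider-x leaves only model-b. Losing both providers leaves nothing, so that request would fail loudly.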
Make the outage a line in the trace.
Stop hand-wiring retry loops and stale backup-model branches. Let your policy carry the fallback order, keep each hop above the quality floor, and prove what happened on every run.