Policy-based LLM routing is the practice of choosing which model serves each request from a set of rules you send with the call, instead of from a model name hardcoded in your application. You describe what the call requires — capabilities, a quality floor, a price ceiling, a fallback order — and a router evaluates that policy against the live catalog of available models, then routes to the cheapest model that still passes. It runs over your own provider keys, fails over when a model is unavailable, and writes a replayable, auditable trace of exactly which models it considered, why it rejected some, and which one it chose. This guide is the complete picture: what it is, why hardcoded model names quietly cost you, how a single decision is computed, and how to adopt it without rewriting your app.
In one line: policy-based routing turns "which model do I call?" from a string literal buried in your codebase into a decision that is computed, enforced, and logged at runtime — the same way authorization or feature flagging eventually earned their own layer instead of living as scattered conditionals.
What policy-based LLM routing is
Most teams write their first LLM call by naming a model, as in model: "gpt-5.5", and then ship it. That string is a decision frozen at deploy time and applied to every request afterward, regardless of how each request differs from the one you were looking at when you typed it. Policy-based routing replaces the frozen answer with a live question. You stop telling the system which model to use and start telling it what the call needs; the routing layer resolves that description against the current set of models on every request.
Three responsibilities define the pattern, and all three matter:
- Decide. Given your rules and the models that can currently serve the request, pick one. The selection is a function of your policy and the live catalog — not a value you typed once and forgot.
- Enforce. A model that violates your rules is never chosen. The router does not "prefer" the right model; it structurally excludes the wrong ones before it ranks the rest.
- Account. Every decision produces a record — the candidates, the filters that removed some of them, the ranking, the winner, the cost, the latency, and any failover hops — and that record is replayable.
The unit that carries the rules is a policy. With unhardcoded, that policy is expressed as a policy_ir: a compact, structured term your backend generates and sends in the request body. The deeper concept piece, what an LLM policy router actually is, draws the sharp line between a router that reasons about model choice and the plumbing that merely moves bytes. This guide builds on that foundation and connects every part.
Why hardcoded model names break down
A hardcoded model name is a decision you made once and can no longer see. It is fine when every call is identical and the right model never changes. In production, neither is true: your traffic is a mix of trivial classifications and genuinely hard reasoning, new models ship every few weeks, providers have outages, and finance wants to know where the money went. A static string answers none of that, and because it answers silently, the gap between the model you picked and the model a given call needed never shows up on a dashboard. It shows up on the invoice.
The dedicated argument piece, stop hardcoding model decisions, breaks down the five quiet taxes in full. In short, a hardcoded name means you overpay for the flagship model on easy traffic, every model change becomes a redeploy, there is no automatic failover when a provider rate-limits, there is no per-decision audit when someone asks why a call cost what it did, and you accumulate lock-in by inertia as model names scatter across dozens of call sites. None of these is catastrophic alone. Together they mean your system is more expensive, less reliable, and less explainable than it has to be, and you cannot see any of it, because the decision that causes it is invisible by design.
The core problem: a hardcoded model is an unconditional decision applied to conditional traffic. It can only ever be right by accident, and it is wrong without telling you.
How a routing decision is made: filter, rank, select, mutate, fallback
A policy-based router does not guess. It runs every request through the same deterministic pipeline, in the same order, every time. Understanding that order is the key to trusting the result, because the order is what turns a preference into a guarantee.
- Filter. Apply your hard constraints and drop any model that fails one — context window below a threshold, missing a required capability, above a price ceiling, in the wrong region. A failing model is removed from the candidate set, never silently substituted.
- Rank. Order the survivors by your objective — cheapest first is the common case, but you can rank by latency or another dimension.
- Select. Take the top-ranked candidate deterministically: one model, the cheapest that cleared every filter.
- Mutate. Apply adjustments to the chosen call — for example, clamp parameters to the selected model's limits — so the request is valid for the model that will actually run it.
- Fallback. If the chosen model times out or errors, move to the next passing candidate in order. Every hop is recorded; nothing is improvised.
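The five stages above can be sketched as one small, pure function. This is a minimal illustration under stated assumptions, not unhardcoded's implementation: the Model fields, the Policy shape, the route function, and the three catalog entries are all hypothetical names invented for the example.

```typescript
// Hypothetical catalog entry; every field name here is an assumption.
interface Model {
  id: string;
  contextTokens: number;
  region: string;
  priceOut: number; // output price, arbitrary units
  maxOutputTokens: number;
}

interface CallRequest {
  maxTokens: number;
}

interface Policy {
  filter: (m: Model) => boolean; // hard constraints
  score: (m: Model) => number; // higher is better
  mutate: (r: CallRequest, m: Model) => CallRequest; // fit the request to the winner
}

interface Decision {
  selected: Model;
  fallbacks: Model[]; // remaining passing candidates, in rank order
  rejected: string[]; // ids removed by the filter
  request: CallRequest;
}

function route(catalog: Model[], policy: Policy, request: CallRequest): Decision {
  // Filter: drop every model that fails a hard constraint.
  const passing = catalog.filter((m) => policy.filter(m));
  const rejected = catalog.filter((m) => !policy.filter(m)).map((m) => m.id);
  // Rank: best score first; ties are broken by id so the order is deterministic.
  const ranked = passing.slice().sort(
    (a, b) => policy.score(b) - policy.score(a) || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  );
  // Select: take the top-ranked survivor, or fail loudly if there is none.
  const selected = ranked[0];
  if (selected === undefined) {
    throw new Error("no model satisfies the policy");
  }
  // Mutate: adjust the request so it is valid for the selected model.
  const mutated = policy.mutate(request, selected);
  // Fallback: the rest of the ranked list is the failover order.
  return { selected, fallbacks: ranked.slice(1), rejected, request: mutated };
}

// Made-up catalog and policy for the support-ticket style of workload.
const catalog: Model[] = [
  { id: "model-a", contextTokens: 1_000_000, region: "eu", priceOut: 1, maxOutputTokens: 8_000 },
  { id: "model-b", contextTokens: 200_000, region: "eu", priceOut: 2, maxOutputTokens: 4_000 },
  { id: "model-c", contextTokens: 128_000, region: "us", priceOut: 10, maxOutputTokens: 16_000 },
];

const supportPolicy: Policy = {
  filter: (m) => m.contextTokens >= 200_000 && m.region === "eu" && m.priceOut <= 6,
  score: (m) => -m.priceOut, // cheapest first
  mutate: (r, m) => ({ ...r, maxTokens: Math.min(r.maxTokens, m.maxOutputTokens) }),
};

const decision = route(catalog, supportPolicy, { maxTokens: 10_000 });
```

With this made-up catalog, route selects model-a, keeps model-b as the only fallback, rejects model-c, and clamps maxTokens to 8000. The filter runs before the sort, which is what turns the constraint into a guarantee: a model that fails it never reaches the ranking step.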
Here is the pipeline applied to a real-shaped problem. Suppose you are summarizing support tickets: most are short, a few are long threads that need a big context window, you want the cheapest model that can do the job, you want a known fallback if your first choice is down, and you never want to send personally identifiable data to a model outside your region.
# The catalog at request time (illustrative)
candidates = [
    "gemini-3.5-flash",  # 1M ctx, in-region, cheap → passes
    "claude-haiku-4-5",  # in-region, cheap → passes
    "gpt-5.5",           # the old hardcoded baseline → struck
]
]
# filter: ctx >= 200k AND region == "eu" AND price <= ceiling
# rank: cheapest first
# select: best passing candidate, then failover in order
→ selected: gemini-3.5-flash
→ reason: cheapest model passing all filters
→ fallback: claude-haiku-4-5 (if primary unavailable)
→ trace: written — candidates, filters, winner, cost, latency
Three things just happened that a gateway or marketplace would not do for you. First, the router chose the model from your rules rather than from a string you typed. Second, the choice was enforced: any out-of-region or undersized model was excluded, not merely deprioritized. Third, the decision was recorded, so when a finance reviewer asks why this request cost what it did, or an incident review asks which model served a bad answer at 3 a.m., the answer is in the trace — not reconstructed from memory. When a cheaper, in-policy model launches next quarter, you do not hunt down conditionals across services: the catalog updates, your policy stays the same, and the router starts selecting the better option on its own.
The policy_ir: rules as data, sent with the call
The pipeline above is only as trustworthy as the thing that describes the rules. With unhardcoded, that thing is the policy_ir: a small, structured term — a JSON array — that the router can parse, validate, hash, and evaluate deterministically. It is not English you hope the system interprets correctly; it is a compact instruction language with a fixed shape: the "policy" tag, an evidence slot, then five working verbs that map one-to-one onto the pipeline stages.
["policy", ["ev_zero"], /* evidence */
  ...,                  /* filter */
  ...,                  /* rank */
  ...,                  /* select */
  ...,                  /* mutate */
  ...]                  /* fallback */
Because the policy is structured rather than free-form, the router can admit it before running it — rejecting unknown operations or undeclared fields instead of silently doing the wrong thing. It can also hash it, so the same rules produce the same identifier in your trace every time. And because it travels in the request body, the same endpoint can carry a different policy on the very next call; nothing is pinned to a server config your application cannot see. For the full breakdown of each position and how a single decision is evaluated, read what is a policy_ir? The anatomy of a routing decision. The policy router is the broad pattern; the policy_ir is the concrete artifact that makes the pattern enforceable and auditable.
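To make "admit before running" and "hash" concrete, here is a rough sketch of both steps. Everything in it is an assumption for illustration: the allow-list is copied from the operations that appear in this guide's example term, not from the router's real operation set; admit, canonical, and policyId are hypothetical helper names; and the FNV-1a-style hash over UTF-16 code units is a dependency-free stand-in for whatever hash the router really uses (a production system would more likely use a cryptographic hash).

```typescript
type Term = string | number | boolean | null | Term[] | { [key: string]: Term };

// Hypothetical allow-list, copied from the operations in this guide's example term.
const KNOWN_OPS: string[] = [
  "ev_zero", "and", "not", "is", "meets_req", "has_cap", "cmp",
  "neg", "normalize", "field", "argmax", "id", "always",
];

// Admit: return a list of problems; an empty list means the term may run.
function admit(term: Term): string[] {
  if (!Array.isArray(term) || term.length !== 7 || term[0] !== "policy") {
    return ["expected a 7-element array starting with \"policy\""];
  }
  const errors: string[] = [];
  const visit = (t: Term): void => {
    if (!Array.isArray(t)) return; // literals and option objects are not operations
    const head = t[0];
    if (typeof head !== "string" || KNOWN_OPS.indexOf(head) < 0) {
      errors.push("unknown operation: " + JSON.stringify(head));
    }
    t.slice(1).forEach((arg) => visit(arg));
  };
  term.slice(1).forEach((slot) => visit(slot));
  return errors;
}

// Canonical form: object keys are sorted so equal rules serialize identically.
function canonical(t: Term): string {
  if (Array.isArray(t)) return "[" + t.map((x) => canonical(x)).join(",") + "]";
  if (t !== null && typeof t === "object") {
    const keys = Object.keys(t).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(t[k] ?? null)).join(",") + "}";
  }
  return JSON.stringify(t);
}

// Stand-in hash (FNV-1a style, 32-bit). Not cryptographic; for illustration only.
function policyId(term: Term): string {
  const text = canonical(term);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

const sample: Term = [
  "policy",
  ["ev_zero"],
  ["has_cap", "tools"],
  ["neg", ["field", "price_out"]],
  ["argmax"],
  ["id"],
  ["always", { action: "next_candidate" }],
];
```

Calling admit(sample) returns an empty list, while a term containing an operation outside the allow-list comes back with one error per unknown operation instead of being evaluated. Because policyId hashes the canonical form, the same rules map to the same identifier regardless of object key order.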
Enforced, not advisory — and why that makes it auditable
This is the line that separates policy-based routing from a recommendation engine. There are two ways a system can "respect" your rules:
- Advisory: the system suggests a model and trusts the caller to follow the suggestion. Nothing stops a request from reaching a model that breaks the rule. The rule is a hint.
- Enforced: the system will not return a result from a model that violates the policy. A model that fails a filter is removed from the candidate set before ranking even begins. The rule is a guarantee.
The difference is not academic. If your policy says "context window at least 200k tokens and price under a ceiling," an advisory system might still route a long document to a model that truncates it, and you would only find out from a degraded answer. An enforced router cannot do that: the undersized model is filtered out, full stop. The constraint holds whether or not anyone is watching, on the millionth request as reliably as the first.
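The contrast can be shown in a few lines. This is a sketch with invented names (Candidate, meetsFloor, advisoryPick, enforcedPick) and a made-up two-model catalog; it only illustrates the difference in behavior, not how any particular router is built.

```typescript
interface Candidate {
  id: string;
  contextTokens: number;
  priceOut: number;
}

// The rule from the text: context window of at least 200k tokens.
const meetsFloor = (m: Candidate): boolean => m.contextTokens >= 200_000;

const cheapest = (models: Candidate[]): Candidate | undefined =>
  models.slice().sort((a, b) => a.priceOut - b.priceOut)[0];

// Advisory: the rule is a hint, so the cheapest model wins even if it breaks the rule.
function advisoryPick(models: Candidate[]): Candidate | undefined {
  return cheapest(models);
}

// Enforced: rule-breaking models are removed before ranking; no survivor means an error.
function enforcedPick(models: Candidate[]): Candidate {
  const winner = cheapest(models.filter(meetsFloor));
  if (winner === undefined) {
    throw new Error("policy unsatisfiable: no model meets the floor");
  }
  return winner;
}

const tiny: Candidate = { id: "tiny", contextTokens: 32_000, priceOut: 0.5 };
const big: Candidate = { id: "big", contextTokens: 1_000_000, priceOut: 2 };
```

Here advisoryPick([tiny, big]) returns tiny, which would truncate a long document, while enforcedPick([tiny, big]) returns big. With only tiny in the catalog, enforcedPick throws rather than downgrading silently.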
Enforcement is also why the trace is trustworthy. Because every decision passes through the same filter-rank-select machinery, the trace is not a best-effort log written after the fact — it is the actual record of the decision the router was structurally bound to make. You can replay it, hand it to a reviewer, or diff it against yesterday's behavior. That is what makes the routing auditable rather than merely logged, and it is the property that serves engineering, platform, and finance from a single record: engineering reads which rule the chosen model passed, platform verifies the floor held on every request, and finance gets per-decision cost attribution instead of one opaque provider invoice.
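As a rough sketch of what "replay" and "diff" could look like, assume a trace stores the candidate snapshot the router saw plus the winner it chose. The Trace and Snapshot shapes and the decide, replay, and diffWinners helpers below are hypothetical; the source does not specify the real trace format.

```typescript
interface Snapshot {
  id: string;
  priceOut: number;
  eligible: boolean; // whether the candidate passed every filter
}

interface Trace {
  policyId: string;
  candidates: Snapshot[];
  winner: string;
}

// The same deterministic rule the live decision is assumed to use:
// cheapest eligible candidate, ties broken by id.
function decide(candidates: Snapshot[]): string {
  const ranked = candidates
    .filter((c) => c.eligible)
    .sort((a, b) => a.priceOut - b.priceOut || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const top = ranked[0];
  if (top === undefined) throw new Error("no eligible candidate in trace");
  return top.id;
}

// Replay: recompute the decision from the recorded inputs and compare.
function replay(trace: Trace): { matches: boolean; recomputed: string } {
  const recomputed = decide(trace.candidates);
  return { matches: recomputed === trace.winner, recomputed };
}

// Diff: report a change of winner between two traces of the same policy.
function diffWinners(before: Trace, after: Trace): string | null {
  return before.winner === after.winner ? null : before.winner + " -> " + after.winner;
}

const yesterday: Trace = {
  policyId: "p1",
  candidates: [
    { id: "model-a", priceOut: 2, eligible: true },
    { id: "model-b", priceOut: 1, eligible: false },
  ],
  winner: "model-a",
};

const today: Trace = {
  policyId: "p1",
  candidates: [
    { id: "model-a", priceOut: 2, eligible: true },
    { id: "model-d", priceOut: 1, eligible: true },
  ],
  winner: "model-d",
};
```

Replaying either trace reproduces its recorded winner, and diffWinners(yesterday, today) reports the switch from model-a to model-d. The point of the sketch is that a deterministic pipeline makes the trace checkable, since anyone holding the record can recompute the outcome.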
The floor is a guarantee, not a suggestion. The router optimizes cost among the models that clear your floor, never by going around it. If no model meets the requirements, the request fails loudly; you never get a silent downgrade you did not ask for.
Your keys, per-run pricing, no token resale
A routing layer sits in a sensitive spot: between your application and your providers. Two design choices determine whether it stays aligned with your interests. The first is whose keys run the inference. With unhardcoded, you bring your own provider keys and pay your providers directly. The router never holds your inference hostage and never inserts itself as the billing party for tokens.
The second is how the layer is paid. unhardcoded is priced per run, not per token: you pay for the routing decision and its trace, not a markup on inference. That keeps incentives honest — the router is paid to route well and stay current, not to mark up the models it picks. Its interest and yours point the same way: toward the cheapest model that meets your rules. The economics are spelled out on the pricing page. This is the structural difference from a token-reselling marketplace, where the layer profits more when you spend more.
How to get started without a rewrite
Adopting policy-based routing does not mean rebuilding your application. unhardcoded is OpenAI-compatible, so the migration is small and mechanical. You point your SDK's base URL at the endpoint, set any policy:* model string — it is just a free-form label your trace history is grouped under, not a routing instruction — and attach a policy_ir to the call. The routing comes entirely from that attached term; the model string is never parsed for a model. Your messages, tools, and parameters pass through unchanged.
// generated in your backend, at request time: the raw policy_ir term
const policy = [
  "policy",
  ["ev_zero"], // evidence: reserved slot, none attached
  // filter: meets request reqs, not disabled, supports tools, under the ceiling
  ["and", ["meets_req"], ["not", ["is", "disabled"]],
    ["has_cap", "tools"], ["cmp", "price_out", "le", 6.0]],
  // rank: cheapest = negate normalized output price
  ["neg", ["normalize", ["field", "price_out"]]],
  ["argmax"], // select: the single best survivor
  ["id"], // mutate: no-op
  ["always", { action: "next_candidate" }], // fallback
];
// `client` is assumed to be an OpenAI-compatible SDK client whose base URL points at the router endpoint
const res = await client.chat.completions.create({
  model: "policy:support", // a free-form trace label, not a model name
  policy_ir: policy, // the decision, sent with the call
  messages, // unchanged
});
The raw term is the interface — a plain JSON array you can hand-write, generate, hash, and replay. (A higher-level buildPolicy(...) helper that compiles a short spec down to this term is planned convenience sugar, not a shipping package yet; the array above is what the router actually admits.) On a representative run the router lands on gemini-3.5-flash: it clears the request requirements, supports tools, and sits under the price ceiling. If it times out, the router fails over to the next passing candidate and writes the hop to the trace. The struck gpt-5.5 baseline you are leaving behind never appears in the live decision — it is only the string this replaces. For a step-by-step walkthrough of the first call, follow the 5-minute quickstart, and keep the documentation open as you go.
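The failover behavior described above, trying the next passing candidate and recording every hop, can be sketched as a short loop. This is an illustration under assumptions, not the router's code: callWithFailover, the Hop shape, and the fakeCall stub (which simulates a timeout on the first model) are hypothetical.

```typescript
interface Hop {
  model: string;
  ok: boolean;
  error?: string;
}

// Try each ranked candidate in order and record every attempt as a hop.
async function callWithFailover<T>(
  ranked: string[],
  call: (model: string) => Promise<T>,
): Promise<{ result: T; hops: Hop[] }> {
  const hops: Hop[] = [];
  for (const model of ranked) {
    try {
      const result = await call(model);
      hops.push({ model, ok: true });
      return { result, hops };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      hops.push({ model, ok: false, error: message });
    }
  }
  // Every passing candidate failed: surface the error together with the hop record.
  throw new Error("all candidates failed: " + JSON.stringify(hops));
}

// Stub provider call: the first model times out, any other model answers.
const fakeCall = async (model: string): Promise<string> => {
  if (model === "model-a") throw new Error("timeout");
  return "ok from " + model;
};

const demo = callWithFailover(["model-a", "model-b"], fakeCall);
```

In this stubbed run, demo resolves with the answer from model-b and two hops: a failed attempt on model-a with the error "timeout", then a successful one on model-b. The candidate list is fixed before the loop starts, so failover never reaches a model that did not pass the filter.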
How it relates to gateways, marketplaces, and proxies
"Router" is an overloaded word. Policy-based routing is a higher layer of abstraction that can sit on top of the plumbing you already know — it is not a competitor to it. The clean test is whether a component reasons about which model should serve the request or just moves bytes: if you still have to name the model, you have a gateway, a proxy, a balancer, or a marketplace; if you describe the rules the model must satisfy and let the system choose, you have a policy router.
- API gateways and proxies forward your request and add cross-cutting concerns — auth, rate limiting, retries, caching, logging. They answer "how do I reach this endpoint reliably?" You still name the model.
- Load balancers spread traffic across interchangeable backends. Their core assumption is that the targets are equivalent. Models are not equivalent — they differ in price, context window, capability, latency, and provider — so a policy router selects on those differences instead of treating them as noise.
- Model marketplaces and aggregators give you one key and one bill across many providers, usually by reselling tokens at a markup. That solves access and billing, not the decision: you still pick the model and still pay a per-token margin.
For a tool-by-tool look at where each option puts the routing decision and what that means for your bill, read the comparison of unhardcoded versus Portkey, LiteLLM, and OpenRouter. And for where this is all heading — every LLM call eventually carrying its own portable, enforceable policy — see the runtime policy layer vision. The pattern is open at its core: the policy_ir is a format you can read and reason about, and the reference pieces are part of our open-source work.
Bottom line: a hardcoded model is a silent, unconditional bet on one provider for all of your traffic. A policy is a written, conditional rule the router enforces and records on every call — cheaper, more reliable, and finally auditable, for the cost of changing one line.