Policy-based LLM routing is the practice of choosing which model serves each request from a set of rules you send with the call, instead of from a model name hardcoded in your application. You describe what the call requires — capabilities, a quality floor, a price ceiling, a fallback order — and a router evaluates that policy against the live catalog of available models, then routes to the cheapest model that still passes. It runs over your own provider keys, fails over when a model is unavailable, and writes a replayable, auditable trace of exactly which models it considered, why it rejected some, and which one it chose. This guide is the complete picture: what it is, why hardcoded model names quietly cost you, how a single decision is computed, and how to adopt it without rewriting your app.
Most of what follows already has a home: the concept, the term, the first call, the argument against hardcoding. This guide's job is to put them in one frame — to show how policy, routing, fallback, trace, and workflows are not five features but one system, each link in the chain implied by the one before it. We recap each piece in a sentence and point you to the post that owns it; the connective tissue is what lives here.
What policy-based LLM routing is
Most teams write their first LLM call by naming a model, for example model: "gpt-5.5", and then ship it. That string is a decision frozen at deploy time and applied to every request afterward, regardless of how each request differs from the one you were looking at when you typed it. Policy-based routing replaces the frozen answer with a live question. You stop telling the system which model and start telling it what the call needs; the routing layer resolves that description against the current set of models on every request.
A policy router carries three responsibilities — it decides which model serves the request, enforces your rules by excluding violators rather than merely deprioritizing them, and accounts for the choice in a replayable record. The concept piece, what an LLM policy router actually is, defines the category in full — including the clean test that separates a router from the plumbing that just moves bytes. This guide takes that definition as given and connects it to everything downstream.
The unit that carries the rules is a policy. With unhardcoded, that policy is expressed as a policy_ir: a compact, structured term your backend generates and sends in the request body — the artifact every later section builds on.
Why hardcoded model names break down
A hardcoded model name is a decision you made once and can no longer see. It is fine when every call is identical and the right model never changes. In production, neither is true: your traffic is a mix of trivial classifications and genuinely hard reasoning, new models ship every few weeks, providers have outages, and finance wants to know where the money went. A static string answers none of that, and because it answers silently, the gap between the model you picked and the model a given call needed never shows up on a dashboard. It shows up on the invoice.
The dedicated argument piece, stop hardcoding model decisions, tallies the five quiet taxes a frozen string charges you — overpaying on easy traffic, redeploys for every model change, no automatic failover, no per-decision audit, and creeping lock-in — and walks the before/after of removing them. The throughline for this guide is what each tax has in common: they are all symptoms of a single decision made invisible at deploy time, which is exactly the decision a policy router pulls back into the open.
How a routing decision is made: filter, rank, select, mutate, fallback
A policy-based router does not guess. It runs every request through the same deterministic pipeline, in the same order, every time: filter the catalog against your hard constraints, rank the survivors by your objective, select the top one, mutate the call to fit the chosen model, and fall back in order if that model fails. That fixed order is what turns a preference into a guarantee, and it is the same five-verb pipeline the policy_ir encodes one-to-one. For the verb-by-verb breakdown of what each stage does and how a single decision is evaluated, see what is a policy_ir?
The point worth making here is what that pipeline buys you that the plumbing below it cannot. Take a real-shaped case — summarizing support tickets, where a few threads need a large context window, you want the cheapest model that clears the bar, and PII must never leave your region. The router filters to in-region models above the context floor, ranks them by price, and lands on the cheapest survivor, with a known fallback if it is down; the router concept post walks the full worked catalog. Three things happen there that a gateway or marketplace would not do for you. First, the router chose the model from your rules rather than from a string you typed. Second, the choice was enforced: any out-of-region or undersized model was excluded, not merely deprioritized. Third, the decision was recorded, so when a finance reviewer asks why this request cost what it did, or an incident review asks which model served a bad answer at 3 a.m., the answer is in the trace — not reconstructed from memory. When a cheaper, in-policy model launches next quarter, you do not hunt down conditionals across services: the catalog updates, your policy stays the same, and the router starts selecting the better option on its own.
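The support-ticket case above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the router's implementation: the Model fields, the rejection reason strings, the decide function, and the four-entry catalog are all hypothetical, and the mutate stage is treated as a no-op.

```typescript
// Hypothetical catalog entry; field names are illustrative, not the router's schema.
interface Model {
  id: string;
  region: string;
  contextWindow: number; // tokens
  priceOut: number;      // output price, arbitrary units
  available: boolean;    // stands in for a live health check
}

interface Rejection { id: string; reason: string }

interface Decision {
  chosen: Model;
  fallbacks: Model[];    // remaining available survivors, in rank order
  rejected: Rejection[]; // why each excluded model was excluded
}

function decide(catalog: Model[], region: string, minContext: number): Decision {
  const rejected: Rejection[] = [];

  // filter: hard constraints remove violators outright
  const survivors = catalog.filter((m) => {
    if (m.region !== region) {
      rejected.push({ id: m.id, reason: "out_of_region" });
      return false;
    }
    if (m.contextWindow < minContext) {
      rejected.push({ id: m.id, reason: "context_below_floor" });
      return false;
    }
    return true;
  });

  // rank: cheapest first, ties broken by id so the order is deterministic
  const ranked = [...survivors].sort(
    (a, b) => a.priceOut - b.priceOut || a.id.localeCompare(b.id),
  );

  // select + fallback: take the first ranked model that is up
  // (mutate is treated as a no-op in this sketch)
  const [chosen, ...fallbacks] = ranked.filter((m) => m.available);
  if (chosen === undefined) {
    throw new Error("no in-policy model is available");
  }
  return { chosen, fallbacks, rejected };
}

// Made-up catalog for illustration only.
const catalog: Model[] = [
  { id: "a-large", region: "eu", contextWindow: 200000, priceOut: 10, available: true },
  { id: "b-small", region: "eu", contextWindow: 32000, priceOut: 1, available: true },
  { id: "c-mid", region: "eu", contextWindow: 128000, priceOut: 4, available: true },
  { id: "d-us", region: "us", contextWindow: 200000, priceOut: 2, available: true },
];

const decision = decide(catalog, "eu", 100000);
console.log(decision.chosen.id, decision.rejected);
```

With this catalog the call lands on c-mid: the two cheaper models are excluded (one for context, one for region) rather than ranked lower, and a-large remains as the fallback.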
The policy_ir: rules as data, sent with the call
The pipeline above is only as trustworthy as the thing that describes the rules. With unhardcoded, that thing is the policy_ir: a small, structured term — a JSON array — that the router can parse, validate, fingerprint, and evaluate deterministically. Its fixed shape (the "policy" tag, an evidence slot, then the five working verbs) and the skeleton that lays out each position belong to what is a policy_ir? The anatomy of a routing decision — read that for the term itself.
What matters for the system as a whole is the three properties that structure unlocks. Because the policy is data rather than free-form English, the router can admit it before running it — rejecting unknown operations or undeclared fields instead of silently doing the wrong thing. It can fingerprint it, so the same rules produce the same identifier in your trace every time. And because it travels in the request body, the same endpoint can carry a different policy on the very next call; nothing is pinned to a server config your application cannot see. Those three are the bridge between the abstract pattern and the enforceability and auditability the next sections lean on: the policy router is the broad idea, the policy_ir is the concrete artifact that makes it hold.
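To make "admit" and "fingerprint" concrete, here is a minimal sketch assuming a few things the guide does not specify. The operation allowlist is simply the set of names that appear in this guide's own example term, the seven-position shape check follows the description above (tag, evidence slot, five verbs), and the hash is FNV-1a used as a dependency-free stand-in; a production system would presumably use a cryptographic digest. The names Term, KNOWN_OPS, admit, canonicalize, and fingerprint are hypothetical.

```typescript
type Term = string | number | boolean | null | Term[] | { [key: string]: Term };

// Assumed allowlist: only the operations that appear in this guide's example.
const KNOWN_OPS = new Set([
  "policy", "ev_zero", "and", "not", "meets_req", "is", "has_cap",
  "cmp", "neg", "normalize", "field", "argmax", "id", "always",
]);

// Admit: collect every problem instead of evaluating a term we do not understand.
function admit(term: Term): string[] {
  const errors: string[] = [];
  if (!Array.isArray(term) || term[0] !== "policy" || term.length !== 7) {
    errors.push("$: expected [\"policy\", evidence, filter, rank, select, mutate, fallback]");
  }
  const walk = (t: Term, path: string): void => {
    if (!Array.isArray(t)) return; // plain arguments such as "price_out" or 6.0
    const [head, ...args] = t;
    if (typeof head !== "string" || !KNOWN_OPS.has(head)) {
      errors.push(`${path}: unknown operation ${JSON.stringify(head)}`);
    }
    args.forEach((arg, i) => walk(arg, `${path}[${i + 1}]`));
  };
  walk(term, "$");
  return errors;
}

// Canonical text form: object keys sorted, so key order cannot change the result.
function canonicalize(t: Term): string {
  if (Array.isArray(t)) return `[${t.map(canonicalize).join(",")}]`;
  if (t !== null && typeof t === "object") {
    const entries = Object.entries(t).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(",")}}`;
  }
  return JSON.stringify(t);
}

// Fingerprint: 32-bit FNV-1a over the canonical form's UTF-16 code units.
function fingerprint(t: Term): string {
  const text = canonicalize(t);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

const examplePolicy: Term = [
  "policy",
  ["ev_zero"],
  ["and", ["meets_req"], ["not", ["is", "disabled"]],
    ["has_cap", "supports_tools"], ["cmp", "price_out", "le", 6.0]],
  ["neg", ["normalize", ["field", "price_out"]]],
  ["argmax"],
  ["id"],
  ["always", { action: "next_candidate" }],
];

console.log(admit(examplePolicy), fingerprint(examplePolicy));
```

The point of the design is that both steps run before any model is called: a term with an unrecognized operation is rejected up front, and two terms that differ only in object key order map to the same identifier.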
Enforced, not advisory — and why that makes it auditable
The line that separates policy-based routing from a recommendation engine is enforcement. An advisory system suggests a model and trusts the caller to comply, so the rule is only a hint. An enforced one removes any violating model from the candidate set before ranking begins, so the rule is a guarantee that holds whether or not anyone is watching. The router concept post draws out that distinction, and the quickstart shows the floor holding live, where a request with no passing model fails loudly instead of silently downgrading. Here the question is what enforcement makes possible everywhere else.
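The difference between a hint and a guarantee is easiest to see side by side. The sketch below is illustrative only: the Candidate shape, the numeric quality score, and both function names are assumptions made for this example, not part of the router.

```typescript
// Hypothetical candidate shape; "quality" is an assumed numeric score.
interface Candidate { id: string; quality: number; priceOut: number }

const byPrice = (a: Candidate, b: Candidate): number => a.priceOut - b.priceOut;

// Advisory: the floor only reorders candidates, so a violator can still be returned.
function pickAdvisory(candidates: Candidate[], floor: number): Candidate | undefined {
  const passing = candidates.filter((c) => c.quality >= floor).sort(byPrice);
  const failing = candidates.filter((c) => c.quality < floor).sort(byPrice);
  return [...passing, ...failing][0];
}

// Enforced: violators are removed before ranking, and an empty set is an error.
function pickEnforced(candidates: Candidate[], floor: number): Candidate {
  const [best] = candidates.filter((c) => c.quality >= floor).sort(byPrice);
  if (best === undefined) {
    throw new Error(`no candidate meets quality floor ${floor}`);
  }
  return best;
}

// Made-up candidates for illustration.
const candidates: Candidate[] = [
  { id: "cheap", quality: 0.5, priceOut: 1 },
  { id: "good", quality: 0.9, priceOut: 5 },
];

console.log(pickAdvisory(candidates, 0.95)?.id); // a violator is returned without complaint
try {
  pickEnforced(candidates, 0.95);
} catch (err) {
  console.log((err as Error).message); // the same request fails loudly
}
```

When at least one candidate clears the floor, both functions agree. They diverge only when nothing passes, which is exactly the case where a silent downgrade would otherwise go unnoticed.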
The answer is the trace. Because every decision passes through the same filter-rank-select machinery, the trace is not a best-effort log written after the fact — it is the actual record of the decision the router was structurally bound to make. You can replay it, hand it to a reviewer, or diff it against yesterday's behavior. That is what makes the routing auditable rather than merely logged, and it is the property that serves engineering, platform, and finance from a single record: engineering reads which rule the chosen model passed, platform verifies the floor held on every request, and finance gets per-decision cost attribution instead of one opaque provider invoice. Enforcement is what makes that one record load-bearing for all three.
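One way to picture a replayable trace is as a record that carries everything the decision depended on, so that re-running the same deterministic logic over it must reproduce the same outcome. The sketch below assumes a hypothetical Trace shape with a catalog snapshot and a single price ceiling; the real trace format is not described in this guide, and every field and function name here is invented for illustration.

```typescript
// Hypothetical shapes; the actual trace format is not specified here.
interface Entry { id: string; priceOut: number }

interface Trace {
  policyFingerprint: string;                  // identifier of the rules that were applied
  priceCeiling: number;                       // the one rule this sketch models
  snapshot: Entry[];                          // catalog as seen at decision time
  rejected: { id: string; reason: string }[]; // models excluded, with the reason
  chosen: string;                             // id of the model that served the call
}

function decideWithTrace(snapshot: Entry[], priceCeiling: number, policyFingerprint: string): Trace {
  const rejected = snapshot
    .filter((e) => e.priceOut > priceCeiling)
    .map((e) => ({ id: e.id, reason: "over_price_ceiling" }));
  const [best] = snapshot
    .filter((e) => e.priceOut <= priceCeiling)
    .sort((a, b) => a.priceOut - b.priceOut || a.id.localeCompare(b.id));
  if (best === undefined) {
    throw new Error("no model under the price ceiling");
  }
  return { policyFingerprint, priceCeiling, snapshot, rejected, chosen: best.id };
}

// Replay: recompute the decision from the recorded inputs and compare outcomes.
function replay(trace: Trace): boolean {
  const again = decideWithTrace(trace.snapshot, trace.priceCeiling, trace.policyFingerprint);
  return (
    again.chosen === trace.chosen &&
    JSON.stringify(again.rejected) === JSON.stringify(trace.rejected)
  );
}

const trace = decideWithTrace(
  [
    { id: "a", priceOut: 8 },
    { id: "b", priceOut: 3 },
    { id: "c", priceOut: 5 },
  ],
  6,
  "fp-demo", // placeholder fingerprint
);

console.log(trace.chosen, trace.rejected, replay(trace));
```

A record that fails replay, for example one whose chosen field was edited after the fact, no longer matches what the rules would have produced. That check is what separates an auditable trace from an ordinary log line.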
Your keys, per-run pricing, no token resale
A routing layer sits in a sensitive spot: between your application and your providers. Two design choices determine whether it stays aligned with your interests. The first is whose keys run the inference. With unhardcoded, you bring your own provider keys and pay your providers directly. The router never holds your inference hostage and never inserts itself as the billing party for tokens.
The second is how the layer is paid. unhardcoded is priced per run, not per token: you pay for the routing decision and its trace, not a per-token margin. That keeps incentives honest — the router is paid to route well and stay current, not to profit from the models it picks. Its interest and yours point the same way: toward the cheapest model that meets your rules. The economics are spelled out on the pricing page. This is the structural difference from a token-reselling marketplace, where the layer profits more when you spend more.
How to get started without a rewrite
Adopting policy-based routing does not mean rebuilding your application. unhardcoded is OpenAI-compatible, so the migration is small and mechanical. You point your SDK's base URL at the endpoint, set any policy:* model string — it is just a free-form label your trace history is grouped under, not a routing instruction — and attach a policy_ir to the call. The routing comes entirely from that attached term; the model string is never parsed for a model. Your messages, tools, and parameters pass through unchanged.
// generated in your backend, at request time: the raw policy_ir term
const policy = [
  "policy",
  ["ev_zero"], // evidence: reserved slot, none attached
  // filter: meets request reqs, not disabled, supports tools, under the ceiling
  ["and", ["meets_req"], ["not", ["is", "disabled"]],
    ["has_cap", "supports_tools"], ["cmp", "price_out", "le", 6.0]],
  // rank: cheapest = negate normalized output price
  ["neg", ["normalize", ["field", "price_out"]]],
  ["argmax"], // select: the single best survivor
  ["id"], // mutate: no-op
  ["always", { action: "next_candidate" }], // fallback
];

// client and messages are defined elsewhere: client is your OpenAI-compatible
// SDK instance with its base URL pointed at the router endpoint
const res = await client.chat.completions.create({
  model: "policy:support", // a free-form trace label, not a model name
  policy_ir: policy, // the decision, sent with the call
  messages, // unchanged
});
The raw term is the interface — a plain JSON array you can hand-write, generate, fingerprint, and replay. (A higher-level buildPolicy(...) helper that compiles a short spec down to this term is planned convenience sugar, not a shipping package yet; the array above is what the router actually admits.) That is the whole change: a base URL, a policy:* label, and a policy_ir on the call. The 5-minute quickstart takes it from there — sending the first policy, watching the router land on a model, and dry-running it before you commit — with the documentation open alongside.
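For readers who want to see the change at the HTTP level, the sketch below builds the request without sending it. It rests on assumptions: the base URL is a placeholder, the /chat/completions path and Bearer authorization header follow the usual OpenAI-compatible convention rather than anything confirmed here, and buildRoutedRequest is a hypothetical helper, not part of any SDK.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface RoutedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: assembles the three pieces of the change in one place.
function buildRoutedRequest(
  baseUrl: string,      // placeholder; use the endpoint from the documentation
  apiKey: string,
  label: string,        // a policy:* trace label, not a model name
  policyIr: unknown[],  // the raw policy_ir term
  messages: ChatMessage[],
): RoutedRequest {
  if (!label.startsWith("policy:")) {
    throw new Error(`expected a policy:* label, got "${label}"`);
  }
  return {
    // Assumed OpenAI-compatible path relative to the base URL.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({ model: label, policy_ir: policyIr, messages }),
    },
  };
}

const request = buildRoutedRequest(
  "https://router.example/v1/", // placeholder URL
  "YOUR_API_KEY",
  "policy:support",
  ["policy", ["ev_zero"], ["meets_req"], ["neg", ["normalize", ["field", "price_out"]]],
    ["argmax"], ["id"], ["always", { action: "next_candidate" }]],
  [{ role: "user", content: "Summarize this ticket." }],
);

console.log(request.url);
// To send it: await fetch(request.url, request.init)
```

Keeping the policy_ir as an ordinary field in the JSON body is what lets the next call carry a different policy through the same endpoint.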
How it relates to gateways, marketplaces, and proxies
"Router" is an overloaded word. Policy-based routing is a higher layer of abstraction that can sit on top of the plumbing you already know — it is not a competitor to it. The clean test is whether a component reasons about which model should serve the request or just moves bytes: if you still have to name the model, you have a gateway, a proxy, a balancer, or a marketplace; if you describe the rules the model must satisfy and let the system choose, you have a policy router.
- API gateways and proxies forward your request and add cross-cutting concerns — auth, rate limiting, retries, caching, logging. They answer "how do I reach this endpoint reliably?" You still name the model.
- Load balancers spread traffic across interchangeable backends. Their core assumption is that the targets are equivalent. Models are not equivalent — they differ in price, context window, capability, latency, and provider — so a policy router selects on those differences instead of treating them as noise.
- Model marketplaces and aggregators give you one key and one bill across many providers, usually by reselling tokens through their own billing. That solves access and billing, not the decision: you still pick the model and still pay a per-token margin.
For a tool-by-tool look at where each option puts the routing decision and what that means for your bill, read the comparison of unhardcoded versus Portkey, LiteLLM, and OpenRouter. And for where this is all heading — every LLM call eventually carrying its own portable, enforceable policy — see the runtime policy layer vision. The pattern is open at its core: the policy_ir is a format you can read and reason about, and the reference pieces are part of our open-source work.
The whole picture: policy, routing, fallback, trace, and workflows are one system — a written rule the router enforces and records on every call, over your own keys, for the cost of one line. The case for making the switch is laid out in stop hardcoding model decisions; the pieces it rests on are the posts linked throughout this guide.