Benchmark retention is not utility retention
June 6, 2026
I want to flesh out the point I made here about benchmark accuracy being a bad metric for evaluating model routing.
Users of model routing care about utility retention, not accuracy retention.
So let’s model the problem in terms of utility rather than accuracy.
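Routing benchmarks typically report a metric of the form:
$$\text{retention} = \frac{\text{accuracy of the router}}{\text{accuracy of the strongest model}}$$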
For example, if the strongest model solves 100 benchmark tasks and a router solves 99 of them, the router is said to achieve 99% of the best model’s performance.
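This implicitly assumes that every task carries equal value. Formally, if the benchmark contains tasks $t_1, \dots, t_N$, the benchmark accuracy of a model $m$ is
$$\mathrm{Acc}(m) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[m \text{ solves } t_i]$$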
This is equivalent to assigning every task a value of one.
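As a minimal sketch of the metric above, the snippet below computes benchmark retention from per-task pass/fail flags. The function names and the toy data are illustrative assumptions, not taken from any particular routing benchmark.

```python
def accuracy(solved):
    """Fraction of tasks solved; every task implicitly counts as 1."""
    return sum(solved) / len(solved)


def benchmark_retention(router_solved, best_solved):
    """Router accuracy divided by the strongest model's accuracy."""
    return accuracy(router_solved) / accuracy(best_solved)


# Toy data: the strongest model solves all 100 tasks, the router misses one.
best = [True] * 100
router = [True] * 99 + [False]

print(benchmark_retention(router, best))
```

Nothing in this computation knows which task was missed, which is exactly the equal-value assumption.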
In deployment, however, tasks have heterogeneous importance. Let
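- $v_t$ denote the value of solving task $t$,
- $\ell_t$ denote the loss incurred by failing task $t$,
- $c_{m,t}$ denote the inference cost of running model $m$ on task $t$.
Then the relevant objective is not accuracy but expected utility:
$$U(m) = \mathbb{E}_{t \sim \mathcal{D}}\left[ v_t \, \mathbf{1}[m \text{ solves } t] - \ell_t \, \mathbf{1}[m \text{ fails } t] - c_{m,t} \right]$$
where $\mathcal{D}$ is the real-world distribution of tasks.
The routing problem is therefore
$$\max_{r} \; \mathbb{E}_{t \sim \mathcal{D}}\left[ v_t \, \mathbf{1}[r(t) \text{ solves } t] - \ell_t \, \mathbf{1}[r(t) \text{ fails } t] - c_{r(t),t} \right]$$
where $r(t)$ is the model selected by the router.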
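The objective can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Task` class, its `weight` field (standing in for how often a task occurs in deployment), and the example numbers are all hypothetical, and the solved/failed outcomes are taken as given rather than predicted.

```python
from dataclasses import dataclass


@dataclass
class Task:
    value: float         # value of solving the task
    loss: float          # loss incurred by failing the task
    weight: float = 1.0  # hypothetical relative frequency in deployment


def expected_utility(tasks, solved, costs):
    """Weighted mean of (value if solved, minus loss if failed) minus cost.

    solved[i] and costs[i] describe the model the router chose for tasks[i].
    """
    total_weight = sum(t.weight for t in tasks)
    total = 0.0
    for task, ok, cost in zip(tasks, solved, costs):
        payoff = task.value if ok else -task.loss
        total += task.weight * (payoff - cost)
    return total / total_weight


# Two equally frequent tasks: one high-stakes and solved, one trivial and failed.
tasks = [Task(value=10.0, loss=5.0), Task(value=1.0, loss=0.0)]
print(expected_utility(tasks, solved=[True, False], costs=[0.5, 0.1]))
```

Comparing two routers then means comparing this number for each of them, rather than comparing their solve counts.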
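Benchmark retention estimates
$$\frac{\sum_{i} \mathbf{1}[r(t_i) \text{ solves } t_i]}{\sum_{i} \mathbf{1}[m^\star \text{ solves } t_i]}$$
while deployment performance depends on
$$\frac{U(r)}{U(m^\star)}$$
where $m^\star$ is the strongest model, the sums run over the benchmark tasks $t_i$, and $U(r)$ is the expected utility obtained when the router picks the model for each task.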
These coincide only in the special case where all tasks have identical value and identical failure costs.
In general, task importance is heavy-tailed. A router that achieves 99% benchmark retention may retain substantially less than 99% of deployment utility if the omitted 1% of tasks contains a disproportionate share of real-world value.
There is no monotonic relationship between benchmark accuracy retention and utility retention.
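A small sketch can make the non-monotonicity concrete. It assumes a hypothetical Zipf-like value profile (task of rank k is worth 1/k), a strongest model that solves everything, and zero losses and inference costs; none of these numbers come from a real workload.

```python
def retention(values, router_solved):
    """Share of total value kept by the router.

    Assumes the strongest model solves every task and ignores
    failure losses and inference costs.
    """
    kept = sum(v for v, ok in zip(values, router_solved) if ok)
    return kept / sum(values)


N = 100
unit_values = [1.0] * N                                 # what a benchmark implicitly assumes
zipf_values = [1.0 / rank for rank in range(1, N + 1)]  # hypothetical heavy tail

# Router A fails only the single most valuable task.
router_a = [False] + [True] * (N - 1)
# Router B fails the ten least valuable tasks.
router_b = [True] * (N - 10) + [False] * 10

for name, solved in [("A", router_a), ("B", router_b)]:
    bench = retention(unit_values, solved)
    util = retention(zipf_values, solved)
    print(f"router {name}: benchmark {bench:.3f}, utility {util:.3f}")
```

Under these assumptions router A wins on benchmark retention (0.99 against 0.90) but loses on utility retention (about 0.81 against about 0.98), so ranking routers by the benchmark number gets the order backwards.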
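Consider a benchmark of $N$ tasks. Suppose task $t_1$ carries value $V$, while each remaining task carries value $1$. The strongest model solves all $N$ tasks, while the router fails only on $t_1$.
Then benchmark retention is
$$\frac{N-1}{N}$$
which approaches $1$ as $N \to \infty$.
However, utility retention (counting task values only, with failure losses and inference costs set to zero) is
$$\frac{N-1}{V + N - 1}$$
which approaches $0$ as $V / N \to \infty$.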
Thus a router can achieve arbitrarily high benchmark retention while retaining arbitrarily little real-world utility.
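To see the two ratios diverge numerically, here is a short sketch. It assumes the one important task is worth `big_value`, every other task is worth 1, the router fails only the important task, and losses and costs are ignored; the parameter names and the chosen sizes are illustrative.

```python
def counterexample(n, big_value):
    """Return (benchmark retention, utility retention) for the toy setup.

    One task is worth big_value, the other n - 1 tasks are worth 1 each,
    and the router fails only the valuable task.
    """
    benchmark = (n - 1) / n
    utility = (n - 1) / (big_value + n - 1)
    return benchmark, utility


# Let the benchmark grow while the important task's value grows faster.
for n, big_value in [(100, 100), (1_000, 100_000), (10_000, 100_000_000)]:
    bench, util = counterexample(n, big_value)
    print(f"N={n:>6} V={big_value:>11}: benchmark {bench:.4f}, utility {util:.4f}")
```

Each row pushes benchmark retention closer to 1 and utility retention closer to 0.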
Admittedly, this example assumes the router misses the most important task. But my claim is only that benchmark accuracy retention is decoupled from deployment utility retention. That claim does not require the router to miss the important task, only that it could.