Benchmark retention is not utility retention

June 6, 2026

I want to flesh out the point I made here about benchmark accuracy being a bad metric for evaluating model routing.

Users of model routing care about utility retention, not accuracy retention.

Let’s model the problem as utility retention rather than accuracy retention.

Routing benchmarks typically report a metric of the form:

$$\text{Performance Retention} = \frac{\text{Router Accuracy}}{\text{Best Model Accuracy}}.$$

For example, if the strongest model solves 100 benchmark tasks and a router solves 99 of them, the router is said to achieve 99% of the best model’s performance.

This implicitly assumes that every task carries equal value. Formally, if the benchmark contains tasks $t_1,\ldots,t_n$, benchmark accuracy is

$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\text{task } t_i \text{ solved correctly}\}.$$

This is equivalent to assigning every task a value of one.
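As a minimal sketch of the metric above, the snippet below computes performance retention for the 100-task example. The function names and the boolean-list encoding of results are my own illustrative choices, not part of any routing benchmark's API.

```python
from typing import Sequence


def accuracy(solved: Sequence[bool]) -> float:
    """Fraction of tasks solved; every task implicitly has value one."""
    return sum(solved) / len(solved)


def performance_retention(router_solved: Sequence[bool],
                          best_solved: Sequence[bool]) -> float:
    """Router accuracy divided by the best model's accuracy."""
    return accuracy(router_solved) / accuracy(best_solved)


# The strongest model solves all 100 tasks; the router misses exactly one.
best_solved = [True] * 100
router_solved = [True] * 99 + [False]

print(performance_retention(router_solved, best_solved))  # 0.99
```

Note that nothing in this computation can tell which task the router missed.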

In deployment, however, tasks have heterogeneous importance. Let

  • $V(t)$ denote the value of solving task $t$,
  • $L(t)$ denote the loss incurred by failing task $t$,
  • $C(m,t)$ denote the inference cost of running model $m$ on task $t$.

Then the relevant objective is not accuracy but expected utility:

$$U(m) = \mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(m,t)\right],$$

where $D$ is the real-world distribution of tasks.

The routing problem is therefore

$$\max_{R}\;\mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(R(t),t)\right],$$

where $R(t)$ is the model selected by the router.
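The objective can be sketched in code as an empirical average over a sample of tasks. Everything here is an assumption for illustration: the `Task` record, the model names `"small"` and `"large"`, the numbers, and the premise that we know in hindsight which model solves which task (a real router would have to predict this).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    value: float               # V(t)
    loss: float                # L(t)
    cost: Dict[str, float]     # C(m, t), keyed by model name
    correct: Dict[str, bool]   # whether model m solves t (assumed known)


def task_utility(task: Task, model: str) -> float:
    """V(t) if the model is correct, -L(t) otherwise, minus C(m, t)."""
    payoff = task.value if task.correct[model] else -task.loss
    return payoff - task.cost[model]


def expected_utility(tasks: List[Task], router: Callable[[Task], str]) -> float:
    """Empirical mean utility over a sample standing in for D."""
    return sum(task_utility(t, router(t)) for t in tasks) / len(tasks)


def oracle_router(task: Task) -> str:
    """Hindsight router: picks the utility-maximising model per task."""
    return max(task.cost, key=lambda m: task_utility(task, m))


# Hypothetical two-task sample: one cheap easy task, one valuable hard task.
costs = {"small": 0.01, "large": 0.10}
tasks = [
    Task(1.0, 0.5, costs, {"small": True, "large": True}),
    Task(50.0, 20.0, costs, {"small": False, "large": True}),
]

for name, router in [("always small", lambda t: "small"),
                     ("always large", lambda t: "large"),
                     ("oracle", oracle_router)]:
    print(name, expected_utility(tasks, router))
```

The oracle sends the easy task to the cheap model and the valuable task to the strong one, which is the behaviour the maximisation over $R$ rewards.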

Benchmark retention estimates

$$\mathbb{E}[\mathbf{1}_{\text{correct}}],$$

while deployment performance depends on

$$\mathbb{E}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}}\right].$$

These coincide only in the special case where all tasks have identical value and identical failure costs.
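To make the special case concrete, here is a short derivation. It assumes $V(t) = v$ and $L(t) = \ell$ are constants (the symbols $v$ and $\ell$ are introduced here for illustration) and that every task is either correct or incorrect, so $\mathbf{1}_{\text{incorrect}} = 1 - \mathbf{1}_{\text{correct}}$:

$$\mathbb{E}\left[v\,\mathbf{1}_{\text{correct}} - \ell\,\mathbf{1}_{\text{incorrect}}\right] = v\,\mathbb{E}[\mathbf{1}_{\text{correct}}] - \ell\left(1 - \mathbb{E}[\mathbf{1}_{\text{correct}}]\right) = (v + \ell)\,\mathbb{E}[\mathbf{1}_{\text{correct}}] - \ell.$$

With $v + \ell > 0$ this is an increasing affine function of benchmark accuracy, so the two quantities rank routers identically. Once $V(t) + L(t)$ varies across tasks, the expectation no longer factors this way in general.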

In practice, task importance tends to be heavy-tailed. A router that achieves 99% benchmark retention may retain substantially less than 99% of deployment utility if the 1% of tasks it fails contains a disproportionate share of real-world value.
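A rough simulation can illustrate how wide the gap can be. The Pareto distribution, its shape parameter, and the sample size below are all assumptions chosen for illustration, not measurements of any real workload. The sketch also sets $L(t) = 0$, ignores inference cost, and assumes the best model solves every task.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical task values drawn from a heavy-tailed Pareto distribution.
values = sorted((random.paretovariate(1.1) for _ in range(1000)), reverse=True)
total = sum(values)

k = len(values) // 100  # the router fails 1% of tasks

# Worst case: the failed 1% are the most valuable tasks.
worst_case = (total - sum(values[:k])) / total
# Best case: the failed 1% are the least valuable tasks.
best_case = (total - sum(values[-k:])) / total

print(f"benchmark retention:           {1 - k / len(values):.3f}")
print(f"utility retention, worst case: {worst_case:.3f}")
print(f"utility retention, best case:  {best_case:.3f}")
```

Benchmark retention is identical in both cases; only the identity of the failed tasks differs.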

There is no monotonic relationship between benchmark accuracy retention and utility retention.
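A small counterexample makes the lack of monotonicity concrete. It compares two hypothetical routers, A and B, on ten tasks with made-up values, assuming the best model solves everything, failures carry no extra loss, and inference cost is ignored.

```python
from typing import List


def accuracy_retention(solved: List[bool]) -> float:
    """The best model solves every task, so retention equals accuracy."""
    return sum(solved) / len(solved)


def utility_retention(values: List[float], solved: List[bool]) -> float:
    """Value captured by the router over value captured by the best model."""
    captured = sum(v for v, s in zip(values, solved) if s)
    return captured / sum(values)


# One task worth 100, nine tasks worth 1 each (illustrative numbers).
values = [100.0] + [1.0] * 9

router_a = [False] + [True] * 9              # misses only the valuable task
router_b = [True] + [False] * 5 + [True] * 4  # misses five cheap tasks

for name, solved in [("A", router_a), ("B", router_b)]:
    print(name,
          round(accuracy_retention(solved), 3),
          round(utility_retention(values, solved), 3))
```

Router A wins on accuracy retention (0.9 against 0.5) yet loses badly on utility retention (9/109 against 104/109), so ordering routers by one metric does not order them by the other.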

Consider a benchmark of $n$ tasks. Suppose task $t_1$ carries value $M$, while each remaining task carries value $1$. The strongest model solves all tasks, while the router fails only on $t_1$.

Then benchmark retention is

$$\frac{n-1}{n},$$

which approaches $1$ as $n \to \infty$.

However, utility retention is

$$\frac{n-1}{M+n-1},$$

which approaches $0$ as $M \to \infty$.

Thus a router can achieve arbitrarily high benchmark retention while retaining arbitrarily little real-world utility.
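Plugging numbers into the two formulas shows how quickly they diverge. The specific $(n, M)$ pairs below are arbitrary illustrative choices.

```python
def benchmark_retention(n: int) -> float:
    """Router solves n - 1 of n tasks; the best model solves all n."""
    return (n - 1) / n


def utility_retention(n: int, m: float) -> float:
    """Router captures n - 1 units of value out of M + n - 1."""
    return (n - 1) / (m + n - 1)


for n, m in [(100, 1), (100, 100), (100, 10_000), (1_000, 1_000_000)]:
    print(f"n={n:>5}  M={m:>9}  "
          f"benchmark={benchmark_retention(n):.4f}  "
          f"utility={utility_retention(n, m):.4f}")
```

When $M = 1$ the two metrics agree exactly, which is the equal-value case benchmarks implicitly assume. As $M$ grows with $n$ fixed, benchmark retention does not move while utility retention collapses.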

Admittedly, the example assumes the router misses the most important task. But my claim is only that benchmark accuracy retention is decoupled from deployment utility retention. That claim does not require the router to miss the important tasks, only that it could.