Benchmark retention is not utility retention

June 6, 2026

I want to flesh out the point I made here about benchmark accuracy being a bad metric for evaluating model routing.

Users of model routing care about utility retention, not accuracy retention.

So let’s model routing in terms of utility rather than accuracy.

Routing benchmarks typically report a metric of the form:

$$\text{Performance Retention} = \frac{\text{Router Accuracy}}{\text{Best Model Accuracy}}.$$

For example, if the strongest model solves 100 benchmark tasks and a router solves 99 of them, the router is said to achieve 99% of the best model’s performance.

This implicitly assumes that every task carries equal value. Formally, if the benchmark contains tasks $t_1,\ldots,t_n$, benchmark accuracy is

$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\text{task } t_i \text{ solved correctly}\}.$$

This is equivalent to assigning every task a value of one.
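As a minimal sketch of the metric above, here is the 100-versus-99 example in Python. The function names `accuracy` and `performance_retention` are my own labels for illustration, not taken from any particular routing benchmark.

```python
from typing import Sequence


def accuracy(solved: Sequence[bool]) -> float:
    """Fraction of tasks solved; every task implicitly has value 1."""
    return sum(solved) / len(solved)


def performance_retention(router_solved: Sequence[bool],
                          best_solved: Sequence[bool]) -> float:
    """Router accuracy divided by the best model's accuracy."""
    return accuracy(router_solved) / accuracy(best_solved)


# The example from the text: the best model solves 100 tasks,
# the router solves 99 of them.
best = [True] * 100
router = [True] * 99 + [False]
retention = performance_retention(router, best)
print(f"performance retention: {retention:.2%}")
```

Note that nothing in this computation knows which task the router failed, only that it failed one.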

In deployment, however, tasks have heterogeneous importance. Let

$V(t)$ denote the value of solving task $t$,
$L(t)$ denote the loss incurred by failing task $t$, and
$C(m,t)$ denote the inference cost of running model $m$ on task $t$.

Then the relevant objective is not accuracy but expected utility:

$$U(m) = \mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(m,t)\right],$$

where $D$ is the real-world distribution of tasks and the indicators record whether model $m$ solves task $t$.
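To make the objective concrete, here is a rough sketch of how one might estimate $U(m)$ from a sample of tasks drawn from $D$. The `Outcome` record and the numbers in `sample` are made up for illustration; in practice, obtaining per-task values and losses is the hard part.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outcome:
    """One task t as handled by a fixed model m (hypothetical record)."""
    value: float    # V(t): value of solving the task
    loss: float     # L(t): loss incurred by failing the task
    cost: float     # C(m, t): inference cost of running m on the task
    correct: bool   # whether m solved the task


def empirical_utility(outcomes: Sequence[Outcome]) -> float:
    """Sample mean of V*1[correct] - L*1[incorrect] - C."""
    total = 0.0
    for o in outcomes:
        payoff = o.value if o.correct else -o.loss
        total += payoff - o.cost
    return total / len(outcomes)


# Illustrative numbers only.
sample = [
    Outcome(value=10.0, loss=5.0, cost=1.0, correct=True),
    Outcome(value=2.0, loss=8.0, cost=1.0, correct=False),
    Outcome(value=4.0, loss=0.0, cost=1.0, correct=True),
]
u = empirical_utility(sample)
print(f"estimated utility per task: {u:.2f}")
```

Here the model is right on two of three tasks, yet the single failure on a high-loss task cancels the gain from the most valuable success.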

The routing problem is therefore

$$\max_{R}\;\mathbb{E}_{t \sim D}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}} - C(R(t),t)\right],$$

where $R(t)$ is the model selected by the router.
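One way to read this objective: if we had, for each task, an estimate of each model's probability of being correct, the expectation would decompose task by task and the router could simply pick the model with the highest expected utility. The sketch below assumes such estimates exist, which is a strong assumption; the model names, probabilities, and costs are hypothetical.

```python
from typing import Dict


def expected_utility(p_correct: float, value: float,
                     loss: float, cost: float) -> float:
    """p*V - (1 - p)*L - C for a single model on a single task."""
    return p_correct * value - (1.0 - p_correct) * loss - cost


def route(p_correct_by_model: Dict[str, float],
          cost_by_model: Dict[str, float],
          value: float, loss: float) -> str:
    """Pick the model with the highest expected utility on this task."""
    return max(
        p_correct_by_model,
        key=lambda m: expected_utility(
            p_correct_by_model[m], value, loss, cost_by_model[m]
        ),
    )


# Hypothetical per-task estimates for a cheap and an expensive model.
p = {"small": 0.90, "large": 0.99}
c = {"small": 0.01, "large": 0.20}

low_stakes = route(p, c, value=1.0, loss=0.0)
high_stakes = route(p, c, value=1.0, loss=100.0)
print(low_stakes, high_stakes)
```

With the same accuracy estimates, the choice flips once failure becomes expensive, which is exactly the information an accuracy-only benchmark discards.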

Benchmark retention is a ratio of accuracies, each of which estimates

$$\mathbb{E}[\mathbf{1}_{\text{correct}}],$$

while deployment performance depends on

$$\mathbb{E}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}}\right].$$

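These agree, up to an affine rescaling, in the special case where all tasks have identical value and identical failure costs (or, more generally, where $V(t) + L(t)$ is uncorrelated with correctness). Otherwise they can diverge.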
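To see how the two quantities relate, use the fact that the indicators are complementary, $\mathbf{1}_{\text{incorrect}} = 1 - \mathbf{1}_{\text{correct}}$. The constants $v$ and $\ell$ below are notation I am introducing for the constant-value, constant-loss case.

```latex
\begin{aligned}
\mathbb{E}\left[V(t)\mathbf{1}_{\text{correct}} - L(t)\mathbf{1}_{\text{incorrect}}\right]
&= \mathbb{E}\left[\bigl(V(t) + L(t)\bigr)\mathbf{1}_{\text{correct}}\right] - \mathbb{E}\left[L(t)\right] \\
&= (v + \ell)\,\mathbb{E}\left[\mathbf{1}_{\text{correct}}\right] - \ell
\quad \text{if } V(t) \equiv v \text{ and } L(t) \equiv \ell .
\end{aligned}
```

So with constant $v$ and $\ell$ (and $v + \ell > 0$), utility is an increasing affine function of accuracy, and the two metrics rank routers identically. The retention ratios themselves are numerically equal only when $\ell = 0$ or the two accuracies are already the same.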

In practice, task importance is often heavy-tailed. A router that achieves 99% benchmark retention may retain substantially less than 99% of deployment utility if the 1% of tasks it fails carries a disproportionate share of real-world value.

There is no monotonic relationship between benchmark accuracy retention and utility retention.

Consider a benchmark of $n$ tasks. Suppose task $t_1$ carries value $M$, while each remaining task carries value $1$. The strongest model solves all tasks, while the router fails only on $t_1$.

Then benchmark retention is

$$\frac{n-1}{n},$$

which approaches $1$ as $n \to \infty$.

However, taking $L(t) = 0$ and ignoring inference cost, utility retention is

$$\frac{n-1}{M+n-1},$$

which approaches $0$ as $M \to \infty$ for fixed $n$, and more generally whenever $M$ grows faster than $n$ (for example, $M = n^2$).

Thus a router can achieve arbitrarily high benchmark retention while retaining arbitrarily little real-world utility.
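The counterexample is easy to check numerically. The sketch below builds the task list directly, under the same simplifications as the example (zero failure loss, inference cost ignored); `retentions` is a helper name of my own.

```python
from typing import Tuple


def retentions(n: int, big_value: float) -> Tuple[float, float]:
    """Return (benchmark retention, utility retention) for the example.

    Task t_1 has value `big_value`; the other n - 1 tasks have value 1.
    The best model solves everything; the router fails only on t_1.
    Failure loss is taken to be 0 and inference cost is ignored.
    """
    values = [big_value] + [1.0] * (n - 1)
    best_solved = [True] * n
    router_solved = [False] + [True] * (n - 1)

    benchmark = sum(router_solved) / sum(best_solved)
    router_utility = sum(v for v, ok in zip(values, router_solved) if ok)
    best_utility = sum(v for v, ok in zip(values, best_solved) if ok)
    return benchmark, router_utility / best_utility


for n, big_value in [(100, 1.0), (100, 9_901.0), (10_000, 1e8)]:
    bench, util = retentions(n, big_value)
    print(f"n={n:>6} M={big_value:>12.0f} "
          f"benchmark={bench:.4f} utility={util:.6f}")
```

When $M = 1$ the two numbers are identical, as expected; as $M$ grows relative to $n$, benchmark retention stays near 1 while utility retention collapses.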

Admittedly, this example assumes the router misses the most important task. But my claim is only that benchmark accuracy retention is decoupled from deployment utility retention. That claim does not require the router to miss the important tasks, only that it could, and benchmark retention alone cannot tell us whether it does.