Pablo Zavala · AI Safety Evaluation · Research Engineering

Confidence Alone Fails Oversight Triage

Why model confidence can be calibrated on average but still fail the operational question: which agent decisions deserve scarce human review.

July 2, 2026 · 4 min

Calibration alone fails review routing

A calibrated model can still be a poor triage system. Calibration asks whether stated confidence matches accuracy on average. Oversight triage asks a different question: when review time is scarce, which individual agent decisions should a person inspect?

That distinction matters for agentic systems because oversight is an intervention. The reviewer needs a signal that points to decisions where human attention could change the outcome.

Benchmarking oversight as allocation

Safe MarketUniverses turns oversight into a budgeted allocation problem. Each episode contains sequential agent decisions, a fixed human-review budget, and a hindsight oracle that spends the same budget optimally after seeing the outcomes. The benchmark scores review rules by regret against that oracle.

This is a stricter test than asking whether confidence sounds reasonable. A confidence score can behave well in aggregate and still miss the particular decisions where review would have prevented avoidable loss.

Confidence routed review near chance

The preregistered confidence rule performed about as well as chance across 120 episodes. A simple evidence-integrity rule did better under the preregistered scoring scheme. Even so, that result carries caveats: the public repo shows the edge is fragile and can flip under equal-weight scoring.

Accordingly, the benchmark makes a narrower claim: average calibration alone lacks the operational evidence needed to allocate a limited human-review budget.

Design standards for an oversight benchmark

A benchmark meant to guide oversight should score the oversight decision itself. That means asking:

Which failure mode is the reviewer trying to catch?
What evidence would have made that failure visible?
Does the scoring rule match the institutional decision?
What changes when review budget is fixed?
Which caveats survive beside the headline number?

Oversight benchmarks should reward signals that direct scarce review toward decisions where intervention can change the outcome.