Pablo Zavala · AI Safety Evaluation · Research Engineering

Confidence Is Not Oversight Triage

A short note on why average calibration does not tell us which individual agent actions deserve scarce human review.

July 2, 2026 · 4 min

The question

Many agent evaluations ask whether a model is calibrated on average. That is useful, but it is not the same as deciding which actions a person should review.

Oversight triage is a harder operational question. Given a fixed review budget, which specific decisions should receive human attention?

The setup

Safe MarketUniverses treats oversight triage as an allocation problem. Each episode contains sequential agent decisions, a fixed human-review budget, and a hindsight oracle that spends the same budget optimally. The metric is regret against that oracle: how far a rule's review choices fall short of what the oracle achieves with the same number of reviews.
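To make the metric concrete, here is a minimal sketch of regret against a hindsight oracle. It is not the benchmark's actual code. It assumes each decision has a known, additive "benefit of review" once the episode is over, that every review costs one unit of budget, and that the rule ranks the whole episode at once rather than deciding online. The names regret, oracle_selection, rule_selection, and the benefit numbers are all hypothetical.

```python
from typing import Sequence, Set


def top_k(values: Sequence[float], budget: int) -> Set[int]:
    """Indices of the `budget` largest values (ties broken by position)."""
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:budget])


def oracle_selection(benefits: Sequence[float], budget: int) -> Set[int]:
    # With unit costs and additive benefits (assumed here), reviewing the
    # decisions with the largest hindsight benefit is optimal.
    return top_k(benefits, budget)


def rule_selection(scores: Sequence[float], budget: int) -> Set[int]:
    # A triage rule only sees its own scores, not the hindsight benefits.
    return top_k(scores, budget)


def regret(benefits: Sequence[float], scores: Sequence[float], budget: int) -> float:
    """Oracle value minus rule value under the same review budget."""
    oracle_value = sum(benefits[i] for i in oracle_selection(benefits, budget))
    rule_value = sum(benefits[i] for i in rule_selection(scores, budget))
    return oracle_value - rule_value


# Hypothetical episode: review would have helped on decisions 1 and 3 only.
benefits = [0.0, 5.0, 0.0, 2.0, 0.0]
# Hypothetical rule: review where the model is least confident (1 - confidence).
uncertainty = [0.4, 0.1, 0.5, 0.2, 0.3]

print(regret(benefits, uncertainty, budget=2))  # 7.0: both reviews were spent elsewhere
```

In this toy episode the rule spends both reviews on decisions where review changes nothing, so its regret equals the oracle's full value. A sequential version would have to commit to each review before seeing later decisions, which this offline sketch deliberately ignores.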

This framing matters because it evaluates the signal a rule gives to a reviewer. A confidence score can look reasonable in aggregate and still fail to identify the individual decisions where review would have helped most.
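A toy case may help show how the two can come apart. The sketch below uses invented numbers, not benchmark data: a model that reports the same confidence on every decision and is right exactly that often. The calibration_gap helper is a hypothetical, single-bin simplification of the usual calibration checks.

```python
from typing import Sequence


def calibration_gap(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Absolute gap between mean stated confidence and observed accuracy."""
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(mean_confidence - accuracy)


# Invented episode: ten decisions, all reported at 0.8 confidence, eight correct.
confidences = [0.8] * 10
correct = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]

gap = calibration_gap(confidences, correct)

# A score can only rank decisions if it takes more than one value.
can_rank = len(set(confidences)) > 1

print(f"calibration gap: {gap:.3f}")
print(f"score separates decisions: {can_rank}")
```

The gap is essentially zero, so the model looks well calibrated in aggregate, yet every decision gets the same score. A reviewer who follows that score has no way to pick out decisions 4 and 8, the two that were actually wrong.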

The result

The preregistered model-confidence rule performed about as well as chance across 120 episodes. A simple evidence-integrity rule did better under the preregistered scoring scheme, although the public repo reports that this baseline edge is fragile and can flip under equal-weight scoring.

The benchmark does not show that confidence is useless. It shows that average calibration is not enough to choose which actions a person should review.

The design lesson

For agent systems with real authority, evaluation needs to test the operational decision directly:

  • What failure mode matters?
  • What evidence would have let a reviewer catch it?
  • Which grader is appropriate for that claim?
  • What changes when the budget is fixed?
  • Which limitations survive the headline result?

The practical standard is simple: if a benchmark is meant to guide oversight, it should score oversight allocation, not only model self-belief.