Pablo Zavala · AI Safety Evaluation · Research Engineering
Confidence Is Not Oversight Triage
A short note on why average calibration does not tell us which individual agent actions deserve scarce human review.
The question
Many agent evaluations ask whether a model is calibrated on average. That is useful, but it is not the same as deciding which actions a person should review.
Oversight triage is a harder operational question. Given a fixed review budget, which specific decisions should receive human attention?
The setup
Safe MarketUniverses treats oversight triage as an allocation problem. Each episode contains sequential agent decisions, a fixed human-review budget, and a hindsight oracle that spends the same budget optimally. The metric is regret against that oracle: the shortfall of a rule's review choices relative to the oracle's.
This framing matters because it evaluates the signal a rule gives to a reviewer. A confidence score can look reasonable in aggregate and still fail to identify the individual decisions where review would have helped most.
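To make the regret metric concrete, here is a minimal sketch under simplifying assumptions that are mine, not the benchmark's: each decision has a single hindsight "review value" (how much a human review would have helped), a rule assigns each decision a priority score, and both the rule and the oracle review the top `budget` decisions in one offline pass. The names `triage_regret`, `priority`, and `review_value` are hypothetical, and the numbers are invented for illustration. The real benchmark is sequential and may score differently.

```python
from typing import Sequence


def top_k_indices(scores: Sequence[float], k: int) -> list[int]:
    """Indices of the k highest scores; ties go to the earlier index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def triage_regret(
    priority: Sequence[float],
    review_value: Sequence[float],
    budget: int,
) -> float:
    """Hindsight-oracle value minus the value captured by the rule.

    Assumption: value is additive across reviewed decisions, so the
    oracle simply reviews the `budget` decisions with the highest value.
    """
    if len(priority) != len(review_value):
        raise ValueError("priority and review_value must have equal length")
    k = max(0, min(budget, len(priority)))
    chosen = top_k_indices(priority, k)      # what the rule sends to review
    oracle = top_k_indices(review_value, k)  # what hindsight would review
    captured = sum(review_value[i] for i in chosen)
    best = sum(review_value[i] for i in oracle)
    return best - captured


# Toy episode (invented numbers): five decisions, budget of two reviews.
review_value = [0, 5, 0, 1, 3]
# A hypothetical rule whose priorities point at the wrong decisions.
priority = [0.9, 0.1, 0.8, 0.7, 0.2]
regret = triage_regret(priority, review_value, budget=2)
```

In this toy episode the rule spends both reviews on decisions where review had no value, while the oracle picks the two decisions worth 5 and 3, so the regret is 8. A rule that ranked decisions exactly by hindsight value would have zero regret.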
The result
The preregistered model-confidence rule performed about as well as chance across 120 episodes. A simple evidence-integrity rule did better under the preregistered scoring scheme, although the public repo reports that this baseline edge is fragile and can flip under equal-weight scoring.
The benchmark does not show that confidence is useless. It shows that average calibration is not enough to choose which actions a person should review.
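A deliberately extreme toy case shows how the two properties can come apart. The data below is invented, not taken from the benchmark, and the helper names `calibration_gap` and `failures_caught` are hypothetical. The model states the same confidence on every action and is right exactly that often, so its average calibration is essentially perfect, yet the score gives a reviewer nothing to rank on.

```python
from typing import Sequence


def calibration_gap(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Absolute gap between mean stated confidence and observed accuracy."""
    n = len(confidence)
    return abs(sum(confidence) / n - sum(correct) / n)


def failures_caught(
    priority: Sequence[float], correct: Sequence[bool], budget: int
) -> int:
    """Failures among the `budget` highest-priority actions.

    Ties go to the earlier index, which is an arbitrary choice.
    """
    order = sorted(range(len(priority)), key=lambda i: (-priority[i], i))
    return sum(1 for i in order[:budget] if not correct[i])


# Ten actions, two of which fail (invented data).
correct = [True, True, True, False, True, True, True, True, False, True]
flat_confidence = [0.8] * 10

# Review the least-confident actions first.
priority = [1 - c for c in flat_confidence]

gap = calibration_gap(flat_confidence, correct)            # approximately 0
caught = failures_caught(priority, correct, budget=2)      # rule's catches
oracle_caught = failures_caught(                           # hindsight catches
    [0.0 if ok else 1.0 for ok in correct], correct, budget=2
)
```

Here the calibration gap is zero up to floating-point error, but every action ties on priority, so the two reviews land on whichever actions the tie-break happens to favor. With index tie-breaking they catch neither failure; with random tie-breaking the expected number caught would be 2 × 2 / 10 = 0.4, which is exactly chance. The hindsight oracle catches both.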
The design lesson
For agent systems with real authority, evaluation needs to test the operational decision directly:
- What failure mode matters?
- What evidence would have let a reviewer catch it?
- Which grader is appropriate for that claim?
- What changes when the budget is fixed?
- Which limitations survive the headline result?
The practical standard is simple: if a benchmark is meant to guide oversight, it should score oversight allocation, not only model self-belief.