Evaluation target
The benchmark asks whether a model's own confidence can allocate limited human review across sequential agent decisions.
Pablo Zavala · AI Safety Evaluation · Research Engineering
When human review is scarce, can a model tell us which decisions deserve a person's attention? In this 120-episode benchmark, the preregistered answer was no: model confidence routed review about as well as chance.
The benchmark tests allocation in a compact finance-style environment, not every oversight setting.
Role: Benchmark author, responsible for episode design, the regret metric, analysis, and the public evidence package.
| Axis | Sample | Scorer | Result | Limitation |
|---|---|---|---|---|
| Review allocation | 120 sequential episodes with fixed review budgets | Regret against a hindsight oracle | Model confidence reached regret 0.176 versus 0.191 for random allocation. | The difference is too small to be useful as an oversight-triage signal in this benchmark. |
| Failure split | Corrupted-evidence episodes | Overreach and miss-rate decomposition | The companion report records an overreach rate of 52.5 percent and a miss rate of 73.2 percent. | The split comes from a compact finance-style environment, not a universal deployment sample. |
| Baseline comparison | Same episodes and same review budget | Preregistered allocation-regret rule | A simple evidence-integrity rule did better under the preregistered scoring scheme. | The edge is fragile and can flip under equal-weight scoring. |
Review rules are scored by regret against a hindsight oracle that spends the same review budget optimally.
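To make the scoring idea concrete, here is a minimal sketch of regret against a hindsight oracle under an equal review budget. It is an illustration, not the benchmark's actual implementation. It assumes that each decision carries a hindsight loss, that review fully averts that loss, and that regret is normalized by the total loss in the episode. The names `allocation_regret`, `top_k`, and `averted_loss` are hypothetical.

```python
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def averted_loss(losses: Sequence[float], reviewed: Sequence[int]) -> float:
    """Loss avoided by reviewing the given decisions (assumes review fully averts loss)."""
    return sum(losses[i] for i in reviewed)


def allocation_regret(
    losses: Sequence[float], priorities: Sequence[float], budget: int
) -> float:
    """Normalized gap between the oracle's and the rule's averted loss.

    losses:     hindsight loss of each decision if left unreviewed (assumed known).
    priorities: the rule's review priority per decision, e.g. 1 - confidence.
    budget:     number of reviews available; the oracle gets the same number.
    """
    oracle = averted_loss(losses, top_k(losses, budget))
    rule = averted_loss(losses, top_k(priorities, budget))
    total = sum(losses)
    return 0.0 if total == 0 else (oracle - rule) / total


# Toy episode (made-up numbers, not benchmark data).
losses = [0.0, 5.0, 1.0, 4.0]
priorities = [0.9, 0.2, 0.8, 0.7]  # a rule that flags the wrong decisions
print(allocation_regret(losses, priorities, budget=2))
```

In this toy episode the rule reviews the two least harmful decisions, so its regret is high. A rule that ranks decisions exactly as the oracle does scores zero.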
The null is the main result, and the evidence-integrity baseline is reported with fragility caveats rather than sold as a universal fix.
Human review is expensive. Frontier systems need triage rules that decide which agent actions a person should inspect.
Safe MarketUniverses uses finance-style episodes because evidence quality, uncertainty, and review cost are visible in a compact domain.
The benchmark scores review allocation regret against a hindsight oracle that spends the same review budget optimally.
Model-emitted confidence routed review about as well as random allocation, while a simple evidence-integrity rule did better under the preregistered scoring scheme.
The null result is the point. The baseline edge is also fragile: the public repo reports that it can flip under equal-weight scoring.
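The following sketch shows how a ranking between two rules can flip when the scoring weights change. It is a toy illustration only: the per-episode regrets and weights are invented, the source does not say what the preregistered scheme weights (episodes, error types, or something else), and `weighted_mean` is a hypothetical helper.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-episode regret."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Invented per-episode regrets for two rules (not benchmark data).
integrity_rule = [0.10, 0.10, 0.40]
confidence_rule = [0.20, 0.20, 0.10]

# Hypothetical unequal weights standing in for a preregistered scheme.
unequal = [3.0, 3.0, 1.0]
equal = [1.0, 1.0, 1.0]

for name, weights in [("unequal", unequal), ("equal", equal)]:
    a = weighted_mean(integrity_rule, weights)
    b = weighted_mean(confidence_rule, weights)
    winner = "integrity" if a < b else "confidence"
    print(f"{name}: integrity={a:.3f} confidence={b:.3f} lower regret: {winner}")
```

With these made-up numbers, the integrity rule has lower regret under the unequal weights and higher regret under equal weights. That sensitivity is the kind the fragility caveat describes.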
The public repo regenerates headline numbers from committed episode logs, with no model calls and no API keys.