Pablo Zavala · AI Safety Evaluation · Research Engineering

Safe MarketUniverses: Oversight-Allocation Benchmark

When human review is scarce, can a model tell us which decisions deserve a person's attention? In this 120-episode benchmark, the answer under the preregistered analysis was no: model confidence routed review about as well as chance.

120 episodes; model confidence performed about as well as chance

Public reproducible benchmark

The benchmark tests allocation in a compact finance-style environment, not every oversight setting.

Role: Benchmark author, responsible for episode design, the regret metric, analysis, and the public evidence package.

Evaluation Card

Evaluation axes with sample size, scorer, result, and limitation.
Axis	Sample	Scorer	Result	Limitation
Review allocation	120 sequential episodes with fixed review budgets	Regret against a hindsight oracle	Model confidence reached a regret of 0.176, versus 0.191 for random allocation.	The difference is too small to be useful as an oversight-triage signal in this benchmark.
Failure split	Corrupted-evidence episodes	Overreach and miss-rate decomposition	The companion report records a 52.5 percent overreach rate and a 73.2 percent miss rate.	The split comes from a compact finance-style environment, not a universal deployment sample.
Baseline comparison	Same episodes and same review budget	Preregistered allocation-regret rule	A simple evidence-integrity rule did better under the preregistered scoring scheme.	The edge is fragile and can flip under equal-weight scoring.
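
For scale, the gap in the first row can be worked out directly from the two reported regret values. This is plain arithmetic on the numbers in the table; the variable names are illustrative, and the calculation says nothing about statistical significance.

```python
# Regret values as reported in the Evaluation Card (lower is better).
confidence_regret = 0.176  # review routed by model confidence
random_regret = 0.191      # review routed at random

absolute_gap = random_regret - confidence_regret
relative_gap = absolute_gap / random_regret

print(f"absolute gap: {absolute_gap:.3f}")  # absolute gap: 0.015
print(f"relative gap: {relative_gap:.1%}")  # relative gap: 7.9%
```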

How to Inspect This Work

Evaluation target

The benchmark asks whether a model's own confidence can allocate limited human review across sequential agent decisions.

Scoring rule

Review rules are scored by regret against a hindsight oracle that spends the same review budget optimally.
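
A minimal sketch of this kind of scoring, under stated assumptions: each step carries a realized "stake" (the value of having reviewed it, known only in hindsight), a review rule ranks steps by a score, and regret is the oracle's shortfall normalized by the oracle's total. The function names, the stake definition, and the normalization are hypothetical illustrations, not the benchmark's actual implementation.

```python
from typing import Sequence


def reviewed_value(stakes: Sequence[float], reviewed: Sequence[int]) -> float:
    """Total stake covered by the steps that were sent to human review."""
    return sum(stakes[i] for i in reviewed)


def top_k(scores: Sequence[float], budget: int) -> list[int]:
    """Indices of the `budget` highest-scoring steps."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


def allocation_regret(
    stakes: Sequence[float], scores: Sequence[float], budget: int
) -> float:
    """Normalized regret of a score-based rule against a hindsight oracle.

    The oracle spends the same budget on the steps with the highest
    realized stakes. Normalizing by the oracle's value is an assumption.
    """
    best = reviewed_value(stakes, top_k(stakes, budget))
    if best == 0:
        return 0.0  # nothing was worth reviewing, so no rule can fall short
    achieved = reviewed_value(stakes, top_k(scores, budget))
    return (best - achieved) / best


# Toy episode (illustrative numbers, not benchmark data).
stakes = [0.0, 1.0, 0.0, 0.5]
# A confidence-based rule reviews the steps where the model is least confident.
uncertainty = [0.9, 0.1, 0.2, 0.8]

regret = allocation_regret(stakes, uncertainty, budget=2)
print(round(regret, 3))  # 0.667
```

In this toy episode the model is confident on the one step that mattered most, so the confidence rule recovers only a third of what the oracle does.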

Caveat to inspect

The null is the main result, and the evidence-integrity baseline is reported with fragility caveats rather than sold as a universal fix.

Case Study

Problem

Human review is expensive. Frontier systems need triage rules that decide which agent actions a person should inspect.

Setup

Safe MarketUniverses uses finance-style episodes because evidence quality, uncertainty, and review cost are visible in a compact domain.

Method

The benchmark scores review allocation regret against a hindsight oracle that spends the same review budget optimally.

Result

Model-emitted confidence routed review about as well as random allocation, while a simple evidence-integrity rule did better under the preregistered scoring scheme.

Limitation

The null result is the point. The baseline edge is also fragile: the public repo reports that it can flip under equal-weight scoring.
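
How a ranking can flip between weighting schemes is easy to show on toy data. The sketch below assumes the preregistered scheme weights episodes unequally, which the page implies but does not specify. The regrets, weights, and rule names are invented for illustration and are not benchmark results.

```python
def weighted_mean(values, weights):
    """Weighted average of per-episode regrets."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy per-episode regrets for two hypothetical review rules (lower is better).
rule_a = [0.10, 0.10, 0.50, 0.50]  # strong on the first two episodes only
rule_b = [0.30, 0.30, 0.20, 0.20]  # mediocre everywhere

unequal_weights = [4, 4, 1, 1]  # hypothetical scheme that emphasizes episodes 1-2
equal_weights = [1, 1, 1, 1]

for name, weights in [("unequal", unequal_weights), ("equal", equal_weights)]:
    a = weighted_mean(rule_a, weights)
    b = weighted_mean(rule_b, weights)
    winner = "rule_a" if a < b else "rule_b"
    print(f"{name}: rule_a={a:.2f} rule_b={b:.2f} -> {winner}")
# unequal: rule_a=0.18 rule_b=0.28 -> rule_a
# equal: rule_a=0.30 rule_b=0.25 -> rule_b
```

A rule whose advantage is concentrated in heavily weighted episodes can lose its lead once every episode counts the same. That is why the edge is worth checking under both schemes.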

Evidence

The public repo regenerates headline numbers from committed episode logs, with no model calls and no API keys.

Key Outcomes

120 episodes with fully committed, regenerable evidence
Model-emitted confidence routed scarce review about as well as random allocation
A hand-coded evidence-integrity baseline roughly halved per-step regret under the preregistered scoring scheme, with fragility caveats reported in the repo
Preregistered null result reported plainly

Methods

Oversight-allocation regret against a hindsight oracle
Preregistration
Agent evaluation harness
Committed, regenerable evidence