Pablo Zavala · AI Safety Evaluation · Research Engineering

Safe MarketUniverses: Oversight-Allocation Benchmark

When human review is scarce, can a model tell us which decisions deserve a person's attention? In this 120-episode benchmark, the answer under the preregistered analysis was no: model confidence routed review about as well as chance.

120 episodes; model confidence performed about as well as chance

Public reproducible benchmark

The benchmark tests allocation in a compact finance-style environment, not every oversight setting.

Role: Benchmark author (episode design, regret metric, analysis, and public evidence package).

Evaluation Card

Evaluation axes with sample size, scorer, result, and limitation.
Axis: Review allocation
Sample: 120 sequential episodes with fixed review budgets
Scorer: Regret against a hindsight oracle
Result: Model confidence reached regret 0.176 versus 0.191 for random allocation.
Limitation: The difference is too small to be useful as an oversight-triage signal in this benchmark.

Axis: Failure split
Sample: Corrupted-evidence episodes
Scorer: Overreach and miss-rate decomposition
Result: The companion report records 52.5 percent overreach and 73.2 percent miss rates.
Limitation: The split comes from a compact finance-style environment, not a universal deployment sample.

Axis: Baseline comparison
Sample: Same episodes and same review budget
Scorer: Preregistered allocation-regret rule
Result: A simple evidence-integrity rule did better under the preregistered scoring scheme.
Limitation: The edge is fragile and can flip under equal-weight scoring.

How to Inspect This Work

Evaluation target

The benchmark asks whether a model's own confidence can allocate limited human review across sequential agent decisions.
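
A minimal sketch of what "allocating review by confidence" could look like, next to a chance baseline. It assumes each step carries a self-reported confidence in [0, 1] and that the rule sends the least-confident steps to review. The function names, the step representation, and the lowest-confidence-first policy are illustrative assumptions, not the benchmark's harness, and the sketch ranks a whole episode at once rather than deciding step by step.

```python
import random
from typing import List, Set


def allocate_by_confidence(confidences: List[float], budget: int) -> Set[int]:
    """Review the `budget` steps with the lowest self-reported confidence."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return set(order[:budget])


def allocate_at_random(n_steps: int, budget: int, seed: int = 0) -> Set[int]:
    """Chance baseline: review `budget` steps chosen uniformly at random."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_steps), min(budget, n_steps)))


# Hypothetical episode: five sequential decisions, budget of two reviews.
confidences = [0.91, 0.42, 0.88, 0.35, 0.77]
reviewed = allocate_by_confidence(confidences, budget=2)
print(sorted(reviewed))  # the two least-confident steps: [1, 3]
```

Both rules spend the same budget, so any difference in outcome comes from which steps they pick rather than how many.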

Scoring rule

Review rules are scored by regret against a hindsight oracle that spends the same review budget optimally.
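
The scoring rule above can be sketched as follows. This is an illustration under stated assumptions, not the repo's metric code: it assumes each step has a known, non-negative "harm averted if reviewed" value available in hindsight, that the oracle simply reviews the highest-harm steps, and that regret is averaged over the steps in an episode. All names here are hypothetical.

```python
from typing import Iterable, List


def review_value(harm: List[float], reviewed: Iterable[int]) -> float:
    """Total harm averted by reviewing the given step indices."""
    return sum(harm[i] for i in set(reviewed))


def oracle_value(harm: List[float], budget: int) -> float:
    """Hindsight oracle: spend the same budget on the most harmful steps."""
    return sum(sorted(harm, reverse=True)[:budget])


def per_step_regret(harm: List[float], reviewed: Iterable[int], budget: int) -> float:
    """Oracle value minus rule value, averaged over the episode's steps."""
    reviewed = list(reviewed)
    if len(set(reviewed)) > budget:
        raise ValueError("rule exceeded the review budget")
    return (oracle_value(harm, budget) - review_value(harm, reviewed)) / len(harm)


# Hypothetical episode: harm that review would have averted at each step.
harm = [0.0, 1.0, 0.0, 0.5, 0.0]
print(per_step_regret(harm, reviewed=[1, 3], budget=2))  # oracle-optimal choice
print(per_step_regret(harm, reviewed=[0, 1], budget=2))  # misses step 3
```

Because the oracle is held to the same budget, regret is zero exactly when a rule reviews a set of steps as valuable as the best possible set, and it grows with the value of what the rule missed.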

Caveat to inspect

The null is the main result, and the evidence-integrity baseline is reported with fragility caveats rather than sold as a universal fix.

Case Study

Problem

Human review is expensive. Frontier systems need triage rules that decide which agent actions a person should inspect.

Setup

Safe MarketUniverses uses finance-style episodes because evidence quality, uncertainty, and review cost are visible in a compact domain.

Method

The benchmark scores review-allocation regret against a hindsight oracle that spends the same review budget optimally.

Result

Model-emitted confidence routed review near random, while a simple evidence-integrity rule did better under the preregistered scoring scheme.

Limitation

The null result is the point. The baseline edge is also fragile: the public repo reports that it can flip under equal-weight scoring.
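
To see how a comparison can flip when the weighting changes, here is a toy sketch. The numbers are invented for illustration and are not benchmark data, and the "emphasis" weights are a hypothetical stand-in, since the preregistered weighting is not described here. The only point is the mechanism: a rule that does well on heavily weighted episodes can lose once every episode counts equally.

```python
from typing import List


def weighted_mean(values: List[float], weights: List[float]) -> float:
    """Weighted average; equal weights reduce to the plain mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy numbers only, not benchmark data: per-episode regret for two rules.
regret_rule_a = [0.10, 0.35, 0.35]
regret_rule_b = [0.40, 0.15, 0.15]

# Hypothetical non-uniform scheme that emphasizes the first episode.
emphasis = [4.0, 1.0, 1.0]
equal = [1.0, 1.0, 1.0]

for name, w in [("emphasis", emphasis), ("equal", equal)]:
    a = weighted_mean(regret_rule_a, w)
    b = weighted_mean(regret_rule_b, w)
    winner = "A" if a < b else "B"
    print(f"{name}: A={a:.3f} B={b:.3f} lower regret: {winner}")
```

In this toy case rule A has lower regret under the emphasis weights and rule B has lower regret under equal weights, which is the kind of sensitivity a fragility caveat is meant to flag.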

Evidence

The public repo regenerates headline numbers from committed episode logs, with no model calls and no API keys.
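
A sketch of what regenerating a headline number from committed logs can look like. The JSON Lines schema (`episode`, `rule`, `regret`), the function name, and the sample values are all assumptions made for illustration; the repo's actual log format and scripts may differ. The in-memory string stands in for a committed log file so the example runs without any file, network, or API access.

```python
import io
import json
from collections import defaultdict
from typing import Dict, TextIO


def mean_regret_by_rule(log: TextIO) -> Dict[str, float]:
    """Average the logged regret per allocation rule; no model calls needed."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for line in log:
        if not line.strip():
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        totals[record["rule"]] += record["regret"]
        counts[record["rule"]] += 1
    return {rule: totals[rule] / counts[rule] for rule in totals}


# Stand-in for a committed log file; schema and values are hypothetical.
sample_log = io.StringIO(
    '{"episode": 0, "rule": "confidence", "regret": 0.2}\n'
    '{"episode": 0, "rule": "random", "regret": 0.3}\n'
    '{"episode": 1, "rule": "confidence", "regret": 0.4}\n'
    '{"episode": 1, "rule": "random", "regret": 0.1}\n'
)
print(mean_regret_by_rule(sample_log))
```

Since the computation reads only recorded outcomes, rerunning it gives the same numbers every time, which is what makes the evidence checkable by a third party.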

Key Outcomes

  • 120 episodes with fully committed, regenerable evidence
  • Model-emitted confidence routed scarce review near random
  • A hand-coded evidence-integrity baseline roughly halved per-step regret under the preregistered scoring scheme, with fragility caveats reported in the repo
  • Preregistered null result reported plainly

Methods

  • Oversight-allocation regret against a hindsight oracle
  • Preregistration
  • Agent evaluation harness
  • Committed, regenerable evidence