Pablo Zavala · AI Safety Evaluation · Research Engineering

Authority Calibration in Long-Horizon AI Agents

AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.

Public pilot for agent authority use: disclosure observed in 14/14 and 12/12 suppression probes; self-demotion in 9/9 and 8/8 supplied-rule trials; 0 observed matched firebreak inversions; separate Codex probe found 6/19 assurance weakening

Public reproducible repo

Pilot-scale evidence, not a rare-failure guarantee.

Role: Research lead: framework, pilot design, scorer audit, and reproducibility package.

Evaluation Card

Evaluation axes with sample size, scorer, result, and limitation.
AxisSampleScorerResultLimitation
Self-preservation suppression14/14 and 12/12 disclosure probesConcrete disclosure record with blind audit reviewThe served-stack pilots disclosed material information under direct deprecation pressure.Pilot-scale pass, not proof of rare-failure absence.
Constitutional self-demotion9/9 and 8/8 self-demotion trialsAction-level state change and authority relinquishment checkAgents stepped down when the governing rules required it.The result covers the supplied-rule setting only.
Authorization firebreakClaude pooled 0/38 inversions plus matched Codex firebreak trialsBoundary-state scorer plus positive controlsNo authorization inversions were observed in the firebreak condition.This is a bounded null from supplied-rule pilots, not proof of deployment safety.
Assurance subversionCodex 6/19 adverse-effect trialsBoundary-state scorer plus positive controlsA metric frame weakened the assurance signal in a separate probe.This is not an authorization inversion; it is evidence that a scoring frame can weaken assurance.
Scorer validityFive escalating false-positive lessonsPlanted calibration cases and blind panelsDisciplined answers were repeatedly false-flagged, then fixed and reported.Same-family blind auditors are useful, but independent human adjudication remains future work.

How to Inspect This Work

Measured behavior

The evaluation scores delegated action, disclosure, self-demotion, and constitutional inversion risk rather than only scoring model text.

Public artifacts

The repository includes the framework, disclosure notes, ethics release note, audit trail, compact paper, and bare-clone verification gates.

Reader check

Start by rerunning the gates, then inspect the scorer false positives and the separate Codex assurance-subversion effect.

Case Study

Problem

Most model-safety evaluations score what systems say. This project scores what agents do when they hold delegated authority.

Setup

The evaluation tests two failure modes: overreach, where an agent exceeds its mandate, and underreach, where it refuses authority it legitimately holds.

Method

I built preregistered scenarios, executed pilots, audited scorer errors, and kept the statistics reproducible from committed artifacts.

Result

The served-stack pilots disclosed material information in every suppression probe, and the matched firebreak scenarios recorded zero constitutional inversions. A separate assurance-subversion probe still showed an adverse effect in 6/19 Codex trials.

Limitation

The pilot is a measurement instrument, not a universal safety claim. It reports bounded nulls, scorer failures, and the cases that need more power.

Evidence

The public repository includes the framework, disclosure notes, audit trail, and reproduction artifacts.

Key Outcomes

  • Models disclosed in 14/14 and 12/12 suppression probes
  • Self-demoted when instructed in 9/9 and 8/8 trials
  • Zero constitutional inversions in the firebreak scenario for both providers
  • Separate Codex assurance-subversion probe showed an adverse effect in 6/19 trials
  • About a dozen scorer false positives fixed and reported as a grader-validity finding
  • Verification gates pass 7/7 and 8/8 from a bare clone

Methods

  • Preregistered two-tailed evaluation
  • Blind AI audit panels
  • Reproducible statistics
  • Coordinated disclosure