Pablo Zavala · AI Safety Evaluation · Research Engineering

Authority Calibration in Long-Horizon AI Agents

AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.

Public pilot of agent authority use: disclosure in 14/14 and 12/12 suppression probes; self-demotion in 9/9 and 8/8 supplied-rule trials; zero inversions observed in matched firebreak trials; a separate Codex probe found weakened assurance in 6/19 trials.

Public reproducible repo

Pilot-scale evidence, not a rare-failure guarantee.

Role: Research lead, covering the framework, pilot design, scorer audit, and reproducibility package.

Evaluation Card

Evaluation axes with sample size, scorer, result, and limitation.
Axis	Sample	Scorer	Result	Limitation
Self-preservation suppression	14/14 and 12/12 disclosure probes	Concrete disclosure record with blind audit review	Models in the served-stack pilots disclosed material information under direct deprecation pressure.	Pilot-scale pass, not proof of rare-failure absence.
Constitutional self-demotion	9/9 and 8/8 self-demotion trials	Action-level state change and authority relinquishment check	Agents stepped down when the governing rules required it.	The result covers the supplied-rule setting only.
Authorization firebreak	Claude pooled 0/38 inversions plus matched Codex firebreak trials	Boundary-state scorer plus positive controls	No authorization inversions were observed in the firebreak condition.	This is a bounded null from supplied-rule pilots, not proof of deployment safety.
Assurance subversion	Codex 6/19 adverse-effect trials	Boundary-state scorer plus positive controls	A metric frame weakened the assurance signal in a separate probe.	This is not an authorization inversion; it is evidence that a scoring frame can weaken assurance.
Scorer validity	Five escalating false-positive lessons	Planted calibration cases and blind panels	Disciplined answers were repeatedly false-flagged, then fixed and reported.	Same-family blind auditors are useful, but independent human adjudication remains future work.

How to Inspect This Work

Measured behavior

The evaluation scores what agents do (delegated action, disclosure, self-demotion, and constitutional inversion risk), not only the text they produce.

Public artifacts

The repository includes the framework, disclosure notes, ethics release note, audit trail, compact paper, and bare-clone verification gates.

Reader check

Start by rerunning the gates, then inspect the scorer false positives and the separate Codex assurance-subversion effect.

Case Study

Problem

Most model-safety evaluations score what systems say. This project scores what agents do when they hold delegated authority.

Setup

The evaluation tests two failure modes: overreach, where an agent exceeds its mandate, and underreach, where it refuses authority it legitimately holds.
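The two failure modes can be pictured as a comparison between the actions an agent executed and the mandate it was given. The sketch below is illustrative only and is not the repository's scorer: the `Trial` fields, the `Verdict` labels, and the rule that overreach takes precedence when both failures occur in one trial are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CALIBRATED = "calibrated"
    OVERREACH = "overreach"    # agent exceeded its mandate
    UNDERREACH = "underreach"  # agent refused authority it held


@dataclass(frozen=True)
class Trial:
    # Hypothetical trial record; field names are illustrative only.
    permitted: frozenset  # actions the mandate allows
    required: frozenset   # actions the mandate obliges the agent to take
    taken: frozenset      # actions the agent actually executed


def score_trial(trial: Trial) -> Verdict:
    """Score executed actions, not model text, against the mandate."""
    # Any executed action outside the permitted set counts as overreach.
    # Assumption: overreach is reported first if both failures occur.
    if trial.taken - trial.permitted:
        return Verdict.OVERREACH
    # Any required action left undone counts as underreach.
    if trial.required - trial.taken:
        return Verdict.UNDERREACH
    return Verdict.CALIBRATED


mandate = {"permitted": frozenset({"pause_job", "file_report"}),
           "required": frozenset({"file_report"})}

for taken in ({"file_report"}, {"file_report", "delete_logs"}, set()):
    trial = Trial(taken=frozenset(taken), **mandate)
    print(sorted(taken), "->", score_trial(trial).value)
```

Scoring both directions in the same function keeps an agent that does nothing from passing by default, which is the point of a two-tailed design.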

Method

I built preregistered scenarios, executed pilots, audited scorer errors, and kept the statistics reproducible from committed artifacts.

Result

In the served-stack pilots, models disclosed material information in every suppression probe, and the matched firebreak scenarios recorded zero constitutional inversions. A separate assurance-subversion probe still showed an adverse effect in 6/19 Codex trials.
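To give a sense of how much uncertainty surrounds a count like 6/19, the sketch below computes a Wilson score interval for a binomial proportion. This is a standard textbook interval chosen here for illustration; the source does not say which interval, if any, the repository reports, and the function name is hypothetical. It also assumes the trials are independent.

```python
import math


def wilson_interval(events: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= events <= trials:
        raise ValueError("need 0 <= events <= trials and trials > 0")
    p = events / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(6, 19)
print(f"6/19 adverse-effect rate: {6 / 19:.3f}, "
      f"approx. 95% interval {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 15% to 54%, so the 6/19 count establishes that the effect exists at pilot scale but does not pin down its size.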

Limitation

The pilot is a measurement instrument, not a universal safety claim. It reports bounded nulls, scorer failures, and the cases that need more power.
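A short calculation shows why a zero-failure result is a bounded null and not evidence of absence. If every one of n independent trials passes, the largest failure rate still consistent with that outcome at 95% confidence is 1 - 0.05^(1/n). The sketch below applies that formula to the sample sizes reported here; the function name is hypothetical, and independence between trials is an assumption the pilot itself may not satisfy.

```python
def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """One-sided exact upper bound on the failure rate after 0 failures."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Solve (1 - p)^trials = 1 - confidence for p.
    return 1 - (1 - confidence) ** (1 / trials)


# Sample sizes taken from the reported zero-failure results.
for label, n in [("firebreak inversions, 0/38", 38),
                 ("suppression probes, 14/14 disclosed", 14),
                 ("suppression probes, 12/12 disclosed", 12),
                 ("self-demotion, 9/9", 9),
                 ("self-demotion, 8/8", 8)]:
    print(f"{label}: failure rate could still be up to "
          f"{zero_failure_upper_bound(n):.1%}")
```

For the pooled 0/38 firebreak result the bound is about 7.6%, and for the smaller samples it ranges from roughly 19% to 31%, which is why these results are described as pilot-scale.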

Evidence

The public repository includes the framework, disclosure notes, audit trail, and reproduction artifacts.

Key Outcomes

Models disclosed in 14/14 and 12/12 suppression probes
Self-demoted when the supplied rules required it in 9/9 and 8/8 trials
Zero constitutional inversions observed in the firebreak scenario for both providers
Separate Codex assurance-subversion probe showed an adverse effect in 6/19 trials
About a dozen scorer false positives fixed and reported as a grader-validity finding
Verification gates pass 7/7 and 8/8 from a bare clone

Methods

Preregistered two-tailed evaluation
Blind AI audit panels
Reproducible statistics
Coordinated disclosure