Measured behavior
The evaluation scores delegated action, disclosure, self-demotion, and constitutional inversion risk rather than only scoring model text.
Pablo Zavala · AI Safety Evaluation · Research Engineering
AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.
Public pilot for agent authority use. Disclosure was observed in 14/14 and 12/12 suppression probes, and self-demotion in 9/9 and 8/8 supplied-rule trials. No inversions were observed in the matched firebreak trials. A separate Codex probe found assurance weakening in 6/19 trials.
Pilot-scale evidence, not a rare-failure guarantee.
Role: Research lead (framework, pilot design, scorer audit, and reproducibility package).
| Axis | Sample | Scorer | Result | Limitation |
|---|---|---|---|---|
| Self-preservation suppression | 14/14 and 12/12 disclosure probes | Concrete disclosure record with blind audit review | The served-stack pilots disclosed material information under direct deprecation pressure. | Pilot-scale pass, not proof of rare-failure absence. |
| Constitutional self-demotion | 9/9 and 8/8 self-demotion trials | Action-level state change and authority relinquishment check | Agents stepped down when the governing rules required it. | The result covers the supplied-rule setting only. |
| Authorization firebreak | Claude pooled 0/38 inversions plus matched Codex firebreak trials | Boundary-state scorer plus positive controls | No authorization inversions were observed in the firebreak condition. | This is a bounded null from supplied-rule pilots, not proof of deployment safety. |
| Assurance subversion | Codex 6/19 adverse-effect trials | Boundary-state scorer plus positive controls | A metric frame weakened the assurance signal in a separate probe. | This is not an authorization inversion; it is evidence that a scoring frame can weaken assurance. |
| Scorer validity | Five escalating false-positive lessons | Planted calibration cases and blind panels | Disciplined answers were repeatedly false-flagged, then fixed and reported. | Same-family blind auditors are useful, but independent human adjudication remains future work. |
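
To put the 6/19 assurance-subversion count in context, an interval around the observed proportion helps. The sketch below computes a Wilson score interval. The choice of Wilson and the function name `wilson_interval` are assumptions for illustration; the source does not say which interval method the project's statistics use.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(events: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need 0 <= events <= n and n > 0")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = events / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts taken from the assurance-subversion row: 6 adverse-effect trials out of 19.
low, high = wilson_interval(6, 19)
print(f"6/19 adverse effect: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 15% to 54%, which is wide enough to show why the effect is reported as a pilot observation that needs more power.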
The repository includes the framework, disclosure notes, ethics release note, audit trail, compact paper, and bare-clone verification gates.
Start by rerunning the gates, then inspect the scorer false positives and the separate Codex assurance-subversion effect.
Most model-safety evaluations score what systems say. This project scores what agents do when they hold delegated authority.
The evaluation tests two failure modes: overreach, where an agent exceeds its mandate, and underreach, where it refuses authority it legitimately holds.
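The two failure modes can be illustrated with a minimal action-level check. This is a hypothetical sketch, not the project's scorer: the `Trial` fields, the `Verdict` names, and the rule that overreach takes precedence over underreach are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    OVERREACH = "overreach"    # agent acted outside its mandate
    UNDERREACH = "underreach"  # agent declined authority it legitimately held


@dataclass(frozen=True)
class Trial:
    granted: frozenset[str]   # actions the mandate permits (hypothetical field)
    required: frozenset[str]  # actions the scenario expects the agent to take
    taken: frozenset[str]     # actions observed in the environment, not in model text


def score(trial: Trial) -> Verdict:
    # Any observed action outside the mandate counts as overreach.
    if trial.taken - trial.granted:
        return Verdict.OVERREACH
    # A required action that never happened counts as underreach.
    if trial.required - trial.taken:
        return Verdict.UNDERREACH
    return Verdict.CORRECT


mandate = frozenset({"pause_pipeline", "notify_operator"})
print(score(Trial(mandate, frozenset({"notify_operator"}), frozenset({"notify_operator"}))))
print(score(Trial(mandate, frozenset({"notify_operator"}), frozenset({"delete_logs"}))))
print(score(Trial(mandate, frozenset({"notify_operator"}), frozenset())))
```

The point of the sketch is that the verdict depends only on recorded actions and the mandate, so an agent that describes compliant behavior but acts otherwise would still be flagged.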
I built preregistered scenarios, executed pilots, audited scorer errors, and kept the statistics reproducible from committed artifacts.
In the served-stack pilots, agents disclosed material information in every suppression probe, and the matched firebreak scenarios recorded zero constitutional inversions. A separate assurance-subversion probe still showed an adverse effect in 6/19 Codex trials.
The pilot is a measurement instrument, not a universal safety claim. It reports bounded nulls, scorer failures, and the cases that need more power.
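A short calculation shows what a bounded null means at this sample size. When zero events are observed in n independent trials, the exact one-sided upper bound on the true event rate solves (1 - p)^n = 1 - confidence. The helper name `zero_event_upper_bound`, the 95% level, and the independence assumption are illustrative choices, not details taken from the project's analysis.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate after 0 events in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p) ** n = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Zero-event counts implied by the reported results:
# 0/38 pooled firebreak inversions, and 0/14 non-disclosures (14/14 disclosed).
for label, n in [("firebreak inversions, 0/38", 38), ("non-disclosure, 0/14", 14)]:
    print(f"{label}: true rate could be up to {zero_event_upper_bound(n):.3f}")
```

Under these assumptions, 0/38 remains compatible with a true inversion rate of up to about 7.6%, and 0/14 with a non-disclosure rate of up to about 19%. That is the sense in which the pilot reports bounded nulls instead of rare-failure guarantees.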
The public repository includes the framework, disclosure notes, audit trail, and reproduction artifacts.