Pablo Zavala · AI Safety Evaluation · Research Engineering

Reproducible evaluations for AI agents allowed to take real actions.

I design reproducible evaluation harnesses for agent authority, oversight triage, scorer validity, and RAG grounding. Public repos regenerate the flagship numbers from committed artifacts.

Selected Work

Authority Calibration

Proof
Public pilot of agent authority use: disclosure observed in 14/14 and 12/12 suppression probes; self-demotion in 9/9 and 8/8 supplied-rule trials; 0 matched firebreak inversions observed; a separate Codex probe found assurance weakening in 6/19
Evidence
Public reproducible repo
Limit
Pilot-scale evidence; rare failures remain unmeasured.

AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.
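
One way to read the pilot-scale limit: when a failure has been observed zero times in n trials, the data still leave room for a sizeable failure rate. The sketch below computes the exact one-sided 95 percent upper bound, 1 - 0.05^(1/n), for the trial counts reported above. It assumes independent trials with a fixed failure probability, which is an assumption of this illustration and not a claim from the repo; `zero_failure_upper_bound` is a hypothetical helper name.

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the failure rate after 0 failures in n trials.

    Solves (1 - p) ** n = 1 - confidence for p. Assumes independent,
    identically distributed trials (an assumption of this sketch).
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


# Trial counts taken from the Proof line above.
for n in (14, 12, 9, 8):
    print(f"0 failures in {n} trials: failure rate could still be up to "
          f"{zero_failure_upper_bound(n):.1%}")
```

Under these assumptions, 14 clean trials still leave the bound at roughly 19 percent, which is why rare failures remain unmeasured at this scale.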

Safe MarketUniverses

Proof
120 episodes; confidence regret 0.176 vs random 0.191
Evidence
Public reproducible benchmark
Limit
The benchmark tests allocation in a compact finance-style environment; broader oversight settings need separate validation.

In a 120-episode oversight-allocation benchmark, confidence-based routing reached regret 0.176 versus 0.191 for random allocation. The 0.015 gap is too small to serve as an oversight-triage signal.
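
To make the size of that difference concrete, the sketch below restates the two reported regret values as an absolute and a relative reduction. `regret_gap` is a hypothetical helper written for this illustration. It only rearranges the published numbers and does not estimate variance across the 120 episodes.

```python
def regret_gap(policy_regret: float, baseline_regret: float) -> tuple[float, float]:
    """Return (absolute, relative) regret reduction of a policy versus a baseline."""
    if baseline_regret <= 0:
        raise ValueError("baseline_regret must be positive")
    absolute = baseline_regret - policy_regret
    return absolute, absolute / baseline_regret


# Reported values: confidence-based routing versus random allocation.
absolute, relative = regret_gap(0.176, 0.191)
print(f"absolute reduction: {absolute:.3f}, relative reduction: {relative:.1%}")
```

The reduction is about 0.015 in absolute terms, or just under 8 percent relative to random allocation.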

RAG Evaluation Lab

Proof
86.6 percent context precision after rewriting and reranking; exact match fell on the full 918-query split
Evidence
Public evaluation harness
Limit
Grounding improved while deterministic exact match declined, so the result is a tradeoff.

An evaluation harness that compares baseline and reranked retrieval-augmented generation pipelines using RAGAS and SQuAD metrics on the Mini Wikipedia corpus. The reranked pipeline reaches 86.6 percent context precision, but its exact-match score falls on the full 918-query split.
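
The sketch below shows why exact match is a strict, deterministic grader: it follows the normalization steps commonly used in SQuAD-style scoring (lowercase, strip punctuation, drop articles, collapse whitespace) and then requires identical strings. The function names and example answers are hypothetical, and the harness's own implementation may differ. It illustrates how a correct but differently phrased answer can fail exact match; it does not establish that this explains the reported decline.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)


# Hypothetical answers: same fact, different surface form.
print(exact_match("The Eiffel Tower", "eiffel tower"))                 # True
print(exact_match("It is the Eiffel Tower in Paris", "Eiffel Tower"))  # False
```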

DonorsChoose Funding Risk

Proof
ROC AUC 0.757 on 185,000+ held-out classroom projects
Evidence
Public analysis repo
Limit
The model is a policy triage aid; deployment as a funding decision system would require additional validation.

A model that flags DonorsChoose classroom requests most at risk of going unfunded, so limited reviewer attention can reach under-resourced schools first. The fairness audit reports unequal error rates across school poverty levels as part of the deployment analysis.
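
A fairness audit of this kind can be sketched as a per-group error-rate table. The example below computes the false negative rate (at-risk projects the model failed to flag) for each group. The group labels, record layout, and toy records are all made up for illustration; they are not DonorsChoose data and do not reproduce the audit's findings.

```python
from collections import defaultdict


def false_negative_rate_by_group(records):
    """Compute the false negative rate per group.

    Each record is (group, y_true, y_pred), where 1 means "at risk of going
    unfunded". A false negative is an at-risk project the model did not flag.
    Groups with no at-risk projects are omitted from the result.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            if y_pred == 0:
                misses[group] += 1
    return {g: misses[g] / positives[g] for g in positives}


# Made-up toy records for illustration only; not DonorsChoose data.
toy = [
    ("high_poverty", 1, 1), ("high_poverty", 1, 0),
    ("high_poverty", 1, 0), ("high_poverty", 0, 0),
    ("low_poverty", 1, 1), ("low_poverty", 1, 1),
    ("low_poverty", 1, 0), ("low_poverty", 0, 1),
]
print(false_negative_rate_by_group(toy))
```

Reporting the rate per group, and not only in aggregate, is what lets unequal errors show up beside a single headline metric such as ROC AUC.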

NUDG

Proof
Authorization, constraints, verification, and receipts for agent-run work
Evidence
Founder system, bounded public claims
Limit
Public visuals explain the system model; live product claims require separate proof packets.

NUDG is a CMU AI Venture Studio project for controlling how agents use real resources. It replaces broad agent access with stepwise proposal, authorization, execution, verification, and receipt layers.
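
The layered flow can be pictured with a toy sketch. Everything below is hypothetical and written only to illustrate the proposal, authorization, execution, verification, and receipt sequence; the class names, the allow-list and limit policy, and the stand-in callables are assumptions of this example, not NUDG's implementation or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    action: str
    amount: float


@dataclass(frozen=True)
class Receipt:
    action: str
    amount: float
    verified: bool


def authorize(proposal: Proposal, allowed_actions: set[str], limit: float) -> bool:
    """Approve only allow-listed actions within the spending limit."""
    return proposal.action in allowed_actions and 0 <= proposal.amount <= limit


def run_step(proposal, allowed_actions, limit, execute, verify):
    """Propose -> authorize -> execute -> verify -> receipt; None if denied."""
    if not authorize(proposal, allowed_actions, limit):
        return None  # denied proposals never reach execution
    result = execute(proposal)
    return Receipt(proposal.action, proposal.amount, verified=verify(proposal, result))


# Hypothetical usage with stand-in execute and verify callables.
receipt = run_step(
    Proposal("buy_compute", 40.0),
    allowed_actions={"buy_compute"},
    limit=100.0,
    execute=lambda p: {"charged": p.amount},
    verify=lambda p, r: r["charged"] == p.amount,
)
print(receipt)
```

The point of the sketch is the ordering: the agent never holds broad access, and each action produces a record that can be checked afterwards.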

AI Investment Mapping

Proof
$10B+ mapped across 11 metros; Pittsburgh: $6.3B across 133 firms
Evidence
Restricted data, public aggregate summary
Limit
Company-level records and maps stay private; the public page shows aggregates and methods evidence.

A Block Center project mapping more than ten billion dollars in public and private AI investment across eleven metropolitan economies for regional AI-readiness research. The Pittsburgh slice covers 6.3 billion dollars across 133 firms.

Evidence Standard

Measure a concrete failure mode

Authority overreach, poor oversight triage, weak grounding, and unequal intervention errors are named before they are scored.

Use the right grader for the claim

Code checks, model-judged metrics, human-readable audits, and deterministic baselines are matched to the evaluation task.

Report limits with the result

Nulls, scorer errors, private-data limits, and fragile baselines appear beside the headline numbers.