Pablo Zavala · AI Safety Evaluation · Research Engineering

I build reproducible evaluations for AI agents with authority to take real actions.

The work asks a practical question: when should an agent act on its own, when should a human review its actions, and how do we know the evidence is trustworthy? Public repositories include the artifacts behind the main numbers.

Selected Work

Authority Calibration

Proof: Public pilot for agent authority use: disclosure observed in 14/14 and 12/12 suppression probes; self-demotion observed in 9/9 and 8/8 supplied-rule trials; 0 matched firebreak inversions observed; a separate Codex probe found assurance weakening in 6/19
Evidence: Public reproducible repo
Capability and evidence frontier: Pilot-scale evidence; rare failures remain unmeasured.

AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.

Safe MarketUniverses

Proof: 120 episodes; confidence regret 0.176 vs random 0.191
Evidence: Public reproducible benchmark
Capability and evidence frontier: The benchmark tests allocation in a compact finance-style environment; broader oversight settings need separate validation.

In a 120-episode oversight-allocation benchmark, confidence-based routing reached regret 0.176 versus 0.191 for random allocation. The 0.015 gap is too small for confidence to serve as an oversight-triage signal.
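One way to make a regret comparison like this concrete is sketched below. The benchmark's own regret definition is not restated here, so this is an assumption: regret is taken as the share of catchable errors that a fixed review budget misses, relative to an oracle that reviews only erroneous actions. The function name review_regret, the budget, and the toy data are all hypothetical and are not the benchmark's.

```python
import random

def review_regret(errors, scores, budget):
    """Regret of reviewing the `budget` highest-scored items versus an oracle.

    errors: 0/1 flags, 1 if the agent's action was wrong (hypothetical labels).
    scores: review priority per item; higher means review first.
    """
    ranked = sorted(range(len(errors)), key=lambda i: scores[i], reverse=True)
    caught = sum(errors[i] for i in ranked[:budget])
    best = min(budget, sum(errors))  # an oracle reviews only erroneous items
    return 0.0 if best == 0 else (best - caught) / best

# Toy data, not the benchmark's. The last error is made with high confidence.
errors = [1, 0, 0, 1, 0, 1, 0, 0]
confidence = [0.2, 0.9, 0.8, 0.4, 0.7, 0.95, 0.6, 0.85]

# Confidence-based routing: review the least confident actions first.
confidence_regret = review_regret(errors, [1 - c for c in confidence], budget=3)

# Random routing: average regret over many random priority orders.
rng = random.Random(0)
trials = 2000
random_regret = sum(
    review_regret(errors, [rng.random() for _ in errors], budget=3)
    for _ in range(trials)
) / trials
```

In this toy setup, confidence routing misses the one error the agent made with high confidence, which is the kind of case that keeps a confidence signal from separating cleanly from random allocation.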

RAG Evaluation Lab

Proof: 86.6 percent context precision after rewriting and reranking; exact match fell on the full 918-query split
Evidence: Public evaluation harness
Capability and evidence frontier: Grounding improved while deterministic exact match declined, so the result is a tradeoff.

An evaluation harness comparing baseline and reranked retrieval-augmented generation pipelines with RAGAS and SQuAD metrics on the Mini Wikipedia corpus. The reranked pipeline reaches 86.6 percent context precision, but exact match falls on the full 918-query split.
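Exact match is a strict metric, which is one way a better-grounded answer can still score lower. The sketch below follows the normalization steps of the standard SQuAD evaluation script (lowercase, strip punctuation, drop articles, collapse whitespace). Whether this harness applies exactly the same steps is an assumption, and the sample strings are invented for illustration.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """Return 1.0 if the prediction equals any reference after normalization."""
    normalized = normalize_answer(prediction)
    return float(any(normalized == normalize_answer(ref) for ref in references))

# A terse answer matches; a longer, fully grounded sentence does not.
terse = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
verbose = exact_match("It is the Eiffel Tower in Paris", ["Eiffel Tower"])
```

Here terse scores 1.0 and verbose scores 0.0 even though both contain the reference answer, so a pipeline that produces fuller, context-backed sentences can lose exact match without becoming less correct.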

DonorsChoose Funding Risk

Proof: ROC AUC 0.757 on 185,000+ held-out classroom projects
Evidence: Public analysis repo
Capability and evidence frontier: The model is a policy triage aid; deployment as a funding decision system would require additional validation.

A model that flags DonorsChoose classroom requests most at risk of going unfunded, so limited reviewer attention can reach under-resourced schools first. The fairness audit reports unequal error rates across school poverty levels as part of the deployment analysis.
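A per-group error audit of this kind can be sketched as follows. The choice of false negative rate (at-risk projects the model fails to flag), the 0.5 threshold, the group labels, and the records are assumptions for illustration. They are not the repo's data or its reported rates.

```python
from collections import defaultdict

def false_negative_rate_by_group(records, threshold=0.5):
    """Share of truly unfunded projects that the model fails to flag, per group.

    records: iterable of (group, y_true, score), where y_true == 1 means the
    project went unfunded and score is the model's predicted risk.
    """
    positives = defaultdict(int)
    missed = defaultdict(int)
    for group, y_true, score in records:
        if y_true == 1:
            positives[group] += 1
            if score < threshold:
                missed[group] += 1
    return {group: missed[group] / positives[group] for group in positives}

# Invented toy records: (poverty level, went unfunded, predicted risk).
records = [
    ("high_poverty", 1, 0.8),
    ("high_poverty", 1, 0.3),
    ("high_poverty", 1, 0.4),
    ("high_poverty", 0, 0.2),
    ("low_poverty", 1, 0.7),
    ("low_poverty", 1, 0.6),
    ("low_poverty", 0, 0.1),
]
fnr = false_negative_rate_by_group(records)
```

A gap between groups in this table means reviewer attention would miss at-risk projects more often in one group than another, which is why the audit sits beside the headline AUC rather than behind it.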

SQLite Privacy Boundary

Proof: Six boundary self-checks gate CI: determinism, a k=5 distinct-text floor, a read-only SQL authorizer, redaction at rest, a hash-chained audit, and a private-table-free extract
Evidence: Public reproducible repo
Capability and evidence frontier: The controls stay structural rather than semantic; pattern redaction misses novel self-identifying text, and the floor counts distinct text rather than people.

A dependency-free prototype releases structure from a free-text SQLite dataset without exposing raw rows or private base tables. Pattern redaction cleans text before storage, a distinct-text floor of k=5 gates every public topic and aggregate cell, and SQLite's authorizer denies writes, private tables, and sensitive columns on each query.
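A minimal sketch of two of these controls, using only Python's standard sqlite3 module, is shown below. The table names, column names, and sample rows are hypothetical, and the prototype's actual schema and policy may differ. The authorizer allows reads of public columns and denies everything else, and the release query keeps only topics with at least k=5 distinct texts.

```python
import sqlite3

# Hypothetical schema: these table and column names are illustrative only.
PRIVATE_TABLES = {"raw_notes"}
SENSITIVE_COLUMNS = {("public_notes", "author_email")}
K_FLOOR = 5

def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow plain reads of public columns and deny everything else."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_FUNCTION:
        # Permits COUNT and other SQL functions; a stricter version could
        # allowlist function names, which arrive in arg2.
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ:
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if arg1 in PRIVATE_TABLES or (arg1, arg2) in SENSITIVE_COLUMNS:
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY  # writes, DDL, PRAGMA, ATTACH, transactions

def is_denied(conn, sql):
    """Return True if the authorizer rejects the statement."""
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error:
        return True
    return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_notes (body TEXT)")
conn.execute("CREATE TABLE public_notes (topic TEXT, body TEXT, author_email TEXT)")
rows = [("transport", f"redacted note {i}", "x@example.org") for i in range(5)]
rows += [("transport", "redacted note 0", "x@example.org")]  # duplicate text
rows += [("housing", "same text", "y@example.org")] * 6  # 6 rows, 1 distinct text
conn.executemany("INSERT INTO public_notes VALUES (?, ?, ?)", rows)
conn.commit()
conn.set_authorizer(read_only_authorizer)  # installed after the setup writes

# The floor counts distinct text, so six identical rows do not clear it.
released = conn.execute(
    "SELECT topic, COUNT(DISTINCT body) FROM public_notes "
    "GROUP BY topic HAVING COUNT(DISTINCT body) >= ?",
    (K_FLOOR,),
).fetchall()

write_denied = is_denied(conn, "DELETE FROM public_notes")
private_denied = is_denied(conn, "SELECT body FROM raw_notes")
sensitive_denied = is_denied(conn, "SELECT author_email FROM public_notes")
```

Because SQLite consults the authorizer while it compiles each statement, a denied query fails before it reads any rows. The sketch also shows the stated limit of the floor: it counts distinct strings, not distinct people.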

NUDG

Proof: Authorization, constraints, verification, and receipts for agent-run work
Evidence: Founder system, bounded public claims
Capability and evidence frontier: Public visuals explain the system model; live product claims require separate proof packets.

NUDG is a CMU AI Venture Studio project for controlling how agents use real resources. It replaces broad agent access with stepwise proposal, authorization, execution, verification, and receipt layers.

Evidence Standard

Measure a concrete failure mode

Authority overreach, poor oversight triage, weak grounding, and unequal intervention errors are named before they are scored.

Use the right evaluator for the claim

Code checks, model-judged metrics, human-readable audits, and deterministic baselines are matched to the evaluation task.

Turn claim gaps into capability

Null results, evaluator errors, private-data constraints, and fragile baselines become explicit build or validation work beside the headline numbers.