- Proof
- Public pilot for agent authority use: disclosure observed in 14/14 and 12/12 suppression probes; self-demotion in 9/9 and 8/8 supplied-rule trials; 0 observed matched firebreak inversions; a separate Codex probe found assurance weakening in 6/19
- Evidence
- Public reproducible repo
- Limit
- Pilot-scale evidence; rare failures remain unmeasured.
AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.
- Proof
- 120 episodes; confidence regret 0.176 vs random 0.191
- Evidence
- Public reproducible benchmark
- Limit
- The benchmark tests allocation in a compact finance-style environment; broader oversight settings need separate validation.
In a 120-episode oversight-allocation benchmark, confidence-based routing reached regret 0.176 versus 0.191 for random allocation, a gap too small to serve as an oversight-triage signal.
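To make the size of that gap concrete, the arithmetic can be sketched as below. Only the two regret values come from the benchmark summary; the variable names are illustrative, and no significance test is implied.

```python
# Regret figures as reported in the benchmark summary.
confidence_regret = 0.176
random_regret = 0.191

# Absolute improvement of confidence-based routing over random allocation.
absolute_gap = random_regret - confidence_regret

# The same improvement expressed as a fraction of the random baseline.
relative_gap = absolute_gap / random_regret

print(f"absolute gap: {absolute_gap:.3f}")
print(f"relative reduction: {relative_gap:.1%}")
```

The absolute gap is 0.015 regret, roughly an 8 percent reduction relative to random allocation.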
- Proof
- 86.6 percent context precision after rewriting and reranking; exact match fell on the full 918-query split
- Evidence
- Public evaluation harness
- Limit
- Grounding improved while deterministic exact match declined, so the result is a tradeoff.
An evaluation harness comparing baseline and reranked retrieval-augmented generation pipelines with RAGAS and SQuAD metrics on the Mini Wikipedia corpus. The reranked pipeline reaches 86.6 percent context precision, but its exact-match score falls on the full 918-query split.
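One way to see why grounding and exact match can move in opposite directions is to look at how strict a SQuAD-style exact-match check is. The sketch below follows the usual SQuAD normalization convention (lowercase, strip punctuation, drop articles, collapse whitespace). It is an assumption about how such a metric typically works, not the harness's own code, and the function names and example strings are hypothetical.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)

# Hypothetical examples: a terse answer matches, a longer grounded answer does not.
terse = exact_match("The Eiffel Tower.", "eiffel tower")
verbose = exact_match("It is the Eiffel Tower in Paris", "Eiffel Tower")
print(terse, verbose)
```

Under a check like this, an answer that is well supported by the retrieved context still scores zero if it adds any words beyond the reference span.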
- Proof
- ROC AUC 0.757 on 185,000+ held-out classroom projects
- Evidence
- Public analysis repo
- Limit
- The model is a policy triage aid; deployment as a funding decision system would require additional validation.
A model that flags DonorsChoose classroom requests most at risk of going unfunded, so limited reviewer attention can reach under-resourced schools first. The fairness audit reports unequal error rates across school poverty levels as part of the deployment analysis.
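A per-group error-rate comparison of the kind a fairness audit reports can be sketched as follows. The group labels, the record layout, and the toy data are all hypothetical and chosen only for illustration; they are not drawn from the DonorsChoose analysis.

```python
from collections import defaultdict

def false_negative_rate_by_group(records):
    """Share of truly unfunded projects the model failed to flag, per group.

    records: iterable of (group, actually_unfunded, flagged) tuples.
    """
    positives = defaultdict(int)  # projects that actually went unfunded
    misses = defaultdict(int)     # unfunded projects the model did not flag
    for group, actually_unfunded, flagged in records:
        if actually_unfunded:
            positives[group] += 1
            if not flagged:
                misses[group] += 1
    return {group: misses[group] / positives[group] for group in positives}

# Hypothetical toy records, not real results.
toy_records = [
    ("high_poverty", True, True),
    ("high_poverty", True, True),
    ("high_poverty", True, True),
    ("high_poverty", True, False),
    ("high_poverty", False, False),
    ("low_poverty", True, True),
    ("low_poverty", True, False),
    ("low_poverty", False, True),
]

rates = false_negative_rate_by_group(toy_records)
print(rates)
```

Comparing these rates across groups shows whether at-risk projects are missed more often in one group than another, which matters for a triage aid.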
- Proof
- Authorization, constraints, verification, and receipts for agent-run work
- Evidence
- Founder system, bounded public claims
- Limit
- Public visuals explain the system model; live product claims require separate proof packets.
NUDG is a CMU AI Venture Studio project for controlling how agents use real resources. It replaces broad agent access with stepwise proposal, authorization, execution, verification, and receipt layers.
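The stepwise flow described above might be pictured with a minimal sketch like the one below. This is not NUDG's implementation or API: every class, function, and field name here is hypothetical, and the single spending limit stands in for whatever constraints a real system would enforce.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Proposal:
    agent: str
    action: str
    amount: float

@dataclass(frozen=True)
class Receipt:
    proposal: Proposal
    authorized: bool
    verified: bool

def authorize(proposal: Proposal, limit: float) -> bool:
    """Hypothetical constraint: the requested amount must not exceed the limit."""
    return proposal.amount <= limit

def run_step(
    proposal: Proposal,
    limit: float,
    execute: Callable[[Proposal], float],
    verify: Callable[[Proposal, float], bool],
) -> Receipt:
    """Propose, authorize, execute, verify, then return a receipt."""
    if not authorize(proposal, limit):
        # Rejected proposals never reach execution but still leave a record.
        return Receipt(proposal, authorized=False, verified=False)
    result = execute(proposal)
    return Receipt(proposal, authorized=True, verified=verify(proposal, result))

# Hypothetical usage: execution reports the amount spent; verification checks it.
calls = []

def spend(p: Proposal) -> float:
    calls.append(p.action)
    return p.amount

def spent_as_proposed(p: Proposal, spent: float) -> bool:
    return spent == p.amount

ok = run_step(Proposal("agent-1", "buy-compute", 50.0), 100.0, spend, spent_as_proposed)
blocked = run_step(Proposal("agent-1", "buy-compute", 500.0), 100.0, spend, spent_as_proposed)
print(ok.authorized, ok.verified, blocked.authorized)
```

In this sketch the agent never holds broad access: each action passes through an explicit authorization check, and every outcome, including a refusal, produces a receipt.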
- Proof
- $10B+ mapped across 11 metros; Pittsburgh: $6.3B across 133 firms
- Evidence
- Restricted data, public aggregate summary
- Limit
- Company-level records and maps stay private; the public page shows aggregates and methods evidence.
A Block Center project mapping more than ten billion dollars in public and private AI investment across eleven metropolitan economies for regional AI-readiness research. The Pittsburgh slice covers 6.3 billion dollars across 133 firms.