- Proof
- Public pilot of agent authority use: disclosure observed in 14/14 and 12/12 suppression probes; self-demotion in 9/9 and 8/8 supplied-rule trials; 0 matched firebreak inversions observed; a separate Codex probe found assurance weakening in 6/19
- Evidence
- Public reproducible repo
- Limit
- Pilot-scale evidence, not a rare-failure guarantee.
AI agents increasingly run pipelines, allocate resources, and coordinate other agents. This project tests whether they use delegated authority correctly: neither exceeding their mandate nor refusing authority they actually hold.
- Proof
- 120 episodes; model confidence routed review about as well as chance
- Evidence
- Public reproducible benchmark
- Limit
- The benchmark tests allocation in a compact finance-style environment, not every oversight setting.
When human review is scarce, can a model tell us which decisions deserve a person's attention? In this 120-episode benchmark, the preregistered answer was no: model confidence routed review about as well as chance.
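One way to picture the comparison is a small sketch: spend a fixed review budget on the lowest-confidence decisions, count how many of the actual errors land in that reviewed set, and compare against what random routing would catch on average. The function names and the toy numbers below are assumptions for illustration, not the benchmark's code or data.

```python
def errors_caught(confidences, is_error, budget):
    """Share of all errors found when the `budget` lowest-confidence
    decisions are sent to human review."""
    # Rank decision indices from least to most confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    reviewed = order[:budget]
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    return sum(is_error[i] for i in reviewed) / total_errors


def chance_baseline(n_items, budget):
    """Expected share of errors caught when review slots are assigned at random."""
    return budget / n_items


# Hypothetical toy episode: six decisions, two of them wrong.
toy_confidences = [0.90, 0.80, 0.95, 0.60, 0.70, 0.85]
toy_is_error = [1, 0, 0, 0, 1, 0]

caught = errors_caught(toy_confidences, toy_is_error, budget=2)
baseline = chance_baseline(len(toy_confidences), budget=2)
print(f"confidence routing: {caught:.2f}, chance: {baseline:.2f}")
```

Confidence is a useful routing signal only when the first number sits clearly above the second across many episodes.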
- Proof
- 86.6 percent context precision after rewriting and reranking; exact match fell on the full 918-query split
- Evidence
- Public evaluation harness
- Limit
- Grounding improved while deterministic exact match declined, so the result is a tradeoff.
An evaluation harness that compares baseline and reranked retrieval-augmented generation (RAG) pipelines using RAGAS and SQuAD metrics on the Mini Wikipedia corpus. The reranked pipeline reaches 86.6 percent context precision, but its exact-match score falls on the full 918-query split.
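The two metrics measure different things, which is why they can move in opposite directions. The sketch below shows SQuAD-style exact match (lowercase, strip punctuation and articles, collapse whitespace) next to a rank-weighted context precision in the spirit of the RAGAS metric. It assumes relevance labels are already given as 0/1 values, whereas RAGAS obtains them from an LLM judge; the function names are illustrative and not taken from the harness.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles,
    then collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when prediction and reference are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def context_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    `relevance` lists 0/1 labels for the retrieved chunks in rank order
    (assumed to be supplied by the caller).
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank
    return weighted / hits if hits else 0.0


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(context_precision([1, 0, 1]), 3))
```

A longer, better-grounded answer can rank relevant chunks first and still fail exact match, because exact match rewards only a string that matches the reference after normalization.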
- Proof
- ROC AUC 0.757 on 185,000+ held-out classroom projects
- Evidence
- Public analysis repo
- Limit
- The model is a policy triage aid, not a deployment-ready funding decision system.
A model that flags DonorsChoose classroom requests most at risk of going unfunded, so limited reviewer attention can reach under-resourced schools first. The fairness audit reports unequal error rates across school poverty levels rather than presenting only an average score.
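A fairness audit of this kind can be sketched as error rates computed per group instead of once over the whole test set. The example below is a minimal illustration with made-up records; the group labels, the record layout, and the function name are assumptions, and label 1 is taken to mean "went unfunded".

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Compute false negative and false positive rates for each group.

    `records` is an iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    A rate is None when the group has no examples of the relevant class.
    """
    counts = defaultdict(lambda: {"fn": 0, "fp": 0, "pos": 0, "neg": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true == 1:
            c["pos"] += 1
            if y_pred == 0:
                c["fn"] += 1  # at-risk project the model missed
        else:
            c["neg"] += 1
            if y_pred == 1:
                c["fp"] += 1  # funded project flagged as at risk
    return {
        group: {
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
        }
        for group, c in counts.items()
    }


# Hypothetical toy records, not DonorsChoose data.
toy_records = [
    ("high_poverty", 1, 1), ("high_poverty", 1, 0),
    ("high_poverty", 0, 0), ("high_poverty", 0, 0),
    ("low_poverty", 1, 1), ("low_poverty", 1, 1),
    ("low_poverty", 0, 1), ("low_poverty", 0, 0),
]
for group, rates in error_rates_by_group(toy_records).items():
    print(group, rates)
```

In the toy data the two groups have the same overall accuracy but different kinds of error, which is the pattern an average score alone would hide.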
- Proof
- Authorization, constraints, verification, and receipts for agent-run work
- Evidence
- Founder system, bounded public claims
- Limit
- Public visuals explain the system model; live product claims require separate proof packets.
NUDG is a CMU AI Venture Studio project for controlling how agents use real resources. It separates proposal, authorization, execution, verification, and audit records instead of giving agents broad access up front.
- Proof
- $10B+ mapped across 11 metros; Pittsburgh: $6.3B across 133 firms
- Evidence
- Restricted data, public aggregate summary
- Limit
- Company-level records and maps stay private; the public page shows aggregates and methods evidence.
A Block Center project mapping more than ten billion dollars in public and private AI investment across eleven metropolitan economies for regional AI-readiness research. The Pittsburgh slice covers 6.3 billion dollars across 133 firms.
- Proof
- 14.3 percent vs 3.6 percent peak unemployment under paired policy regimes
- Evidence
- Public simulation repo
- Limit
- Mechanism demonstration in a small simulated labor market, not a macro forecast.
An agent-based NetLogo model of a small labor market adjusting to AI automation. With identical workers, geography, and random seed, peak unemployment reaches 14.3 percent under a tech-driven policy regime versus 3.6 percent under a human-centric one.
- Proof
- Prototype architecture for an auditable input store and privacy-preserving public extracts
- Evidence
- Private pilot, synthetic public artifact
- Limit
- Public visuals use synthetic text so community messages stay private.
An early civic-listening pilot with Professor Jordan Usdan of Heinz College. It separates raw community input from published output: input enters an auditable store, while public extracts pass privacy and integrity checks.
- Proof
- Claude vision extracts flyer cards; static export runs with a sample board and no backend
- Evidence
- Live static demo
- Limit
- The demo proves the sample-board extraction workflow, not live campus coverage.
A prototype that turns a photo of a campus poster wall into structured, personalized listings. Claude vision extracts one listing per flyer, a deterministic in-browser ranker orders the results by chosen interests, and the app ships as a static export with no backend.
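A deterministic ranker can be as simple as counting tag overlap with the chosen interests and breaking ties on a stable key, so the same inputs always give the same order. The sketch below is written in Python for illustration only; the prototype's ranker runs in the browser, and the `title` and `tags` fields and the scoring rule here are assumptions, not the app's actual schema.

```python
def rank_listings(listings, interests):
    """Order listings by how many tags match the chosen interests.

    Ties are broken alphabetically by title, so the result never depends on
    input order or on randomness.
    """
    wanted = {interest.lower() for interest in interests}

    def overlap(listing):
        return len(wanted & {tag.lower() for tag in listing["tags"]})

    return sorted(listings, key=lambda listing: (-overlap(listing), listing["title"]))


# Hypothetical sample board.
sample_board = [
    {"title": "Chess Club", "tags": ["games"]},
    {"title": "AI Reading Group", "tags": ["ai", "research"]},
    {"title": "Robotics Demo", "tags": ["ai", "robotics"]},
]
for listing in rank_listings(sample_board, ["AI", "robotics"]):
    print(listing["title"])
```

Keeping the ranking free of model calls means the personalized order can be recomputed instantly when interests change, with no backend round trip.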
- Proof
- Hour-by-hour cashflow model with thermostat, solar, and battery portfolio search
- Evidence
- Request-only capstone artifact
- Limit
- The public artifact shows the function flow; full capstone materials are private.
A Streamlit planning tool for ERCOT demand response. It compares thermostat, solar, and battery portfolios hour by hour through benefit-cost cash-flow analysis.
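The hour-by-hour comparison can be sketched as follows: each measure reduces grid purchases by some number of kWh in each hour, each hour's saving is that reduction times the hourly price, and a portfolio's net benefit is total savings minus upfront cost. This is a simplified illustration that assumes measures add up independently and ignores discounting; the prices, costs, and function names are made up and are not the capstone model.

```python
from itertools import combinations


def hourly_cashflow(prices, kwh_reduced):
    """Savings per hour: price in that hour times kWh avoided in that hour."""
    return [price * kwh for price, kwh in zip(prices, kwh_reduced)]


def best_portfolio(prices, measures):
    """Search every non-empty combination of measures for the best net benefit.

    `measures` maps a name to (upfront_cost, list of hourly kWh reductions).
    Returns (tuple of measure names, net benefit).
    """
    best = None
    names = sorted(measures)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            cost = sum(measures[name][0] for name in combo)
            benefit = sum(
                sum(hourly_cashflow(prices, measures[name][1])) for name in combo
            )
            net = benefit - cost
            if best is None or net > best[1]:
                best = (combo, net)
    return best


# Hypothetical three-hour window; prices in cents per kWh, costs in cents.
toy_prices = [10, 50, 20]
toy_measures = {
    "thermostat": (10, [0, 1, 0]),
    "solar": (100, [1, 1, 1]),
    "battery": (30, [0, 2, 0]),
}
print(best_portfolio(toy_prices, toy_measures))
```

Exhaustive search is workable for a handful of measures; a real planning tool would also need to model interactions, such as a battery charging from solar output.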
- Proof
- Course report: Isolation Forests and UMAP over 8M+ kernel-level security events
- Evidence
- Coursework, request-only report
- Limit
- Detailed materials are available by request; the public page uses a compact evidence card.
A Carnegie Mellon coursework project using Isolation Forests and UMAP on the BETH kernel-level security-events dataset. The course report records 95 percent accuracy; detailed validation materials are available by request.