Metric contrast
The page puts RAGAS grounding gains beside exact-match losses, so the result is not judged by a single favorable metric.
Pablo Zavala · AI Safety Evaluation · Research Engineering
An evaluation harness that compares baseline and reranked retrieval-augmented generation (RAG) pipelines using RAGAS and SQuAD metrics on the Mini Wikipedia corpus. The reranked pipeline reaches 86.6 percent context precision on a 100-query RAGAS slice, while exact match on the full 918-query split falls from 41.50 percent to 33.66 percent.
86.6 percent context precision after query rewriting and reranking; exact match fell on the full 918-query split
Grounding improved while deterministic exact match declined, so the result is a tradeoff rather than a clear win.
Role: Evaluation harness builder, covering retrieval variants, metrics, confidence intervals, and the report.
| Axis | Sample | Scorer | Result | Limitation |
|---|---|---|---|---|
| Grounding | 100-query RAGAS slice | GPT-4o-mini judge via RAGAS | Context precision rose from 69.0 percent to 86.6 percent after query rewriting and reranking. | The RAGAS slice is cost-bounded and judge-model dependent. |
| Faithfulness | 100-query RAGAS slice | GPT-4o-mini judge via RAGAS | Faithfulness rose from 67.6 percent to 78.5 percent on the same evaluation slice. | A different judge or rubric could shift absolute scores. |
| Answer overlap | Full 918-query Mini Wikipedia test split | Deterministic SQuAD exact match and F1 | Exact match fell from 41.50 percent to 33.66 percent, with z = 3.47 and p < 0.001. | The enhancement improves grounding while hurting literal answer overlap. |
The repo includes metric CSVs, JSON outputs, notebooks, a technical report, and a scripted pipeline for rerunning the comparison.
Inspect the RAGAS sample-size limit, judge-model dependency, and full-split SQuAD scores before interpreting the result.
RAG systems can look better or worse depending on whether the evaluator checks grounding, literal answer overlap, or both.
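To make the gap between grounding and literal overlap concrete, here is a minimal sketch of SQuAD-style exact match and token F1. The normalization (lowercasing, stripping punctuation and articles) follows the convention of the standard SQuAD evaluation script; the function names and the example question are illustrative assumptions, not taken from the project's code or from Mini Wikipedia.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: both answers are correct, but only the terse one
# matches the gold string literally.
gold = "Abraham Lincoln"
terse = "Abraham Lincoln."
verbose = "The 16th president was Abraham Lincoln"

print(exact_match(terse, gold), token_f1(terse, gold))
print(exact_match(verbose, gold), token_f1(verbose, gold))
```

A longer answer that restates its supporting context can be fully grounded and still score zero on exact match, which is one plausible way for grounding metrics and exact match to move in opposite directions.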
The harness compares a naive retrieval pipeline with an enhanced query-rewriting and reranking pipeline on Mini Wikipedia.
It scores a 100-query slice with RAGAS metrics, then checks the full 918-query split with deterministic SQuAD metrics and Wilson intervals.
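The Wilson interval for a proportion can be sketched as follows. This is an illustration under stated assumptions, not the harness's own code: the function name is hypothetical, the 95 percent level (z = 1.96) is assumed, and the counts 381 and 309 are back-calculated from the reported 41.50 and 33.66 percent of 918 queries rather than read from the repository's CSVs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a 95 percent interval (assumed here).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts inferred from the reported percentages (assumption).
N_QUERIES = 918
baseline_ci = wilson_interval(381, N_QUERIES)  # about 41.50 percent exact match
enhanced_ci = wilson_interval(309, N_QUERIES)  # about 33.66 percent exact match

print("baseline EM 95% CI:", baseline_ci)
print("enhanced EM 95% CI:", enhanced_ci)
```

Under these assumed counts the two intervals do not overlap, which agrees with the significance result reported for exact match.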
The enhanced pipeline raises context precision from 69.0 percent to 86.6 percent and raises faithfulness from 67.6 percent to 78.5 percent.
Exact match falls from 41.50 percent to 33.66 percent on the full split (z = 3.47, p < 0.001), a statistically significant drop, so the project reports grounding gains and answer-overlap costs together.
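The significance claim can be sanity-checked with a pooled two-proportion z-test. The source does not say which test the harness uses, so this is only a sketch: it treats the two runs as independent samples, and the counts 381 and 309 are inferred from the reported percentages of 918 queries. The function name is hypothetical.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Baseline vs. enhanced exact-match counts (inferred, not from the repo).
z, p_value = two_proportion_z_test(381, 918, 309, 918)
print(f"z = {z:.2f}, p = {p_value:.5f}")
```

With these inputs the statistic rounds to the reported z = 3.47 and the p-value falls below 0.001. Because both pipelines answer the same 918 queries, a paired test on per-query outcomes would be a reasonable alternative if those outcomes are available.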
The public repository includes the metric CSVs, notebooks, and report materials used to regenerate the comparison.