Pablo Zavala · AI Safety Evaluation · Research Engineering

RAG Evaluation Lab

An evaluation harness that compares baseline and reranked retrieval-augmented generation (RAG) pipelines on the Mini Wikipedia corpus, using RAGAS and SQuAD metrics. The reranked pipeline reaches 86.6 percent context precision, but exact match falls from 41.50 percent to 33.66 percent.

86.6 percent context precision after rewriting and reranking; exact match fell on the full 918-query split

Public evaluation harness

Grounding improved while deterministic exact match declined, so the result is a tradeoff.

Role: Built the evaluation harness, including the retrieval variants, metrics, confidence intervals, and report.

Evaluation Card

Evaluation axes with sample size, scorer, result, and limitation.
Axis | Sample | Scorer | Result | Limitation
Grounding | 100-query RAGAS slice | GPT-4o-mini judge via RAGAS | Context precision rose from 69.0 percent to 86.6 percent after query rewriting and reranking. | The RAGAS slice is cost-bounded and judge-model dependent.
Faithfulness | 100-query RAGAS slice | GPT-4o-mini judge via RAGAS | Faithfulness rose from 67.6 percent to 78.5 percent on the same evaluation slice. | A different judge or rubric could shift absolute scores.
Answer overlap | Full 918-query Mini Wikipedia test split | Deterministic SQuAD exact match and F1 | Exact match fell from 41.50 percent to 33.66 percent, with z = 3.47 and p < 0.001. | The enhancement improves grounding while hurting literal answer overlap.

How to Inspect This Work

Metric contrast

The page puts RAGAS grounding gains beside exact-match losses, so the result is not judged by a single favorable metric.

Artifact trail

The repo includes metric CSVs, JSON outputs, notebooks, a technical report, and a scripted pipeline for rerunning the comparison.

Reader check

Inspect the RAGAS sample-size limit, judge-model dependency, and full-split SQuAD scores before interpreting the result.

Case Study

Problem

RAG systems can look better or worse depending on whether the evaluator checks grounding, literal answer overlap, or both.
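To make the gap between grounding and literal overlap concrete, here is a minimal sketch of SQuAD-style exact match and token F1. It follows the usual SQuAD normalization (lowercase, strip punctuation, drop articles, collapse whitespace); the function names and the example question are illustrative assumptions, not code or data taken from the repository.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: a correct but verbose answer.
reference = "Abraham Lincoln"
verbose = "The 16th president was Abraham Lincoln."
em = exact_match(verbose, reference)  # 0.0, because of the extra words
f1 = f1_score(verbose, reference)     # about 0.57, partial credit for overlap
```

A longer answer that contains the right entity scores zero on exact match and only partial F1, which is one plausible way a pipeline can gain on grounding while losing on literal overlap.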

Setup

The harness compares a naive retrieval pipeline with an enhanced query-rewriting and reranking pipeline on Mini Wikipedia.
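The shape of the enhanced pipeline can be sketched as "rewrite, retrieve a wide candidate set, then rerank". The sketch below is an assumption-heavy illustration: `rewrite_fn`, `retrieve_fn`, and `score_fn` are hypothetical placeholders, and `overlap_score` is a toy stand-in for a cross-encoder relevance score, not the model the harness uses.

```python
from typing import Callable, List


def rerank(query: str, passages: List[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> List[str]:
    """Score each (query, passage) pair and keep the top_k passages."""
    scored = [(score_fn(query, passage), passage) for passage in passages]
    # Stable sort: passages with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def enhanced_retrieve(query: str,
                      rewrite_fn: Callable[[str], str],
                      retrieve_fn: Callable[[str, int], List[str]],
                      score_fn: Callable[[str, str], float],
                      n_candidates: int = 20, top_k: int = 3) -> List[str]:
    """Rewrite the query, over-retrieve candidates, then rerank them."""
    rewritten = rewrite_fn(query)
    candidates = retrieve_fn(rewritten, n_candidates)
    return rerank(rewritten, candidates, score_fn, top_k)


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score (query-token coverage); NOT a cross-encoder."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


# Hypothetical corpus and components, for illustration only.
corpus = [
    "Berlin is in Germany",
    "France borders Spain",
    "Paris is the capital of France",
]
top = enhanced_retrieve(
    "capital of France",
    rewrite_fn=lambda q: q.strip(),      # placeholder for query rewriting
    retrieve_fn=lambda q, n: corpus[:n], # placeholder for a retriever
    score_fn=overlap_score,
    top_k=2,
)
```

In a real run, `score_fn` would wrap a cross-encoder that scores each query-passage pair jointly, and the naive baseline would correspond to skipping both the rewrite and the rerank steps.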

Method

The harness scores a 100-query slice with RAGAS metrics, then checks the full 918-query split with deterministic SQuAD metrics and reports Wilson confidence intervals.
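For reference, a Wilson score interval for a binomial proportion such as exact match can be computed as in the sketch below. The 95 percent level (z = 1.96) and the counts 381 and 309 out of 918 are assumptions, the counts back-derived from the reported 41.50 and 33.66 percent; the harness's own implementation and settings may differ.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95 percent confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts, back-derived from the reported exact-match percentages.
baseline_ci = wilson_interval(381, 918)  # baseline pipeline
enhanced_ci = wilson_interval(309, 918)  # rewriting + reranking pipeline
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves reasonably for proportions near the boundaries, which makes it a sensible default for per-query pass/fail metrics.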

Result

The enhanced pipeline raises context precision from 69.0 percent to 86.6 percent and raises faithfulness from 67.6 percent to 78.5 percent.

Limitation

Exact match falls from 41.50 percent to 33.66 percent on the full split (z = 3.47, p < 0.001), so the project reports grounding gains and answer-overlap costs together.
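One way to check the significance of the exact-match drop is a pooled two-proportion z-test, sketched below. The page does not say which test produced its z statistic, so the choice of test is an assumption, as are the counts 381 and 309 out of 918 (back-derived from the reported percentages). Under those assumptions the result should land close to the reported z = 3.47.

```python
import math
from typing import Tuple


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: baseline 381/918 correct, enhanced 309/918 correct.
z_stat, p_value = two_proportion_z(381, 918, 309, 918)
```

Because both pipelines answer the same 918 queries, a paired test such as McNemar's would be a reasonable alternative if per-query outcomes are available; the unpaired form is shown here only because it needs nothing beyond the aggregate counts.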

Evidence

The public repository includes the metric CSVs, notebooks, and report materials used to regenerate the comparison.

Key Outcomes

  • Context precision rose from 69.0 percent to 86.6 percent with query rewriting and reranking
  • Faithfulness rose from 67.6 percent to 78.5 percent on the same evaluation slice
  • Exact match fell from 41.50 percent to 33.66 percent on the full 918-query test split (z = 3.47, p < 0.001)

Methods

  • RAGAS metrics
  • SQuAD metrics
  • Cross-encoder reranking
  • Wilson confidence intervals