Advanced RAG #6: Building a RAG Evaluation Pipeline


In Part 1 we started with a golden set and two baseline numbers, and every post since has judged improvements by those numbers. In this post we grow that makeshift measurement into a proper evaluation pipeline: retrieval and generation each get their own metrics, an LLM judge replaces keyword matching, and the hallucination rate is measured in the same pass. Build it once, and whatever you change (chunking, model, or prompt), you can verify it against the same exam paper.

Retrieval evaluation: recall@k and MRR

Part 1’s retrieval hit rate asked, “is the correct document within the top k results?” Because each golden case has a single correct document, that is exactly the standard metric recall@k (the share of cases where the correct document appears in the top k results). It is worth adding one more: MRR (Mean Reciprocal Rank), the average of 1/rank of the correct document, counting 0 when it is not retrieved at all. MRR reflects how high in the results the correct document lands.

retrieval_metrics.py
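# search() is assumed to be the retriever built in the earlier parts of this series.
def eval_retrieval(golden: list, top_k: int = 5) -> dict:
    """Return recall@k and MRR over the golden set."""
    recall_hits, rr_sum = 0, 0.0
    for case in golden:
        chunks = search(case["question"], top_k=top_k)
        # 1-based rank of the first chunk from the correct document, or None on a miss.
        rank = next(
            (i for i, c in enumerate(chunks, 1) if c["metadata"]["source"] == case["source_doc"]),
            None,
        )
        if rank is not None:
            recall_hits += 1
            rr_sum += 1 / rank  # a miss contributes 0 to the sum
    n = len(golden)  # assumes a non-empty golden set
    return {"recall@k": recall_hits / n, "mrr": rr_sum / n}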

The two metrics play different roles. If recall@k stays the same but MRR rises, it means the answer moved higher up — which captures the effect of order-refining improvements like the reranking in Part 4. Being able to evaluate retrieval on its own also matters. Since you skip generation, it is fast and cheap, so retrieval-side experiments can iterate dozens of times on these metrics alone.
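To see the two metrics come apart, here is a small worked example. The ranks are made up for illustration, not measured, and `metrics_from_ranks` is a hypothetical helper that only mirrors the arithmetic in `eval_retrieval`.

```python
from typing import Optional


def metrics_from_ranks(ranks: list[Optional[int]]) -> dict:
    """Compute recall@k and MRR from the 1-based rank of the correct document.

    None means the correct document was not in the top k.
    Assumes a non-empty list.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None]
    return {
        "recall@k": len(hits) / n,
        "mrr": sum(1 / r for r in hits) / n,
    }


# Illustrative ranks for four questions (made up, not measured).
before = [3, 5, None, 2]  # before reranking
after = [1, 2, None, 1]   # after reranking: same hits, higher ranks

print(metrics_from_ranks(before))
print(metrics_from_ranks(after))
```

In both runs three of four questions are hits, so recall@k stays at 0.75, while MRR rises from about 0.26 to 0.625 because the same documents now sit nearer the top.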

Generation evaluation: the limits of keyword matching and the LLM judge

In Part 1 we graded by whether the answer contained the keywords. Fast, but the limits are clear: it cannot tell “the fee is 10%” apart from “the fee is not 10%”. So for generation evaluation we use an LLM judge (LLM-as-a-judge, grading answer quality with a model). Adapted for RAG grading, the method introduced in LLM App Development Part 12 looks like this.

llm_judge.py
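import json

import anthropic

# Assumed setup: the official anthropic SDK, which reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()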

JUDGE_SYSTEM = """You are a grader for a Q&A system. Given a question, a reference answer, the retrieved document chunks, and the system's answer, grade in JSON.
- correct: true if the system's answer factually matches the reference answer
- grounded: true if the answer stays within the content of the provided document chunks (false if it asserts anything not in the chunks)
- reason: one sentence explaining the verdict
"""

def judge(case: dict, answer: str, chunks: list) -> dict:
    context = "\n---\n".join(c["text"] for c in chunks)
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=300,
        system=JUDGE_SYSTEM,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {case['question']}\n"
                f"Reference answer: {case['reference_answer']}\n"
                f"Document chunks:\n{context}\n"
                f"System's answer: {answer}\n"
                'Output JSON only: {"correct": bool, "grounded": bool, "reason": str}'
            ),
        }],
    )
    text = next(b.text for b in response.content if b.type == "text")
    return json.loads(text)

In the golden set, write a reference answer (reference_answer) instead of keywords. Note that the verdict is split in two: correct measures accuracy, while grounded measures faithfulness to the evidence. The share of cases where grounded is false is the hallucination rate. If Part 5’s citation gate is the real-time line of defense, this metric is the periodic checkup that measures hallucination tendency on every change.
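One practical wrinkle: a model asked for "JSON only" may still wrap the object in a code fence or add a stray sentence, and a bare `json.loads` raises on that. Below is a minimal sketch of a more tolerant parser. `parse_verdict` is a hypothetical helper, not part of any SDK, and it assumes the output contains exactly one JSON object.

```python
import json


def parse_verdict(text: str) -> dict:
    """Extract the judge's JSON object from raw model output.

    Hypothetical helper: tolerates a surrounding code fence or stray prose
    by parsing only the span from the first '{' to the last '}'.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"no JSON object in judge output: {text!r}")
    verdict = json.loads(text[start:end + 1])
    # Fail loudly if the two verdict fields are missing or not booleans,
    # so a malformed verdict never gets counted silently.
    for key in ("correct", "grounded"):
        if not isinstance(verdict.get(key), bool):
            raise ValueError(f"judge output missing boolean field {key!r}")
    return verdict
```

Swapping this in for the final `json.loads(text)` in `judge` would be one option; failing loudly on a malformed verdict is a deliberate choice here, since a silently skipped case would distort the metrics.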

The LLM judge is itself a model, so it needs verifying. Early on, read 20-30 of its verdicts yourself, and if you find patterns where it disagrees with human judgment, fix the judging prompt. Once you have made the judge trustworthy, every evaluation after that is automated.
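The calibration step can be made concrete by measuring how often the judge agrees with your own labels. This is a sketch under assumptions: `judge_agreement` is a hypothetical helper, and the human labels are stored as dicts with the same keys as the judge's verdicts.

```python
def judge_agreement(judge_verdicts: list[dict], human_verdicts: list[dict],
                    key: str = "correct") -> dict:
    """Compare judge verdicts with human labels on the same cases, in order.

    Assumes non-empty lists of equal length.
    """
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("judge and human verdict lists must be the same length")
    # Indices of the cases where the judge and the human disagree on `key`.
    disagreements = [
        i for i, (j, h) in enumerate(zip(judge_verdicts, human_verdicts))
        if j[key] != h[key]
    ]
    n = len(judge_verdicts)
    return {"agreement": 1 - len(disagreements) / n, "disagreements": disagreements}


# Made-up labels for four cases, for illustration only.
judge_out = [{"correct": True}, {"correct": True}, {"correct": False}, {"correct": True}]
human_out = [{"correct": True}, {"correct": True}, {"correct": True}, {"correct": True}]
print(judge_agreement(judge_out, human_out))
```

The list of disagreeing indices is the useful part: those are the verdicts to reread alongside their `reason` before adjusting the judging prompt. The same function works for `key="grounded"`.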

One script that runs the whole evaluation

Bundle the retrieval metrics and the generation metrics, and the evaluation pipeline is complete.

eval_pipeline.py
def evaluate(golden: list, top_k: int = 5) -> dict:
    # Retrieval metrics need no generation, so they are computed on their own.
    retrieval = eval_retrieval(golden, top_k=top_k)
    correct, grounded = 0, 0
    for case in golden:
        chunks = search(case["question"], top_k=top_k)
        answer = rag_answer(case["question"], chunks)
        verdict = judge(case, answer, chunks)
        # The verdict fields are booleans, so adding them counts the True cases.
        correct += verdict["correct"]
        grounded += verdict["grounded"]
    n = len(golden)
    return {
        **retrieval,
        "accuracy": correct / n,
        # Share of cases where the judge returned grounded == false.
        "hallucination_rate": 1 - grounded / n,
    }

print(evaluate(GOLDEN))
# {'recall@k': 0.87, 'mrr': 0.71, 'accuracy': 0.80, 'hallucination_rate': 0.03}

These four numbers are your RAG’s health report. If recall@k is low, go back to Parts 2-4 (chunking and retrieval); if recall@k is high but accuracy is low, go to Part 5 (generation); if hallucination_rate rises, return to the prompt and the citation gate. The numbers tell you where to look.
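That reading of the numbers can be written down as a small triage function. This is only a sketch: `diagnose` is a hypothetical helper, and the three thresholds are placeholder assumptions to tune for your own baseline, not recommended values.

```python
def diagnose(metrics: dict, min_recall: float = 0.8, min_accuracy: float = 0.8,
             max_hallucination: float = 0.05) -> list[str]:
    """Map the four metrics to the part of the pipeline to revisit.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    hints = []
    if metrics["recall@k"] < min_recall:
        hints.append("retrieval: revisit chunking and retrieval (Parts 2-4)")
    elif metrics["accuracy"] < min_accuracy:
        # Retrieval is fine, so the loss is happening at generation time.
        hints.append("generation: revisit Part 5")
    if metrics["hallucination_rate"] > max_hallucination:
        hints.append("hallucination: revisit the prompt and the citation gate")
    return hints


print(diagnose({"recall@k": 0.9, "mrr": 0.7, "accuracy": 0.6, "hallucination_rate": 0.1}))
```

Low accuracy is only blamed on generation when retrieval already clears its threshold, because a bad retrieval result drags accuracy down on its own.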

Operating it as a regression test

The real value of an evaluation pipeline is in repetition. Three operating tips.

  • Run it on every change. Chunking, model version, prompt, top_k — whatever you change, compare before and after. It sits in the same place as regression tests do for code.
  • Record the results. Stack up one line per run — date, what changed, the four metrics — and you get a history of which changes actually worked.
  • Grow the golden set. Add failure cases discovered in production to the golden set. A question that was once wrong stays under permanent watch, so it can never go wrong again unnoticed.
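For the "record the results" tip, an append-only JSON Lines file is one simple option. The sketch below is an assumption about how you might store the history: `record_run` and the default file name `eval_history.jsonl` are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def record_run(metrics: dict, change: str, path: str = "eval_history.jsonl") -> dict:
    """Append one line per evaluation run: date, what changed, and the metrics."""
    entry = {"date": date.today().isoformat(), "change": change, **metrics}
    # Mode "a" appends, so earlier runs are never overwritten.
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

A call such as `record_run(evaluate(GOLDEN), change="added reranker")` after each run builds the history, and the file stays easy to diff or load into a spreadsheet later.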

The cost worry dissolves once you look at the scale. Evaluating a 30-case golden set takes about 60 model calls (one answer and one judge verdict per case), so with caching and a small judge model, one run is cheap, and cheaper still next to the cost of shipping a bad change.

Where people commonly trip up

  • Trusting the judge without verification: the LLM judge gets things wrong too. Calibrate it against human verdicts early on, and keep the reason field with every verdict so you can trace the ones that went astray.
  • Editing the golden set right before evaluating: change the exam paper to fit what you are measuring, and the comparison becomes meaningless. Change the golden set and change the pipeline in separate steps.
  • Looking only at the average: even when the overall average stays flat, question groups rise and fall underneath. Splitting by question type, as we did in Part 3, catches the regressions the average hides.
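The last point, splitting by question type, can be sketched as follows. It assumes each result carries a `type` label next to the judge's `correct` verdict; that field name is hypothetical and should be replaced by whatever grouping you used in Part 3.

```python
from collections import defaultdict


def accuracy_by_type(results: list[dict]) -> dict:
    """Per-type accuracy from results shaped like {"type": str, "correct": bool}."""
    totals = defaultdict(lambda: [0, 0])  # type -> [correct count, total count]
    for r in results:
        totals[r["type"]][0] += r["correct"]
        totals[r["type"]][1] += 1
    return {t: c / n for t, (c, n) in totals.items()}


# Made-up results for illustration only.
results = [
    {"type": "lookup", "correct": True},
    {"type": "lookup", "correct": True},
    {"type": "comparison", "correct": True},
    {"type": "comparison", "correct": False},
]
print(accuracy_by_type(results))
```

Here the overall accuracy is 0.75, but the breakdown shows that every miss comes from the comparison questions, which is the kind of signal the average hides.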

Wrapping up

In this post we systematized RAG evaluation.

  • Retrieval is evaluated separately and quickly with recall@k and MRR. Order improvements show up in MRR.
  • Generation is graded by an LLM judge for accuracy and groundedness, and the hallucination rate is the complement of groundedness (1 minus the grounded share).
  • Bundle the four metrics into one script, run it as the regression test for every change, and keep adding failure cases to the golden set.

The tools and the measurement are both in place. In the final post, “Advanced RAG #7: Capstone Project — Upgrading the Document Q&A Bot,” we upgrade the bot from LLM App Development Part 13 step by step with this series’ techniques, watch how the metrics actually move, and close out the series.
