Advanced RAG #1: Start by Finding Where RAG Goes Wrong

Thursday, June 18, 2026

In Part 8 of LLM App Development we built a basic RAG pipeline, and in Part 13 we used it to complete an internal document Q&A bot. But once you actually run it, something feels off: it nails some questions and gives nonsense on others, and when you tweak the prompt to fix one question, a different one gets worse. This seven-part series is about systematically improving that “almost-good RAG.”

Here is the order: failure diagnosis and a baseline (Part 1), chunking strategies (Part 2), hybrid search (Part 3), query transformation and reranking (Part 4), reducing hallucinations with citations (Part 5), and an evaluation pipeline (Part 6). In the final Part 7 we upgrade the Q&A bot from Part 13 step by step and watch the numbers move.

The topic of this first part is not an improvement technique but diagnosis. If you start fixing without knowing where things broke, you cannot even tell whether they got better.

RAG fails in two places #

When a RAG answer is wrong, the cause falls into two broad branches.

Failure type	Symptom	Where to fix
Retrieval failure	The chunk containing the answer was never fetched	Chunking, search (Parts 2-4)
Generation failure	The right chunk was fetched, but the answer is wrong	Prompt, citations (Part 5)

This distinction matters because the remedies are opposites. Fixing the prompt for a retrieval failure is wasted effort, and swapping the embedding model for a generation failure is just as pointless. And in practice, the majority of failures you meet are retrieval failures. Part 8 made the same point in one line: if the wrong chunks come back, it does not matter how well Claude answers.

Your first diagnostic tool is your eyes #

Before any fancy tooling, the most effective diagnosis is looking at the retrieval results directly. For a question that produced a wrong answer, print the chunks the search brought back, as is.

inspect_retrieval.py

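def inspect(question: str, top_k: int = 5):
    """Print the top_k retrieved chunks for a question so you can read them."""
    chunks = search(question, top_k=top_k)   # the search function from Part 8
    print(f"Question: {question}\n")
    for i, c in enumerate(chunks, 1):
        print(f"--- Chunk {i} (similarity {c.score:.3f}) ---")
        # Preview only: print c.text in full if the answer may sit past 200 characters
        print(c.text[:200])
        print()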

inspect("How much is the refund fee?")

Just check whether the answer is inside those chunks or not, and retrieval failure versus generation failure splits on the spot. If the answer is not there, dig into retrieval; if it is there but the answer was wrong, dig into generation. Look at just ten failed questions this way and patterns start to appear. For example, retrieval may miss only on proper-noun questions, or only content that lived inside tables may go missing.
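Once you know which keywords a correct answer should contain, the same check can be roughed out in code. The sketch below is only an approximation of the eyeball test: the `Chunk` class and the `classify_failure` helper are hypothetical names that are not part of the Part 8 code, and "the answer is in the chunks" is assumed to mean "every keyword appears somewhere in the fetched text."

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for whatever search() returns in Part 8."""
    text: str
    score: float


def classify_failure(chunks: list[Chunk], answer_keywords: list[str]) -> str:
    """Label an already-wrong answer as a retrieval or a generation failure.

    Assumption: the answer counts as retrieved when every keyword appears
    somewhere in the fetched chunks (case-insensitive substring match).
    """
    retrieved_text = " ".join(c.text for c in chunks).lower()
    if all(kw.lower() in retrieved_text for kw in answer_keywords):
        return "generation failure"   # answer was fetched, the model still got it wrong
    return "retrieval failure"        # answer never reached the prompt


# Made-up chunks for illustration only
chunks = [
    Chunk("Refunds are processed within 5 business days.", 0.812),
    Chunk("Shipping costs are covered for large orders.", 0.640),
]
print(classify_failure(chunks, ["10%", "fee"]))
```

A keyword match can be fooled by paraphrases, so treat the label as a first guess and still read the chunks for any case that looks odd.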

Building a golden set #

Eyeball diagnosis is fast, but to confirm that an improvement is a real improvement, you have to keep measuring with the same exam paper. That exam paper is a golden set: a bundle of evaluation data where the correct answers are fixed in advance. For RAG, you record the question, the answer, and the source of the chunk that contains the answer.

golden_set.py

GOLDEN = [
    {
        "question": "How much is the refund fee?",
        "answer_keywords": ["10%", "fee"],          # content the answer must contain
        "source_doc": "refund-policy.md",            # the document holding the answer chunk
    },
    {
        "question": "How many vacation days in the first year?",
        "answer_keywords": ["11 days"],
        "source_doc": "hr-handbook.md",
    },
    # 20-30 picked from real user questions is enough to start
]

It matters that you pick questions from real user logs rather than inventing them in your head. Invented questions resemble the document’s wording, so retrieval succeeds too easily; real questions use different vocabulary from the documents, which makes them hard. A hard exam is a good exam.
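One rough way to see the difference between invented and real questions is to measure how many of a question's words also appear in the source document. This is a hypothetical heuristic, not something from the earlier parts of the series: `token_overlap` is an illustrative helper, and the document and questions below are made up.

```python
import re


def token_overlap(question: str, document_text: str) -> float:
    """Fraction of the question's distinct words that also occur in the document."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    d_tokens = set(re.findall(r"\w+", document_text.lower()))
    if not q_tokens:
        return 0.0
    return len(q_tokens & d_tokens) / len(q_tokens)


# Made-up examples for illustration only
doc = "A refund fee of 10% is charged on cancellations made after shipping."
invented = "What refund fee is charged on cancellations?"
from_logs = "How much do I lose if I send it back?"

print(f"invented:  {token_overlap(invented, doc):.2f}")
print(f"from logs: {token_overlap(from_logs, doc):.2f}")
```

A question whose overlap is close to 1.0 is probably echoing the document and will tell you little; the low-overlap questions are the ones worth keeping.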

Measuring a baseline #

With a golden set, you can measure two numbers: whether retrieval fetched the answer document (retrieval hit rate), and whether the answer contains the correct content (answer accuracy).

baseline.py

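def measure(golden: list[dict], top_k: int = 5) -> tuple[float, float]:
    """Return (retrieval hit rate, answer accuracy) over the golden set."""
    if not golden:
        raise ValueError("golden set is empty")   # avoid dividing by zero below
    retrieval_hits, answer_hits = 0, 0
    for case in golden:
        # Retrieval hit: at least one fetched chunk comes from the expected document
        chunks = search(case["question"], top_k=top_k)
        if any(c.source == case["source_doc"] for c in chunks):
            retrieval_hits += 1
        # Answer hit: every keyword appears in the answer (case-sensitive substring match)
        answer = rag_answer(case["question"])     # the RAG answer function from Part 8
        if all(kw in answer for kw in case["answer_keywords"]):
            answer_hits += 1
    n = len(golden)
    return retrieval_hits / n, answer_hits / n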

retrieval, answer = measure(GOLDEN)
print(f"Retrieval hit rate: {retrieval:.0%}  Answer accuracy: {answer:.0%}")

Grading answers by keyword inclusion is crude: it is an exact, case-sensitive substring match, so an answer that says “eleven days” instead of “11 days” is marked wrong. For a baseline, though, it is enough. More refined grading (an LLM judge) comes in Part 6. What matters right now is that you have two numbers. With a baseline like “retrieval hit rate 70%, answer accuracy 55%”, every future improvement can be judged against those numbers.

The gap between the two numbers is information too. If the retrieval hit rate is high but answer accuracy is far below it, the generation side is the bigger problem; if both are low, retrieval is what to fix first, because the model cannot answer from chunks it never received.
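To see where that gap comes from, it can help to break the result down per case instead of looking only at the two totals. The sketch below assumes you have recorded two booleans for each golden-set case; the `breakdown` helper and the record keys `"retrieved"` and `"correct"` are hypothetical and do not appear in the code above.

```python
from collections import Counter


def breakdown(results: list[dict]) -> Counter:
    """Count golden-set cases by outcome.

    Each result is a hypothetical per-case record with two booleans:
    "retrieved" (the source document was fetched) and "correct"
    (the answer contained every keyword).
    """
    counts = Counter()
    for r in results:
        if r["correct"]:
            counts["pass"] += 1                 # counted as a pass even if retrieval missed
        elif r["retrieved"]:
            counts["generation failure"] += 1   # right chunk, wrong answer
        else:
            counts["retrieval failure"] += 1    # answer chunk never fetched
    return counts


# Made-up results for illustration only
results = [
    {"retrieved": True, "correct": True},
    {"retrieved": True, "correct": False},
    {"retrieved": False, "correct": False},
    {"retrieved": False, "correct": False},
]
for outcome, n in breakdown(results).most_common():
    print(f"{outcome}: {n}")
```

Whichever failure bucket is largest is a reasonable place to start.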

Where people commonly trip #

Judging improvement by feel — Throwing a few questions at it and concluding “seems better” misses the things that got worse out of sight. Compare before and after on the same golden set.
Filling the golden set with easy questions — Questions that copy the document’s sentences almost verbatim always pass, so they detect no change. Pick from real logs, especially the questions that failed.
Treating retrieval and generation as one lump — If you stop at “RAG got it wrong,” you cannot decide what to fix. Always open the retrieval results and pin down the failure point.
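The first pitfall is easier to avoid if you compare runs question by question rather than by the overall score. A minimal sketch, assuming each run is stored as a mapping from question to pass/fail; `compare_runs` and that storage format are assumptions for illustration, not part of the code in this series.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Diff per-question pass/fail results from two runs on the same golden set.

    A question missing from `after` is treated as failed.
    """
    fixed = [q for q in before if not before[q] and after.get(q, False)]
    regressed = [q for q in before if before[q] and not after.get(q, False)]
    return {"fixed": fixed, "regressed": regressed}


# Made-up results: accuracy is 50% in both runs, yet the two runs are not the same
before = {
    "How much is the refund fee?": False,
    "How many vacation days in the first year?": True,
}
after = {
    "How much is the refund fee?": True,
    "How many vacation days in the first year?": False,
}

diff = compare_runs(before, after)
print("Fixed:", diff["fixed"])
print("Regressed:", diff["regressed"])
```

In this example the overall accuracy does not move at all, which is exactly the kind of change a quick “seems better” check would miss.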

Wrapping up #

In this post we covered the starting point of RAG improvement: diagnosis and measurement.

Failures split into retrieval failures and generation failures, and the remedies differ. Opening the retrieval results yourself is the first diagnosis.
Build a golden set from real user questions, and measure a baseline with two numbers: retrieval hit rate and answer accuracy.
Every improvement from here on is judged by comparison against this baseline.

Now that we have a baseline, the fixing begins. The most common root of retrieval failure is not the search algorithm but the stage before it. In the next post, “Advanced RAG #2: Chunking Strategies That Decide Retrieval Quality”, we start by reworking how documents get split.