Advanced RAG #7: Capstone Project — Upgrading the Document Q&A Bot


By Part 6, we had gathered all the tools for diagnosis, improvement, and measurement. In this final post, we take the internal document Q&A bot built in LLM App Development Part 13 and upgrade it with the techniques from this series. We do not change everything at once. Applying one change at a time and measuring after each one is the whole point of this post, because that is what improving RAG looks like in real work.

The starting point — the Part 13 bot and its baseline #

By the standards of this series, the Part 13 bot is the simplest possible setup: fixed-size chunking, a single vector search, no source attribution. First, following Part 1, we build a golden set of 30 questions from real query logs, and measure the baseline with the evaluation pipeline from Part 6.

step0_baseline.py
result = evaluate(GOLDEN, top_k=5)
# {'recall@k': 0.63, 'mrr': 0.44, 'accuracy': 0.57, 'hallucination_rate': 0.13}

These numbers come from this example document set; on your data they will be different. What matters is not the absolute values but how to read them. A recall@k of 0.63 means that for more than one in three questions the correct chunk never reaches the top 5, so fixing retrieval comes first. A hallucination rate of 13% is high enough that a bot without sources will struggle to earn trust.
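To make the two retrieval numbers concrete, here is a minimal sketch of how they could be computed. It assumes recall@k is counted per question as a hit or a miss (the exact definitions are in Part 6 and may differ), and the function names and toy data are illustrative only.

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """1.0 if any correct chunk is in the top k, else 0.0 (hit-or-miss per question)."""
    return 1.0 if any(cid in gold_ids for cid in retrieved_ids[:k]) else 0.0


def reciprocal_rank(retrieved_ids, gold_ids, k=5):
    """1/rank of the first correct chunk in the top k, or 0.0 if there is none."""
    for rank, cid in enumerate(retrieved_ids[:k], start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0


def retrieval_metrics(results, k=5):
    """results: one (retrieved_ids, gold_ids) pair per golden-set question."""
    n = len(results)
    return {
        "recall@k": sum(recall_at_k(r, g, k) for r, g in results) / n,
        "mrr": sum(reciprocal_rank(r, g, k) for r, g in results) / n,
    }


# Toy data (hypothetical chunk ids): one hit at rank 1, one at rank 3, one miss.
toy = [
    (["c1", "c2", "c3"], {"c1"}),
    (["c4", "c5", "c6"], {"c6"}),
    (["c7", "c8", "c9"], {"c0"}),
]
metrics = retrieval_metrics(toy, k=5)
print(metrics)
```

Under this definition, 0.63 on a 30-question golden set would correspond to 19 questions with a hit and 11 without.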

Step 1 — Swapping the chunking #

We switch to the structure-based chunking from Part 2. Since the internal documents are Markdown, we split at heading boundaries, keep tables intact, attach source and section metadata to every chunk, and rebuild the index.

step1_chunking.py
chunks = []
for doc in load_documents("docs/"):
    for c in chunk_by_heading(doc.text, max_chars=1500):
        chunks.append({"text": c, "metadata": {"source": doc.name, "section": heading_of(c)}})
rebuild_index(chunks)

# re-evaluation: recall@k 0.63 → 0.77, accuracy 0.57 → 0.67

We did not change a single line of the search algorithm, yet recall jumped the most of any step. The failures from diagnosis where “a table was cut in half and both pieces were useless” are gone. This is why chunking sits at the front of the series: if you do not fix the foundation, the later techniques cannot deliver their full effect.
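The `chunk_by_heading` and `heading_of` helpers come from Part 2 and are not shown in this post. As a rough sketch of the idea, and not the Part 2 implementation, the version below starts a new section at every Markdown heading and splits oversized sections only at blank lines. Because a Markdown table has no blank lines inside it, it is never cut. The regular expression and the splitting rule are assumptions made for illustration.

```python
import re

# A Markdown ATX heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_heading(text: str, max_chars: int = 1500) -> list[str]:
    # 1) Start a new section at every heading line.
    sections, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # 2) Split oversized sections at blank lines only, so tables stay whole.
    chunks = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        buf = ""
        for block in re.split(r"\n\s*\n", section):
            candidate = f"{buf}\n\n{block}" if buf else block
            if buf and len(candidate) > max_chars:
                chunks.append(buf)
                buf = block
            else:
                buf = candidate  # a single block longer than max_chars is kept whole
        if buf:
            chunks.append(buf)
    return chunks


def heading_of(chunk: str) -> str:
    """Heading text if the chunk starts with a heading, else an empty string."""
    first = chunk.splitlines()[0] if chunk else ""
    return first.lstrip("#").strip() if HEADING.match(first) else ""


sample = (
    "# Intro\nHello.\n\n"
    "## Prices\n| code | price |\n|---|---|\n| A-1 | 10 |\n| B-2 | 20 |\n\nNotes."
)
sample_chunks = chunk_by_heading(sample, max_chars=60)
for c in sample_chunks:
    print(repr(heading_of(c)), len(c))
```

Two limitations of this sketch are worth knowing: a `#` line inside a fenced code block would be mistaken for a heading, and the later pieces of a split section no longer start with their heading, so `heading_of` returns an empty string for them. A fuller version would carry the section heading forward into the metadata.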

Step 2 — Hybrid search #

Splitting the golden set by question type shows that most of the remaining retrieval failures are questions about product codes and internal abbreviations. We add the BM25 + RRF combination from Part 3.

step2_hybrid.py
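def search(question: str, top_k: int = 5) -> list:
    # Pull 20 candidates from each retriever; both are expected to return chunk ids (indices into chunks).
    vec = vector_search_ids(question, top_k=20)
    kw = keyword_search(question, top_k=20)
    # Fuse the two rankings with RRF and map the winning ids back to chunks.
    return [chunks[i] for i in rrf([vec, kw], top_k=top_k)]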

# re-evaluation: recall@k 0.77 → 0.87 (proper-noun question group 0.50 → 0.90)

The per-group numbers tell us more than the overall ones. The descriptive question group barely moved, while the proper-noun question group rose sharply. That is confirmation that the technique we introduced is working exactly where it was aimed.
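The `rrf` helper called in step2_hybrid.py is the one from Part 3 and is not reproduced here. A minimal sketch of reciprocal rank fusion, assuming each input is a list of chunk ids ordered best first, looks like this. The constant `k=60` is a commonly used default, not necessarily the value used in Part 3.

```python
from collections import defaultdict


def rrf(rankings, top_k=5, k=60):
    """Fuse several ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda i: scores[i], reverse=True)
    return ordered[:top_k]


# Toy rankings (hypothetical chunk ids) from the vector and keyword retrievers.
vec = [3, 7, 1, 9]
kw = [7, 4, 3, 8]
fused = rrf([vec, kw], top_k=3)
print(fused)  # [7, 3, 4]
```

Ids 7 and 3 appear in both lists, so they outrank everything that only one retriever found. That is the behavior the proper-noun group benefits from: a product code that keyword search ranks highly still reaches the top 5 even when vector search ranks it low or misses it.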

Step 3 — Reranking #

Recall has come up, but MRR is not following. That means the correct chunk is within the top 5 but often stuck at rank 4 or 5. We attach the cross-encoder reranking from Part 4: widen the candidates to 30 and let the reranker narrow them to 5.

step3_rerank.py
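def search(question: str, top_k: int = 5) -> list:
    # hybrid_search is the Step 2 search(), kept under a new name so this function can replace it.
    candidates = hybrid_search(question, top_k=30)
    # Score every (question, chunk) pair with the cross-encoder.
    pairs = [(question, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort by score, highest first, and keep the best top_k chunks.
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]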

# re-evaluation: mrr 0.58 → 0.74, accuracy 0.70 → 0.77

As MRR rose, accuracy followed. With the correct chunk sitting clearly near the top of the context, generation improves too — evidence that retrieval and generation are connected. Since this is an internal bot, we accept the sub-second latency reranking adds. That decision is itself a trade-off; a latency-sensitive service might reach a different conclusion.
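To see why reranking moves MRR without touching recall, the sketch below runs the same sort-by-score step with a stand-in scorer. The token-overlap function is only a placeholder so the example runs without a model; it is not a cross-encoder, and the chunk texts and ids are made up.

```python
def overlap_score(question: str, text: str) -> float:
    """Placeholder scorer: share of question tokens that also appear in the text."""
    q = set(question.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(question, candidates, score_fn, top_k=5):
    """Score each candidate against the question and keep the top_k best."""
    scored = [(score_fn(question, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def reciprocal_rank(chunks, gold_id):
    for rank, c in enumerate(chunks, start=1):
        if c["id"] == gold_id:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval result: the correct chunk "d" is found, but at rank 4.
candidates = [
    {"id": "a", "text": "general onboarding guide"},
    {"id": "b", "text": "office seating policy"},
    {"id": "c", "text": "travel expense overview"},
    {"id": "d", "text": "refund policy for product px-200"},
]
question = "refund policy for px-200"

before = reciprocal_rank(candidates, "d")
after = reciprocal_rank(rerank(question, candidates, overlap_score), "d")
print(before, after)  # 0.25 1.0
```

The chunk was in the top 5 both times, so recall for this question is unchanged. Only its position moved, which is exactly the pattern in the Step 3 numbers.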

Step 4 — Citations and the right to say “I don’t know” #

Finally we apply Part 5. We add the grounding constraint and the “right to say I don’t know” to the system prompt, convert the chunks into document blocks to turn on citations, and attach a citation-ratio gate.

step4_citations.py
def qa(question: str) -> str:
    chunks = search(question, top_k=5)
    response = answer_with_citations(question, chunks)   # implementation from Part 5
    if not is_grounded(response):
        return "I couldn't find enough relevant documents. Could you rephrase the question?"
    return render(response)

# re-evaluation: hallucination_rate 0.10 → 0.03, accuracy 0.77 → 0.80

The hallucination rate dropped into the low single digits, and with a source attached to every answer, users can now verify answers for themselves. From the operator’s perspective, the logs of “couldn’t find” responses become new diagnostic material: those questions point to the next documents to add and are the next golden-set candidates.
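The `is_grounded` gate is implemented in Part 5 and is not shown in this post. As a sketch of what a citation-ratio gate could look like, assume the response has already been parsed into a list of text blocks, each with a (possibly empty) list of citations. That shape, the function names, and the 0.5 threshold are all assumptions for illustration.

```python
def citation_ratio(blocks) -> float:
    """Share of non-empty answer blocks that carry at least one citation."""
    answer_blocks = [b for b in blocks if b["text"].strip()]
    if not answer_blocks:
        return 0.0
    cited = sum(1 for b in answer_blocks if b.get("citations"))
    return cited / len(answer_blocks)


def is_grounded(blocks, min_ratio: float = 0.5) -> bool:
    """Pass the answer through only if enough of it is backed by a source."""
    return citation_ratio(blocks) >= min_ratio


# Hypothetical parsed responses.
grounded = [
    {"text": "Refunds take 5 business days.", "citations": ["refund.md#Timeline"]},
    {"text": "Contact support to start one.", "citations": ["refund.md#How"]},
]
ungrounded = [
    {"text": "Probably about a week.", "citations": []},
    {"text": "It depends on the product.", "citations": []},
]
print(is_grounded(grounded), is_grounded(ungrounded))  # True False
```

The threshold is a tuning knob. Raising it trades more “couldn’t find” responses for fewer unsupported sentences, so it is worth checking against the golden set like any other change.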

The full journey #

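| Step | recall@5 | MRR | accuracy | hallucination rate |
| --- | --- | --- | --- | --- |
| Baseline (Part 13 bot) | 0.63 | 0.44 | 0.57 | 0.13 |
| + structure-based chunking | 0.77 | 0.55 | 0.67 | 0.13 |
| + hybrid search | 0.87 | 0.58 | 0.70 | 0.10 |
| + reranking | 0.87 | 0.74 | 0.77 | 0.10 |
| + citations and gate | 0.87 | 0.74 | 0.80 | 0.03 |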

Knowing how to read this table is the summary of the whole series. No single step delivered the whole improvement. Each technique moved the metric it was aimed at the most, and because we measured after every change, we could see that. To go further from here, the query transformation from Part 4 (handling conversational questions) is the next candidate, and if the documents keep growing, automating the Part 6 evaluation as a scheduled run is the next step in operations.
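One way to read the table mechanically is to compute, for each step, which metric changed the most compared with the row above. The short sketch below does that with the numbers from this post; the helper name is illustrative.

```python
METRICS = ("recall@5", "MRR", "accuracy", "hallucination rate")

# Rows copied from the table above.
JOURNEY = [
    ("Baseline (Part 13 bot)", 0.63, 0.44, 0.57, 0.13),
    ("+ structure-based chunking", 0.77, 0.55, 0.67, 0.13),
    ("+ hybrid search", 0.87, 0.58, 0.70, 0.10),
    ("+ reranking", 0.87, 0.74, 0.77, 0.10),
    ("+ citations and gate", 0.87, 0.74, 0.80, 0.03),
]


def biggest_move(prev, curr):
    """Return (metric, delta) for the metric that changed most between two rows."""
    deltas = {m: round(c - p, 2) for m, p, c in zip(METRICS, prev[1:], curr[1:])}
    metric = max(deltas, key=lambda m: abs(deltas[m]))
    return metric, deltas[metric]


moves = {curr[0]: biggest_move(prev, curr) for prev, curr in zip(JOURNEY, JOURNEY[1:])}
for step, (metric, delta) in moves.items():
    print(f"{step}: {metric} {delta:+.2f}")
```

The largest change at each step is recall@5 for chunking and hybrid search, MRR for reranking, and hallucination rate for citations, which matches what each technique was introduced to fix.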

Closing the series #

Looking back over the seven posts:

  • Improvement starts with diagnosis. Split failures into retrieval and generation, and build a baseline with a golden set (Part 1).
  • The foundation of retrieval quality is chunking. Cut along document structure and keep tables intact (Part 2).
  • Combine vector and keyword search with RRF so each covers the other’s weaknesses (Part 3).
  • Refine questions with rewriting, and narrow broadly retrieved candidates with reranking (Part 4).
  • Give generation a grounding constraint and the right to say “I don’t know”, and attach verifiable sources with citations (Part 5).
  • Run the four metrics — recall@k, MRR, accuracy, hallucination rate — as a regression test (Part 6).
  • And apply changes one at a time, always with measurement (Part 7).
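The regression-test idea from the Part 6 bullet can be sketched as a simple comparison against the last accepted metrics. This is an illustrative check, not the Part 6 pipeline: the function name and the tolerance are assumptions, and the candidate numbers are invented to show a failure.

```python
# Direction of "better" for each metric reported by evaluate().
HIGHER_IS_BETTER = {
    "recall@k": True,
    "mrr": True,
    "accuracy": True,
    "hallucination_rate": False,
}


def regressions(baseline, current, tolerance=0.02):
    """Return {metric: delta} for every metric that got worse by more than tolerance."""
    failed = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = current[metric] - baseline[metric]
        worse_by = -delta if higher_is_better else delta
        if worse_by > tolerance:
            failed[metric] = round(delta, 2)
    return failed


# Last accepted state: the final row of the journey table.
accepted = {"recall@k": 0.87, "mrr": 0.74, "accuracy": 0.80, "hallucination_rate": 0.03}
# Hypothetical result after some later change.
candidate = {"recall@k": 0.87, "mrr": 0.70, "accuracy": 0.80, "hallucination_rate": 0.03}

print(regressions(accepted, candidate))  # {'mrr': -0.04}
```

On a 30-question golden set a single question is worth about 0.03, so a tolerance of 0.02 flags any metric that loses even one question. A larger golden set allows a finer threshold.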

RAG is not a system you build once and leave alone; it is a system you keep tending while the documents and the questions change. With measurement, that process becomes engineering instead of guesswork. I hope this series serves as that turning point. Thank you for following along.
