LLM App Development #8: Building a RAG Pipeline


In Part 7 we learned how to find documents related to a question. Now we hand the found documents to Claude and have it answer based on them. This approach is called RAG (Retrieval-Augmented Generation). It is the core structure of apps such as a chatbot that answers from our own documents, or an internal knowledge search.

The RAG flow

RAG moves in three steps.

  1. Retrieval — find documents similar to the question with vector search.
  2. Augmentation — place the found documents in the prompt along with the question.
  3. Generation — Claude answers grounded in those documents.

The core idea is simple. Claude does not know our documents, but if we put the documents in the prompt, it can read them and answer. The vector search from Part 7 was step 1, picking “the related documents worth putting in the prompt.”

Splitting documents into pieces

Putting a long document in whole has two problems: it costs many tokens, and parts unrelated to the question mix in and blur the answer. So we split the document into small pieces (chunks) and embed each piece. Ideally the split follows units of meaning such as paragraphs; the simple version here cuts at a fixed character count. Search then happens at the chunk level, not over the whole document.

chunking.py
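def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # overlap must be smaller than size, or start would never advance
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this chunk reached the end; skip a redundant tail chunk
        start = end - overlap  # overlap a little so context is not cut off between chunks
    return chunks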

Chunks overlap slightly so that a sentence cut at a boundary is more likely to appear whole in at least one chunk. The overlap must be smaller than the chunk size, otherwise the loop never moves forward. Tune chunk size and overlap to the nature of the document.
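One way to tune chunking to the document is to break on sentence boundaries instead of fixed character offsets, so no sentence is cut in half. The following is a minimal sketch and not part of the pipeline in this post: `chunk_by_sentence`, `max_chars`, and `overlap_sentences` are hypothetical names, and the regex split is a rough heuristic that assumes English-style punctuation.

```python
import re

def chunk_by_sentence(text: str, max_chars: int = 300, overlap_sentences: int = 1) -> list[str]:
    # Rough heuristic: split after ., !, or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so neighbouring chunks share context.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Here `max_chars` is a soft limit: a single sentence longer than the limit still becomes its own oversized chunk, and carried-over sentences can push a chunk slightly past it.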

Connecting retrieval and generation

Connecting Claude’s generation to the vector search from Part 7 completes RAG. Put the found chunks in the prompt and instruct it to “answer based only on this material.”

rag.py
import numpy as np
import anthropic
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

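# embed the chunks in advance (chunks is the list made by chunk_text)
# normalize so the dot product below equals cosine similarity
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 3):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q  # one similarity score per chunk
    ranked = np.argsort(scores)[::-1][:top_k]  # indices of the top_k highest scores
    return [chunks[i] for i in ranked]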

def answer(query: str) -> str:
    found = retrieve(query)
    context = "\n\n".join(f"<doc>{c}</doc>" for c in found)

    prompt = f"""Answer the question based only on the content inside <docs>.
If the answer is not in the docs, reply "Not found in the documents."

<docs>
{context}
</docs>

Question: {query}"""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return next(b.text for b in response.content if b.type == "text")

print(answer("Within how many days must I request a refund?"))

When a question arrives, we find three related chunks, wrap each one in a <doc> tag inside a <docs> block (the approach from Part 4), and put them in the prompt. Claude answers looking only at that material. The instruction to answer “not found” rather than make something up for questions not in the material is important. This one line goes a long way toward reducing hallucination.

Why use RAG

You might wonder why not just put the entire document set in the prompt every time. If there are few documents, you can. But with hundreds of documents the token cost becomes unbearable, and unrelated content drags down answer quality too. RAG puts in only the related chunks per question, so no matter how many documents there are, the prompt stays small.
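To get a feel for the difference in scale, here is a rough back-of-the-envelope estimate. Everything in it is an assumption for illustration: the corpus size is hypothetical, and the figure of about 4 characters per token is only a crude rule of thumb for English text, so real counts depend on the tokenizer and the language.

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    # Crude estimate only; chars_per_token is an assumed rule of thumb.
    return round(num_chars / chars_per_token)

# Hypothetical corpus: 500 documents of about 2,000 characters each.
full_corpus = estimate_tokens(500 * 2_000)

# RAG context: top_k = 3 chunks of 300 characters, the defaults used in this post.
rag_context = estimate_tokens(3 * 300)

print(full_corpus, rag_context)
```

Under these assumptions the whole corpus comes to roughly 250,000 tokens per request, while the retrieved context is about 225 tokens and stays that size however many documents are added.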

One more thing: when documents change, you only re-embed the affected chunks. There is no need to retrain a model. Being easy to keep current is another big advantage of RAG.
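Re-embedding only what changed can be sketched with a cache keyed by a hash of each chunk's text. This is a minimal sketch under assumptions: `chunk_key`, `sync_embeddings`, and the `cache` dictionary are hypothetical names, and `embed_fn` stands in for whatever embedding call you use (for example `embedder.encode` from rag.py).

```python
import hashlib
from typing import Callable, Sequence

Vector = Sequence[float]

def chunk_key(chunk: str) -> str:
    # Identical text always maps to the same key, so unchanged chunks are recognised.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def sync_embeddings(
    chunks: list[str],
    cache: dict[str, Vector],
    embed_fn: Callable[[list[str]], Sequence[Vector]],
) -> list[Vector]:
    """Return one vector per chunk, embedding only chunks missing from the cache."""
    keys = [chunk_key(c) for c in chunks]
    missing: dict[str, str] = {}  # key -> chunk text, de-duplicated
    for key, chunk in zip(keys, chunks):
        if key not in cache:
            missing[key] = chunk
    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vector in zip(missing.keys(), vectors):
            cache[key] = vector
    # Drop entries for chunks that no longer exist in the documents.
    live = set(keys)
    for stale in [k for k in cache if k not in live]:
        del cache[stale]
    return [cache[k] for k in keys]
```

After a document edit, calling `sync_embeddings` again with the new chunk list sends only the new or changed chunks to `embed_fn`. In a real app the cache would be persisted between runs, which this sketch leaves out.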

Improving retrieval quality

When a RAG answer is poor, the cause is more often retrieval than generation. If it pulls the wrong chunks, it does not matter how well Claude answers. Here are a few ways to improve retrieval quality.

  • Tuning chunk size — the first thing to adjust. If answers typically fit in one paragraph, size the chunks to match a paragraph.
  • Number retrieved (top_k) — too few and you miss the right chunk; too many and unrelated content mixes in. Usually start at 3 to 5 and adjust.
  • Mixing with keyword search — semantic search links “refund” and “payment cancellation,” but is weak at things that must match exactly, like product codes or proper nouns. Using vector and keyword search together (hybrid search) covers each other’s weaknesses.
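Hybrid search from the last item can be sketched by merging two rankings with reciprocal rank fusion (RRF), where each ranking contributes 1 / (k + rank) to an item's score. This is an illustrative sketch, not the method used elsewhere in this series: `keyword_rank` and `rrf_fuse` are hypothetical helpers, the keyword scoring is deliberately naive (a production system would more likely use something like BM25), and the vector ranking is assumed to be a list of chunk indices ordered best first.

```python
import re

def keyword_rank(query: str, chunks: list[str]) -> list[int]:
    # Naive keyword scoring: count how many distinct query tokens appear in each chunk.
    tokens = set(re.findall(r"\w+", query.lower()))
    scored = []
    for i, chunk in enumerate(chunks):
        words = set(re.findall(r"\w+", chunk.lower()))
        scored.append((len(tokens & words), i))
    # Best match first; chunks with no matching token are left out.
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

def rrf_fuse(rankings: list[list[int]], k: int = 60, top_k: int = 3) -> list[int]:
    # Reciprocal rank fusion: sum 1 / (k + rank) over every ranking an item appears in.
    fused: dict[int, float] = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda i: (-fused[i], i))[:top_k]
```

A chunk that contains an exact product code can then rise to the top even when the vector ranking placed it last, because it scores in both lists.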

Showing the source alongside the answer also builds trust. Displaying which chunk an answer was based on lets users verify it and makes hallucinations easy to catch.
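One possible way to carry sources through the pipeline is to return each chunk's index and score along with its text, and to tag each chunk in the prompt. The sketch below makes several assumptions: `top_k_with_sources` and `format_context` are hypothetical helpers, `scores` is a sequence with one similarity score per chunk (such as the `scores` array in rag.py), and the `id` attribute on `<doc>` is an addition that the original prompt does not have.

```python
from typing import Sequence

def top_k_with_sources(scores: Sequence[float], chunks: list[str], top_k: int = 3) -> list[dict]:
    # Sort chunk indices by similarity score, highest first.
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [{"index": i, "score": float(scores[i]), "text": chunks[i]} for i in ranked]

def format_context(sources: list[dict]) -> str:
    # Tag each chunk with its index so the UI can show which chunks were used.
    return "\n\n".join(f'<doc id="{s["index"]}">{s["text"]}</doc>' for s in sources)
```

The list returned by `top_k_with_sources` can be shown next to the answer, so a user can open the exact chunks the answer was grounded in.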

Where people commonly trip up

  • Chunks too large or too small — too large and unrelated content mixes in; too small and context breaks. Tune size to the document, and when search results are off, suspect chunk size first.
  • Omitting the grounding instruction — without “answer based only on the material,” Claude may ignore the material and answer from learned knowledge, mixing in hallucination.
  • Not checking retrieval quality — when an answer is odd, the cause is often retrieval rather than generation: the wrong chunks were fetched in the first place. Check what retrieve brought back first.
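Checking retrieval can be made routine with a small set of questions whose answers you already know. The following is a minimal sketch under assumptions: `hit_rate` is a hypothetical helper, `labeled` is a hand-written list of (question, expected snippet) pairs, and `retrieve_fn` is any function that returns a list of chunk strings, such as `retrieve` from rag.py.

```python
from typing import Callable

def hit_rate(
    labeled: list[tuple[str, str]],
    retrieve_fn: Callable[[str], list[str]],
) -> float:
    """Fraction of questions whose expected snippet appears in a retrieved chunk."""
    if not labeled:
        return 0.0
    hits = 0
    for question, expected in labeled:
        found = retrieve_fn(question)
        if any(expected in chunk for chunk in found):
            hits += 1
        else:
            # Print misses so the wrongly retrieved chunks can be inspected by hand.
            print(f"MISS: {question!r} -> {found}")
    return hits / len(labeled)
```

Running this before and after changing chunk size or top_k gives a quick signal of whether retrieval improved, without calling Claude at all.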

Wrapping up

In this post we connected retrieval and generation to build a RAG pipeline.

  • RAG produces answers grounded in our documents through the three steps of retrieval, augmentation, and generation.
  • Long documents are split into chunks and embedded, and only the related chunks are put in the prompt per question.
  • The instruction to “answer only from the material” reduces hallucination, and the prompt stays small even with many documents.

So far a single question and answer has been the focus. In the next post, “LLM App Development #9: Conversation Memory and Context Management,” we will cover how to handle the history that piles up as a conversation grows, and how to continue within the context limit.
