Advanced RAG #4: Query Transformation and Reranking


Through Part 3 we refined the chunks and the retriever. This time we look at both ends of the retrieval pipeline. At the front, we transform the user’s question into a form that searches well (query transformation); at the back, we precisely narrow down a widely fetched pool of candidates (reranking). Both techniques leave the retriever itself alone and only touch its input and output, so they slot easily into an existing pipeline.

A user’s question is not a search query

Some retrieval failures start on the question side. Real user questions look like this:

  • “So until when can I do that?” — without the conversation context, you cannot even tell what is being asked.
  • “I want a refund, I paid by card, and what happens if it’s a partial cancellation?” — several questions stacked into one sentence.

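Embedding these questions as-is makes retrieval shaky: the first has no topic words for the embedding to latch onto, and the second blends several topics into a single vector. So we add a step that transforms the question before searching.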

Query rewriting: fill in the context to make a standalone question

In chatbot-style RAG, the transformation with the biggest payoff is rewriting the question with the conversation context folded in. A fast, cheap model is plenty, so it can be a different model from the one producing the main answer.

query_rewrite.py
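import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rewrite_query(history: list[dict], question: str) -> str:
    """Rewrite the latest question as a standalone search query."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        system=(
            "Using the conversation context, rewrite the last question as a single "
            "standalone question for search. Replace pronouns with their actual "
            "referents. Do not answer; output only the question."
        ),
        # history holds the earlier turns; the new question goes last
        messages=history + [{"role": "user", "content": question}],
    )
    # Take the first text block of the reply
    return next(b.text for b in response.content if b.type == "text").strip()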

# "So until when can I do that?" → "Within how many days after payment can an order be canceled?"

The rewritten question is used only for retrieval; for generating the answer, the original question and conversation are used as-is. It is an auxiliary step for search, not a step that changes the conversation.
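To make that separation concrete, here is a minimal sketch of one chat turn. The function name `answer_turn` and the three injected callables are hypothetical placeholders, not part of the pipeline above; the only point is which string goes where.

```python
from typing import Callable

def answer_turn(
    history: list[dict],
    question: str,
    rewrite: Callable[[list[dict], str], str],
    search: Callable[[str], list[str]],
    generate: Callable[[list[dict], str, list[str]], str],
) -> str:
    """One chat turn: the rewritten question feeds retrieval only."""
    # The standalone rewrite exists only to build a good search query.
    search_query = rewrite(history, question)
    retrieved = search(search_query)
    # Generation sees the user's original wording and the real conversation.
    return generate(history, question, retrieved)
```

With `rewrite_query` plugged in as `rewrite`, the rewritten text never reaches the answer prompt, so the model still answers the question the user actually typed.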

Multi-query: one question from several angles

When the vocabulary of the documents and the question differ (the case from Part 1’s diagnosis where descriptive questions failed often), multi-query works well: generate several differently worded versions of the same question and search with all of them.

multi_query.py
# Assumes `client` (anthropic.Anthropic()) as in query_rewrite.py, and
# `vector_search_ids`, `rrf`, and `chunks` from Part 3.

def expand_queries(question: str, n: int = 3) -> list[str]:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system=f"Rewrite the question into {n} differently worded versions for document search. One per line.",
        messages=[{"role": "user", "content": question}],
    )
    text = next(b.text for b in response.content if b.type == "text")
    variants = [q.strip() for q in text.splitlines() if q.strip()]
    return [question] + variants[:n]   # keep the original, cap at n variants

def multi_query_search(question: str, top_k: int = 5) -> list:
    # One ranked ID list per query variant, 20 candidates each
    rankings = [vector_search_ids(q, top_k=20) for q in expand_queries(question)]
    return [chunks[i] for i in rrf(rankings, top_k=top_k)]   # reuse RRF from Part 3

To merge the results of each query, we reuse the RRF from Part 3 as-is. Since extra calls add latency and cost, the balanced choice is to apply it only to the group of questions that rewriting alone could not fix, rather than turning it on for every question.
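Part 3's `rrf` is not reproduced in this post, so here is a minimal sketch of reciprocal rank fusion that fits the `rrf(rankings, top_k=...)` call above. The smoothing constant `k=60` is a commonly used default and an assumption here; the real Part 3 version may differ in details.

```python
from collections import defaultdict

def rrf(rankings: list[list[int]], top_k: int = 5, k: int = 60) -> list[int]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return ordered[:top_k]

# Chunk 2 appears near the top of all three lists, so it wins the fusion.
print(rrf([[1, 2, 3], [2, 3, 4], [2, 1, 5]], top_k=3))  # → [2, 1, 3]
```

A chunk that several query variants agree on rises to the top even if no single variant ranked it first, which is the behavior multi-query relies on.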

Reranking: fetch wide, then narrow precisely

Now the back end. Embedding search has a structural limitation: it turns the question and the document into vectors separately and then measures the distance between them, so it lacks the precision of comparing the two side by side. Reranking compensates for this by re-sorting the retrieved candidates with a more precise model. A cross-encoder reads the question and a chunk together as a pair and scores their relevance, and that score re-orders a widely fetched pool of candidates.

rerank.py
from sentence_transformers import CrossEncoder

# A multilingual cross-encoder reranker.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def search_with_rerank(question: str, top_k: int = 5) -> list:
    candidates = hybrid_search(question, top_k=30)        # stage 1: fetch wide, 30 candidates
    pairs = [(question, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)                       # stage 2: precise pairwise scoring
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]                  # only the top 5 go to generation

The structure is the point. Stage 1 retrieval is fast and wide; stage 2 reranking is slow but precise — a division of labor. Stage 1 handles recall; stage 2 handles precision. A cross-encoder runs the model on every pair, so it is too slow for the full chunk set, but it handles 30 or so candidates just fine.
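A back-of-the-envelope sketch shows why the split works. It counts only neural-model forward passes per query and assumes chunk embeddings are computed ahead of time; the function name and the 100,000-chunk corpus are made up for illustration.

```python
def forward_passes_per_query(num_chunks: int, wide_k: int = 30) -> dict[str, int]:
    """Rough count of model forward passes needed to answer one query."""
    return {
        # Embedding search: encode the query once; chunk vectors are precomputed.
        "embedding_only": 1,
        # Cross-encoder over everything: one pass per (question, chunk) pair.
        "cross_encoder_only": num_chunks,
        # Two-stage: one query embedding plus one pass per stage 1 candidate.
        "two_stage": 1 + min(wide_k, num_chunks),
    }

print(forward_passes_per_query(100_000))
# → {'embedding_only': 1, 'cross_encoder_only': 100000, 'two_stage': 31}
```

The two-stage cost depends on the size of the candidate pool, not on the size of the corpus, so it stays flat as the document set grows.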

Instead of a dedicated reranker, LLM reranking — asking an LLM like Claude “is this chunk relevant to this question” — is also possible. It is more flexible but slower and more expensive. I recommend starting with a dedicated reranker and considering LLM reranking only when you find areas where it falls short.
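For comparison, LLM reranking could look roughly like the sketch below. The prompt wording, the 0 to 10 scale, and the `complete` callable (a function that sends a prompt to a model and returns its text reply) are all assumptions for illustration, not a recommended setup.

```python
import re
from typing import Callable

PROMPT = (
    "Question: {question}\n\nChunk: {chunk}\n\n"
    "Rate how relevant the chunk is to the question on a scale of 0 to 10. "
    "Reply with the number only."
)

def parse_score(reply: str) -> float:
    """Pull the first number out of the reply; treat anything else as 0."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def llm_rerank(
    question: str,
    candidates: list[dict],
    complete: Callable[[str], str],
    top_k: int = 5,
) -> list[dict]:
    """Score each candidate with one LLM call, then keep the top_k."""
    scored = []
    for c in candidates:
        reply = complete(PROMPT.format(question=question, chunk=c["text"]))
        scored.append((parse_score(reply), c))
    # Sort on the score only; ties keep their stage 1 order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

In practice `complete` would wrap a `client.messages.create(...)` call like the ones above. That means one LLM call per candidate, which is where the extra latency and cost come from.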

How far to stack

Turn on every technique from Part 2 onward and the pipeline becomes: rewriting → (multi-query) → hybrid search → reranking → generation. Each stage adds latency and cost, so turning everything on is not the goal. The goal is to keep only the stages that move the numbers in the golden-set measurement from Part 1. In practice, two starting points give the best effect-to-cost ratio: rewriting (if it is a chatbot) and reranking.
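One way to run that comparison is sketched below. It assumes the golden set is a list of (question, expected chunk ID) pairs and that each pipeline variant is a function returning chunk IDs; both shapes, the function names, and the 0.02 threshold are hypothetical, so adapt them to whatever Part 1's harness actually uses.

```python
from typing import Callable

SearchFn = Callable[[str], list[str]]

def hit_rate(search: SearchFn, golden_set: list[tuple[str, str]]) -> float:
    """Fraction of golden questions whose expected chunk is retrieved."""
    hits = sum(1 for question, expected in golden_set if expected in search(question))
    return hits / len(golden_set)

def stage_gains(
    baseline: SearchFn,
    variants: dict[str, SearchFn],
    golden_set: list[tuple[str, str]],
) -> dict[str, float]:
    """Hit-rate gain of each variant pipeline over the baseline."""
    base = hit_rate(baseline, golden_set)
    return {name: hit_rate(fn, golden_set) - base for name, fn in variants.items()}

def stages_to_keep(gains: dict[str, float], min_gain: float = 0.02) -> list[str]:
    """Keep only the stages whose gain clears the threshold."""
    return [name for name, gain in gains.items() if gain >= min_gain]
```

Running each stage as its own variant against the same baseline shows which ones earn their latency and which ones only add cost.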

Where people commonly trip up

  • Generating the answer from the rewritten question — the transformed question is for retrieval. Use it for generation too, and you end up answering a question the user never asked.
  • Fetching narrow, then reranking — reranking 5 candidates only shuffles their order; no new correct chunk can enter. Reranking presupposes a wide stage 1 (20-50 candidates).
  • Using a big model for transformation — rewriting and query expansion are simple tasks, so a small model is enough. If the auxiliary step costs more than the main answer, the tail is wagging the dog.
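The second pitfall is easy to see with toy data. In this sketch the chunk IDs and relevance scores are invented: the correct chunk `c12` sits at position 12 in the stage 1 ranking, and the reranker is assumed to score it highest.

```python
def rerank_top(
    stage1_ranking: list[str],
    relevance: dict[str, float],
    wide_k: int,
    top_k: int = 5,
) -> list[str]:
    """Rerank only the first wide_k stage 1 candidates, then keep top_k."""
    pool = stage1_ranking[:wide_k]
    return sorted(pool, key=lambda cid: relevance.get(cid, 0.0), reverse=True)[:top_k]

stage1 = [f"c{i}" for i in range(1, 41)]   # c1..c40 in stage 1 order
relevance = {"c12": 0.98, "c3": 0.40}      # hypothetical reranker scores

narrow = rerank_top(stage1, relevance, wide_k=5)
wide = rerank_top(stage1, relevance, wide_k=30)

print("c12" in narrow)  # → False: it never entered the pool
print(wide[0])          # → c12
```

However good the reranker is, it can only reorder what stage 1 hands it, so the width of the pool sets the ceiling.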

Wrapping up

In this post we reinforced the front and back ends of retrieval.

  • Conversational questions become standalone questions through rewriting, and questions with a large vocabulary gap are searched from several angles with multi-query.
  • Reranking narrows a widely fetched pool of candidates precisely with a cross-encoder. Stage 1 handles recall, stage 2 handles precision.
  • Rather than turning on every technique, keep only the stages that raise the golden-set numbers.

That covers retrieval quality — the job of fetching the right chunks. Next is the generation side. We tackle answers that go wrong even when given the right chunks, or that make up content not in the chunks. The next post is “Advanced RAG #5: Reducing Hallucinations with Citations.”
