Advanced RAG #3: Hybrid Search — Combining Vectors and Keywords

In Part 2 we refined the chunks; now it is time to look at how we search. So far, retrieval has relied on embedding similarity alone, that is, semantic search. Semantic search is good at recognizing that “refund” and “payment cancellation” mean the same thing, but its weakness is just as clear: it misses questions that hinge on an exact string, such as codes like ORD-2026-0001, product names, or people’s names. If the diagnosis in Part 1 showed proper-noun questions failing disproportionately, this post is the prescription.

The character of the two searches

| | Vector (semantic) search | Keyword search |
|---|---|---|
| Strengths | Synonyms, different phrasing (“refund” = “payment cancellation”) | Exact matches (codes, proper nouns, abbreviations) |
| Weaknesses | Rare tokens get buried in the embedding | Misses different phrasing |
| Representative algorithm | Embeddings + cosine similarity | BM25 |

Hybrid search runs both and merges the results. When one side misses, the other can catch it, so retrieval quality that used to swing widely by question type becomes more stable.

Building BM25 keyword search

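BM25 is the standard algorithm for keyword search. It scores documents by combining how often a word appears in a document (term frequency) with how rare the word is across the corpus (inverse document frequency), and it adjusts for document length so that long documents do not win simply by containing more words. The rank_bm25 package provides a ready-made implementation.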
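
To make the scoring concrete, here is a minimal pure-Python sketch of a textbook BM25 variant. It is not the rank_bm25 implementation: the IDF formula is one common non-negative variant, the k1 and b values are just typical choices, and the three toy documents are made up for illustration. It also assumes a non-empty corpus.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Rare terms get a larger weight (non-negative IDF variant).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Term frequency saturates; b controls the length penalty.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up): only the first document contains the order code.
docs = [
    "refund policy for order ORD-2026-0001".split(),
    "refund policy for all orders".split(),
    "shipping policy for all orders".split(),
]
scores = bm25_scores("ORD-2026-0001 policy".split(), docs)
```

The word “policy” appears in every document, so it contributes only a small, equal amount to each score. The order code appears in one document only, so it carries a much larger weight and pushes that document to the top.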

terminal
pip install rank_bm25

bm25_search.py
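from rank_bm25 import BM25Okapi

# `chunks` is assumed to be the list of chunk dicts (each with a "text" field)
# from the earlier parts of this series.
# Tokenization: the demo splits on whitespace. For Korean, a morphological
# analyzer (such as kiwipiepy) raises quality significantly.
tokenized = [chunk["text"].split() for chunk in chunks]
bm25 = BM25Okapi(tokenized)

def keyword_search(question: str, top_k: int = 20) -> list[int]:
    scores = bm25.get_scores(question.split())
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    # A score of exactly 0 means no query token matched the chunk; drop those
    # so that unmatched chunks earn no credit in the later fusion step.
    return [i for i in ranked[:top_k] if scores[i] != 0]   # list of chunk indexes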

In Korean, whitespace splitting treats “환불했어요” (“I got a refund”) and “환불” (“refund”) as different tokens. The demo works as is, but in a real service, tokenizing with a Korean morphological analyzer makes a big difference in BM25 quality.
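
Whitespace splitting also trips over case and punctuation, which matters for exactly the codes hybrid search is meant to catch. The sketch below is a small standard-library tokenizer that lowercases text and keeps hyphenated codes whole. The `simple_tokenize` name and the regular expression are illustrative assumptions, and this is not a substitute for a morphological analyzer: it does nothing about attached Korean endings.

```python
import re

# Runs of ASCII letters, digits, or Hangul syllables, optionally joined by
# hyphens so that codes like ORD-2026-0001 stay a single token.
TOKEN_PATTERN = re.compile(r"[0-9A-Za-z가-힣]+(?:-[0-9A-Za-z가-힣]+)*")

def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and drop surrounding punctuation."""
    return [match.lower() for match in TOKEN_PATTERN.findall(text)]

question = "Where is order ORD-2026-0001?"
naive = question.split()              # last token keeps the trailing "?"
cleaned = simple_tokenize(question)   # last token is the bare, lowercased code
```

With plain `split()`, the query token `ORD-2026-0001?` does not equal the document token `ORD-2026-0001`, so BM25 sees no match. Whatever tokenizer you choose, apply the same one to both the chunks and the question.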

Fusing the results with RRF

Vector search and BM25 score on different scales (cosine similarity is bounded between -1 and 1, while BM25 has no fixed upper bound), so you cannot add the scores directly. Instead, we merge by rank. The standard method is RRF (Reciprocal Rank Fusion): for each chunk, take the reciprocal of its rank in each result list (offset by a constant k) and sum the reciprocals. It is simple but proven.

rrf_fusion.py
def rrf(rankings: list[list[int]], k: int = 60, top_k: int = 5) -> list[int]:
    """rankings: lists of chunk indexes returned by each retriever, best first."""
    scores: dict[int, float] = {}
    for ranked in rankings:
        for rank, idx in enumerate(ranked):
            # enumerate() is 0-based, so rank + 1 is the 1-based position.
            scores[idx] = scores.get(idx, 0.0) + 1 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]

def hybrid_search(question: str, top_k: int = 5) -> list[dict]:
    # vector_search_ids is assumed to be the vector retriever from the earlier
    # parts, returning chunk indexes in rank order.
    vec = vector_search_ids(question, top_k=20)     # top 20 from vector search
    kw = keyword_search(question, top_k=20)         # top 20 from BM25
    return [chunks[i] for i in rrf([vec, kw], top_k=top_k)]

The constant k=60 is the conventional default. Its job is to dampen the gap between adjacent ranks near the top, so that a single first-place finish does not overwhelm everything else. In most cases you can leave it as is. The part worth noticing is that each retriever fetches far more than the final count (20 each) before fusion. A chunk that ranks mid-tier in both searches can then outrank a chunk that only one retriever placed at the top.
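
A quick calculation shows both effects. The sketch below uses the same formula as the code above, written with 1-based ranks; the `rrf_score` helper and the example ranks are illustrative only.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum of reciprocal ranks; ranks are 1-based positions in each list."""
    return sum(1 / (k + r) for r in ranks)

both_mid = rrf_score([8, 8])   # 8th in vector search and 8th in BM25
one_top = rrf_score([1])       # 1st in one retriever, absent from the other

# With a very small k, the single first place wins instead.
both_mid_small_k = rrf_score([8, 8], k=1)
one_top_small_k = rrf_score([1], k=1)
```

With k=60, the chunk ranked 8th in both lists scores 2/68 ≈ 0.0294, ahead of the chunk ranked 1st in only one list at 1/61 ≈ 0.0164. With k=1 the order flips (2/9 ≈ 0.222 versus 1/2 = 0.5), which is what "smoothing" means here: a larger k rewards agreement between retrievers over a single high rank.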

When it helps, and when it does not

The impact of hybrid search depends on your question distribution.

  • Big impact: when many questions contain product codes, error codes, person or product names, or internal abbreviations. These tokens have weak discriminative power in embedding space, so vector search often misses them.
  • Small impact: when questions are mostly descriptive and use different vocabulary from the documents. Keyword search adds little here, and the query transformation in Part 4 is the more effective remedy.

So the decision to adopt hybrid search should also be based on the golden set from Part 1. Split the golden set into “proper-noun questions” and “descriptive questions,” measure each group separately, and you will see exactly which question group hybrid search lifted, and by how much.
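
One way to run that comparison is sketched below. The golden-set fields (`question`, `group`, `relevant_ids`) are a hypothetical format, the sketch assumes each retriever returns chunk indexes, and the two stub retrievers return made-up results purely to show the mechanics. Swap in your own vector-only and hybrid retrievers.

```python
from collections import defaultdict
from typing import Callable

def hit_rate_by_group(
    golden_set: list[dict], search: Callable[[str], list[int]]
) -> dict[str, float]:
    """Share of questions per group with at least one relevant chunk retrieved."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for item in golden_set:
        totals[item["group"]] += 1
        retrieved = set(search(item["question"]))
        if retrieved & set(item["relevant_ids"]):
            hits[item["group"]] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical golden set with one question per group.
golden_set = [
    {"question": "status of ORD-2026-0001", "group": "proper_noun", "relevant_ids": [3]},
    {"question": "how do I cancel a payment", "group": "descriptive", "relevant_ids": [7]},
]

def stub_vector_only(question: str) -> list[int]:
    return [7, 1, 2]   # stub: pretends vector search never surfaces chunk 3

def stub_hybrid(question: str) -> list[int]:
    return [3, 7, 1]   # stub: pretends hybrid search surfaces both chunks

before = hit_rate_by_group(golden_set, stub_vector_only)
after = hit_rate_by_group(golden_set, stub_hybrid)
```

Comparing `before` and `after` group by group, rather than as a single average, is what reveals whether the gain is concentrated in the proper-noun questions.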

If you are already using a vector database

The implementation above is meant to show the principle; in practice you often do not need to build it yourself. Many of the vector databases introduced in Part 7 of the LLM App Development series (Qdrant, pgvector combinations, and so on) provide keyword search or hybrid search as a built-in feature. Knowing the principle lets you read what those options actually adjust, and decide between rolling your own and using the built-in feature.

Common stumbling blocks

  • Adding raw scores: cosine similarity and BM25 scores are on different scales, so adding them directly lets one side dominate. Merge by rank (RRF) or normalize first.
  • Fetching candidates too narrowly before fusion: if each retriever only returns the final count, fusion loses much of its point. Fetch generously (3-4x the final count) and then merge.
  • Forgetting tokenization: tokenization is half of BM25 quality. Splitting Korean on whitespace alone misses words with attached particles. Consider a morphological analyzer.
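
For the first stumbling block, the "normalize first" alternative to RRF can be sketched as min-max scaling followed by a weighted sum. This is a different technique from the RRF used in this post; the function names, the `alpha` weight, and the example scores are all illustrative assumptions.

```python
def min_max(scores: dict[int, float]) -> dict[int, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {idx: 0.0 for idx in scores}
    return {idx: (s - lo) / (hi - lo) for idx, s in scores.items()}

def weighted_fusion(
    vec_scores: dict[int, float],
    kw_scores: dict[int, float],
    alpha: float = 0.5,
    top_k: int = 5,
) -> list[int]:
    """alpha weights the vector side, (1 - alpha) the keyword side."""
    vec_n, kw_n = min_max(vec_scores), min_max(kw_scores)
    combined = {
        idx: alpha * vec_n.get(idx, 0.0) + (1 - alpha) * kw_n.get(idx, 0.0)
        for idx in set(vec_n) | set(kw_n)
    }
    # Sort by score descending, breaking ties by chunk index.
    return sorted(combined, key=lambda i: (-combined[i], i))[:top_k]

# Made-up scores: cosine similarities on the left, BM25 scores on the right.
vec_scores = {0: 0.82, 1: 0.80, 2: 0.35}
kw_scores = {1: 12.4, 2: 9.1, 3: 0.4}

raw_sum = {
    i: vec_scores.get(i, 0.0) + kw_scores.get(i, 0.0)
    for i in set(vec_scores) | set(kw_scores)
}
raw_order = sorted(raw_sum, key=lambda i: (-raw_sum[i], i))
normalized_order = weighted_fusion(vec_scores, kw_scores, top_k=4)
```

In the raw sum, chunk 2 outranks chunk 0 even though chunk 0 is the best vector match, because the BM25 numbers are simply larger. After normalization, chunk 0 moves back ahead of chunk 2. Min-max scaling is sensitive to outliers and adds a weight to tune, which is one reason rank-based fusion is often the simpler starting point.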

Wrapping up

In this post we combined vector search and keyword search.

  • Semantic search is strong on synonyms and weak on proper nouns, and BM25 is the opposite. Hybrid search lets each fill in the other’s gaps.
  • Two result sets with different score scales are fused by rank using RRF. Fetch candidates generously before merging.
  • The impact depends on the question distribution, so split the golden set by question type, measure, and decide on adoption.

The quality of the candidates that retrieval brings back has improved. Next, it is time to refine the question itself and the order of the candidates. In the next post, “Advanced RAG #4: Query Transformation and Reranking,” we reinforce the front and back ends of retrieval.
