The Complete Guide to Hybrid Search in RAG (BM25 + Embeddings + Reranker)
By Dave Ebbelaar
Key Concepts
- Hybrid Retrieval: A search strategy combining sparse (keyword-based) and dense (semantic) retrieval to leverage the strengths of both.
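- BM25 (Best Matching 25): A ranking function that estimates the relevance of documents to a search query from keyword statistics: how often each query term occurs in a document, how rare the term is across the corpus, and how long the document is.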
- Dense Embeddings: Numerical vector representations of text that capture semantic meaning, allowing for retrieval based on context rather than exact keyword matches.
- RRF (Reciprocal Rank Fusion): An algorithm used to combine multiple retrieval results by aggregating their ranks rather than their raw scores.
- Re-ranker (Cross-Encoder): A model that reads the query together with each retrieved document as a pair and outputs a relevance score, which is then used to refine the order of results.
- NDCG (Normalized Discounted Cumulative Gain): A ranking metric that rewards relevant documents more the higher they appear in the results, normalized against the best possible ordering so that scores fall between 0 and 1.
- BEIR Benchmark: A heterogeneous benchmark for information retrieval used to evaluate the effectiveness of retrieval systems.
1. Data Preparation and Exploration
The tutorial uses the FiQA (financial question answering) dataset, part of the BEIR benchmark, which consists of three core components:
- Corpus: The collection of documents (stored as Parquet files).
- Queries: The questions users ask.
- Qrels (relevance judgments): A mapping file linking specific query IDs to relevant corpus IDs (the ground truth).
Methodology:
- Data is loaded using the datasets library and processed via pandas for exploration.
- A filtering step is applied to ensure only queries with associated ground-truth documents are used for evaluation, reducing the dataset from ~1,700 to 648 queries.
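The filtering step can be sketched in a few lines. This is a minimal illustration in plain Python rather than the tutorial's pandas code; the IDs, texts, and field names (query_id, corpus_id, score) are made-up stand-ins for the real dataset columns.

```python
# Toy stand-in for the queries table; IDs and texts are invented.
queries = {
    "q1": "What is an index fund?",
    "q2": "How is a Roth IRA taxed?",
    "q3": "Should I lease or buy a car?",  # has no relevance judgment below
}
# Each row links a query ID to a relevant corpus ID (field names are assumed).
qrels_rows = [
    {"query_id": "q1", "corpus_id": "d1", "score": 1},
    {"query_id": "q2", "corpus_id": "d2", "score": 1},
]


def build_qrels(rows):
    """Group judgments as {query_id: {corpus_id: relevance}}."""
    qrels = {}
    for row in rows:
        qrels.setdefault(row["query_id"], {})[row["corpus_id"]] = row["score"]
    return qrels


def filter_queries(queries, qrels):
    """Keep only queries that have at least one judged document."""
    return {qid: text for qid, text in queries.items() if qrels.get(qid)}


qrels = build_qrels(qrels_rows)
eval_queries = filter_queries(queries, qrels)
print(sorted(eval_queries))  # q3 is dropped because it has no ground truth
```

Queries without ground truth cannot be scored, so leaving them in would only add noise to the evaluation.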
2. Building the Retrieval Pipeline
The system is built in four distinct stages, each designed to be modular and production-ready.
A. BM25 (Sparse Retrieval)
- Implementation: Uses the bm25s library.
- Process: Tokenizes the corpus, removes English stop words, and creates an index.
- Storage: The index is saved locally to disk (approx. 33MB for 57,000 documents), eliminating the need for a dedicated database for smaller-to-medium enterprise use cases.
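To make the tokenize, filter, and index steps concrete, here is a from-scratch sketch of BM25 scoring. It is not the bm25s library's code or API: the class name TinyBM25, the tiny stop-word list, the parameter defaults (k1=1.5, b=0.75), and the particular IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "with", "at"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


class TinyBM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.term_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        # Document frequency: in how many documents each term occurs.
        df = Counter(t for toks in self.doc_tokens for t in set(toks))
        n = len(docs)
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
                    for t, f in df.items()}

    def score(self, query, idx):
        """BM25 score of document idx for the query."""
        tf = self.term_freqs[idx]
        length_norm = 1 - self.b + self.b * len(self.doc_tokens[idx]) / self.avg_len
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                f = tf[term]
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def search(self, query, k=10):
        """Return the top-k (doc_index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


docs = [
    "Index funds track a market index at low cost.",
    "A Roth IRA is funded with after-tax income.",
    "Bond prices fall when interest rates rise.",
]
index = TinyBM25(docs)
results = index.search("low cost index funds", k=3)
print(results[0][0])  # index of the best-matching document
```

A dedicated library does the same work with sparse matrices and a persisted index, which is what makes it practical on tens of thousands of documents.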
B. Dense Embeddings
- Implementation: Uses OpenAI’s text-embedding-3-small model (1536 dimensions).
- Process: Documents are converted into vectors and stored as a numpy array.
- Optimization: The tutorial emphasizes that for corpora under ~1 million chunks, storing vectors in a numpy array on disk is often more efficient than deploying a complex vector database.
- Normalization: Vectors are pre-normalized to unit length, so cosine similarity reduces to a plain dot product and the whole corpus can be scored against a query with a single matrix multiplication.
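The normalization point can be checked with a small sketch. It uses plain Python lists and three-dimensional toy vectors instead of a numpy array of real embeddings, and the helper names (normalize, top_k) are illustrative, not from the tutorial.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Full cosine similarity, for comparison with the shortcut below."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query_vec, doc_matrix, k=2):
    """Score every pre-normalized row against the query with a dot product."""
    q = normalize(query_vec)
    scores = [dot(row, q) for row in doc_matrix]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


raw_docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
doc_matrix = [normalize(v) for v in raw_docs]  # done once, at indexing time
query = [3.0, 1.0, 0.0]

hits = top_k(query, doc_matrix)
print(hits)
# The dot product of unit vectors matches cosine similarity on the raw vectors.
print(abs(hits[0][1] - cosine(query, raw_docs[0])) < 1e-9)
```

With a numpy array the scoring loop collapses to one matrix-vector product, which is why normalizing at indexing time pays off at query time.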
C. Reciprocal Rank Fusion (RRF)
- Purpose: Combines the results of BM25 and Dense retrieval.
- Logic: Raw scores from different retrieval methods are on different scales, so RRF ignores them and scores each document by summing $1 / (k + \text{rank})$ over the result lists in which it appears (where $k=60$). This avoids the "apples to oranges" comparison problem.
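A minimal sketch of the fusion step, assuming each retriever returns a list of document IDs ordered best first. The function name and the toy document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs by summing 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd4', 'd2']
```

Here d1 wins because it is ranked high in both lists (1/62 + 1/61), narrowly ahead of d3 (1/61 + 1/63), while documents found by only one retriever fall behind.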
D. Re-ranking
- Implementation: Uses the Cohere rerank-fast model.
- Process: Takes the top candidates from the RRF stage and performs a final, high-precision re-ordering based on the specific query-document relationship.
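The shape of the re-ranking stage can be sketched without calling any API. In the sketch below, word_overlap_score is a hypothetical stand-in for the hosted re-ranker: a real cross-encoder would return a learned relevance score for each query-document pair, whereas this toy only counts shared words.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order (doc_id, text) candidates by score_fn(query, text), best first."""
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def word_overlap_score(query, text):
    """Hypothetical scorer: fraction of query words that occur in the text."""
    q_words = set(query.lower().split())
    d_words = set(text.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Candidates as they might arrive from the fusion stage (toy data).
candidates = [
    ("d1", "bond prices fall when interest rates rise"),
    ("d2", "index funds track a market index at low cost"),
    ("d3", "low cost index funds beat most active managers"),
]

reranked = rerank("do index funds beat active managers", candidates,
                  word_overlap_score, top_n=2)
print([doc_id for doc_id, _ in reranked])  # ['d3', 'd2']
```

Because the scorer is passed in as a function, the same pipeline code can be pointed at a hosted re-ranker or a local model without other changes.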
3. Evaluation and Optimization
- Metric: The system uses NDCG@10 to measure performance.
- Findings:
- BM25 alone: ~28% score (poor for this specific dataset).
- Dense alone: Significantly higher than BM25.
- Hybrid + Re-ranking: ~47% score.
- Key Insight: The re-ranker provides the most significant performance jump. The author suggests that developers should experiment with their specific corpus to determine if the full stack is necessary or if specific components (like BM25) can be omitted to reduce latency.
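For reference, NDCG@10 can be computed in a few lines. This sketch uses the linear-gain form of DCG (relevance divided by the log of the position); some implementations use an exponential gain instead, and the tutorial's exact evaluation code may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_ids, relevant, k=10):
    """ranked_ids: retrieved doc IDs, best first. relevant: {doc_id: relevance}."""
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevant.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(runs, qrels, k=10):
    """Average NDCG@k over all evaluated queries."""
    return sum(ndcg_at_k(runs[qid], qrels[qid], k) for qid in qrels) / len(qrels)


qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d5": 1}}
runs = {
    "q1": ["d3", "d1", "d2"],  # relevant docs at positions 2 and 3
    "q2": ["d5", "d6"],        # relevant doc at position 1
}

print(round(ndcg_at_k(runs["q1"], qrels["q1"]), 3))  # 0.693
print(ndcg_at_k(runs["q2"], qrels["q2"]))            # 1.0
print(round(mean_ndcg(runs, qrels), 3))
```

The first query scores below 1.0 only because an irrelevant document sits at the top; moving d1 and d2 up one position each would make it perfect.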
4. Practical Application for Custom Data
To implement this in a real-world environment without existing ground truth:
- Generate Synthetic Data: Use an LLM to generate realistic questions for your existing documents.
- Structure: Format the generated data into the Query/Corpus/Relationship structure used in the tutorial.
- Iterate: Use the NDCG evaluation framework to scientifically test changes (e.g., different chunking strategies or embedding models) rather than relying on "vibes."
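The first two steps can be sketched as follows. The generate_question argument stands for whatever LLM call is used; fake_llm_question is a hypothetical placeholder so the example runs offline, and the ID scheme (q1, q2, ...) is an assumption.

```python
def build_eval_set(chunks, generate_question):
    """Create queries and relevance judgments from {corpus_id: text} chunks.

    Each generated question is linked to the chunk it was written from,
    which serves as that question's ground-truth document.
    """
    queries, qrels = {}, {}
    for n, (corpus_id, text) in enumerate(chunks.items(), start=1):
        query_id = f"q{n}"
        queries[query_id] = generate_question(text)
        qrels[query_id] = {corpus_id: 1}
    return queries, qrels


def fake_llm_question(text):
    """Hypothetical placeholder for an LLM call that writes a realistic question."""
    return f"What does this passage say about {text.split()[0].lower()}?"


chunks = {
    "c1": "Refunds are processed within 14 days.",
    "c2": "Invoices are sent monthly.",
}

queries, qrels = build_eval_set(chunks, fake_llm_question)
print(queries["q1"])
print(qrels)
```

The resulting queries and relevance judgments have the same shape as the benchmark data, so the same evaluation loop can be reused unchanged on a private corpus.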
Synthesis
The tutorial demonstrates that a high-performance production RAG system does not necessarily require a complex vector database stack. By building a hybrid pipeline—combining the keyword precision of BM25 with the semantic depth of dense embeddings, fused via RRF, and polished with a re-ranker—engineers can achieve superior retrieval accuracy. The core takeaway is the importance of evaluation: by creating a ground-truth dataset, developers can move from anecdotal testing to a data-driven, optimized retrieval architecture.