
03 — Hybrid Search and Reranking

Dense embeddings understand semantics; BM25 understands exact words. Combining both (hybrid search, fused via RRF) consistently beats either one alone on short factual queries. The reranker is a second precision pass over the top-k candidates. Three cost tiers: BM25 (cheap) → dense (medium) → cross-encoder (expensive). Each one contributes where the previous one falls short.

Dense retrieval's weakness

Dense embeddings excel at semantic similarity but can underperform on:

Rare or out-of-distribution terms. "JIT", "RoPE", or a specific verb in our KB may be rare or absent in the embedding model's training data. The model then has little signal for placing the term, and its embedding can land far from related content.
Exact-match queries. "What's the past participle of eat?": the literal token eat is a strong signal. Dense retrieval might still find the right chunk (modulo the rare-term concern) but might also drift to chunks about a different verb with a similar meaning.
Short queries with one informative token. In the query above, "eat" carries most of the signal, but a dense embedding pools all tokens into one vector, so the informative token is diluted.

BM25: the lexical baseline

BM25 (Best Match 25) is a probabilistic ranking function from the 1990s. Conceptually: "documents with more occurrences of the query terms rank higher, weighted by how rare those terms are in the corpus".

Formula:

\[ \text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{tf}_{t,d} \cdot (k_1 + 1)}{\text{tf}_{t,d} + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} \]

Where:
t ranges over the query terms.
tf_{t,d} = how often term t appears in doc d.
IDF(t) = log((N - df_t + 0.5) / (df_t + 0.5) + 1), where df_t = number of docs containing t and N = total number of docs.
|d| = length of doc d in tokens; avgdl = average doc length in the corpus.
k_1 ≈ 1.2 controls term-frequency saturation.
b ≈ 0.75 controls length normalization.

No training. BM25 is statistical, not learned. It's been the standard for 30+ years and remains a strong baseline.
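
The formula can be sketched directly in a few lines of Python. This is a minimal illustration, not the Phase 29 implementation: the function name `bm25_scores` and the toy corpus are assumptions made up for the example, and the documents are assumed to be tokenized already.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every document against the query with the BM25 formula above."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(doc) for doc in docs_tokens) / n_docs

    # df[t] = number of documents that contain term t at least once.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # a term absent from the doc contributes nothing
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (made up for illustration), already tokenized.
docs = [
    ["the", "past", "participle", "of", "eat", "is", "eaten"],
    ["the", "past", "tense", "of", "go", "is", "went"],
    ["the", "plural", "of", "mouse", "is", "mice"],
]
scores = bm25_scores(["participle", "eat"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document contains either query term, so it is the only one with a non-zero score. Note that "eaten" does not match "eat" here: lexical matching is exact at the token level, which is why the tokenizer choice matters.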

Tokenization for BM25

BM25 operates on tokens. The choice of tokenizer matters:

Whitespace + lowercase: simple, English-fine.
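Sub-word (BPE / WordPiece): somewhat more robust to morphology, since inflected forms such as eat, eats, and eaten can share a subword piece. The split depends on the learned vocabulary, so a shared root token is not guaranteed.
Stemming (Porter, Snowball): reduces eating and eats to eat. Lossy but useful for small corpora.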

For Phase 29 we use whitespace + lowercase + Porter stemming. The KB is small enough that stemming makes a real difference for verb-related queries.
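
A rough sketch of that tokenization pipeline follows. Two assumptions to flag: `crude_stem` is a deliberately simple suffix stripper standing in for a real Porter stemmer (it is not Porter and will mis-stem many words), and the regex also drops punctuation so that "eat?" matches "eat", which goes slightly beyond pure whitespace splitting.

```python
import re

# Stand-in for a real stemmer: strips a few common suffixes only.
# This is NOT the Porter algorithm; it just illustrates where stemming sits.
_SUFFIXES = ("ing", "es", "s")


def crude_stem(token):
    for suffix in _SUFFIXES:
        # Keep at least 3 characters so short words like "is" stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    # Lowercase, then keep runs of letters and digits (punctuation is dropped).
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words]


tokens = tokenize("Eating eats EAT")
```

With this sketch, all three surface forms collapse to the same token, so a query containing "eat" can match a chunk that only says "eating". In practice a library stemmer would replace `crude_stem`.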

Stopwords

Words like "is", "the", and "what" appear in nearly every query and in most documents. Because they occur in so many documents, their IDF is near zero, so even a high tf adds almost nothing to a BM25 score. BM25 therefore down-weights stopwords automatically.

For Phase 29 we don't explicitly remove stopwords. The IDF handles it.
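
To make the IDF argument concrete, here is a small numeric check using the IDF formula from the BM25 section and N = 50 docs. The document frequencies (50 for a stopword, 1 for a rare term) are assumed values chosen for illustration.

```python
import math


def idf(df, n_docs):
    """IDF as defined in the BM25 section."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


N = 50
idf_the = idf(50, N)   # a term that appears in every doc
idf_rare = idf(1, N)   # a term that appears in exactly one doc
ratio = idf_rare / idf_the
```

Under these assumptions the stopword's IDF is about 0.01 and the rare term's is about 3.5, a gap of more than two orders of magnitude, which is why explicit stopword removal changes little.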

Why hybrid wins

The empirical observation across many RAG benchmarks: dense + sparse together > either alone.

Dense: captures synonyms, paraphrases, latent semantic structure.
Sparse (BM25): captures exact-match, rare terms, lexical precision.

The failure modes of dense and BM25 are largely orthogonal: dense can miss exact-match queries that BM25 catches; BM25 can miss paraphrased queries that dense catches.

For our grammar-rule KB: "past participle of eat" has both a literal anchor (eat) and a semantic anchor (the concept of past participle). BM25 finds the literal; dense finds the semantic. Together they cover both.

Fusion: Reciprocal Rank Fusion (RRF)

The standard way to combine two retrieval systems:

\[ \text{RRF\_score}(d) = \sum_{i \in \text{systems}} \frac{1}{k + \text{rank}_i(d)} \]

Where k = 60 is a smoothing constant (the canonical value from Cormack et al., 2009).

Each retrieval system produces a ranking; doc d has rank rank_i(d) in system i.
A doc that appears highly in both systems gets a high fused score.
A doc that appears highly in one but not the other gets a moderate score.
A doc that appears in neither gets zero.

RRF doesn't require score calibration between the two systems. BM25 scores and cosine similarities are on different scales — RRF works on ranks only, sidestepping this problem.

The result is a single fused top-k.
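
A minimal sketch of RRF, assuming each system hands over a plain list of doc ids ordered best first (the pipeline's own `rrf_fuse` may take `(doc_id, score)` pairs instead). A doc missing from a system's list simply contributes nothing for that system. The doc ids below are made up.

```python
from collections import defaultdict


def rrf_fuse(*rankings, k=60, top_n=10):
    """Fuse ranked lists of doc ids; returns [(doc_id, rrf_score)], best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    return fused[:top_n]


dense_ids = ["d3", "d1", "d7"]   # hypothetical dense ranking
bm25_ids = ["d1", "d9", "d3"]    # hypothetical BM25 ranking
fused = rrf_fuse(dense_ids, bm25_ids)
fused_ids = [doc_id for doc_id, _ in fused]
```

Here d1 (ranks 2 and 1) edges out d3 (ranks 1 and 3), and both beat d9 and d7, which appear in only one list. No raw score from either system is used.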

Other fusion methods

Weighted score sum: α · cos_sim + (1-α) · BM25_normalized. Requires score normalization. Harder to tune.
CombSUM, CombMNZ: older IR fusion methods. Less common.
Learned fusion (a small neural model on top of ranks/scores): more powerful but requires training data.

RRF is a common default for hybrid retrieval in 2024-2026 production RAG, largely because it needs no tuning or score calibration. We use it.
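
For contrast, the weighted score sum from the list above might look like the sketch below. It assumes min-max normalization applied to both score sets (one of several possible choices, and the formula above only normalizes BM25), treats a doc missing from one system as scoring 0 there, and uses made-up scores.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_fuse(dense_scores, bm25_scores, alpha=0.5):
    """alpha weights the dense side; (1 - alpha) weights BM25."""
    dense_n, bm25_n = min_max(dense_scores), min_max(bm25_scores)
    doc_ids = set(dense_n) | set(bm25_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


dense = {"d1": 0.82, "d3": 0.80, "d7": 0.40}   # hypothetical cosine similarities
bm25 = {"d1": 7.5, "d9": 3.0, "d3": 1.5}       # hypothetical BM25 scores
weighted = weighted_fuse(dense, bm25, alpha=0.5)
weighted_ids = [doc_id for doc_id, _ in weighted]
```

The result depends on both `alpha` and the normalization scheme, and a single outlier score can compress everything else toward zero. That sensitivity is the tuning burden RRF avoids.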

When NOT to use hybrid

If your corpus has a very narrow vocabulary that the embedding model wasn't pretrained on (e.g., medical jargon, code, your company's product names), dense alone may be misleading. BM25 alone may be more reliable. In that case: pure BM25, or fine-tune the embedding model on in-domain data (out of Phase 29 scope).

Conversely, if your corpus is "well-covered" by the embedding model's pretraining (e.g., general English text), dense alone may suffice. BM25 adds little.

For our grammar-rule KB: somewhere in between. The verbs are common English; the embedding model was trained on general English. But the specific queries like "past participle of X" benefit from the lexical anchor of X. Hybrid wins by ~5pp hit-rate@5 (lab 02 confirms).

Reranking with a cross-encoder

After retrieval (whether dense, BM25, or hybrid), you have top-k candidates. The reranker rescores them with a cross-encoder:

\[ \text{rerank\_score}(q, d) = C(q, d) \]

Where C is a cross-encoder. For each (q, d) in top-k, run one forward pass; sort by the resulting scores.

Tradeoff:

Pro: Cross-encoders attend across q and d. They see token-level interactions. Much higher precision than bi-encoders.
Con: One forward pass per candidate. For top-100: 100 forward passes, ~5-10 seconds on CPU. Limits how many candidates you rerank.

Standard practice: bi-encoder retrieves top-100 → cross-encoder reranks → top-3 or top-5 fed to the reader.
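
The rerank step itself is just "score each (query, doc) pair, then sort". The sketch below keeps the scorer pluggable. `overlap_score` is a toy word-overlap placeholder, not a cross-encoder; in a real pipeline `score_fn` would wrap the cross-encoder's forward pass. The candidate texts are made up.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """candidates: list of (doc_id, text). Returns [(doc_id, score)], best first."""
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda pair: -pair[1])  # stable sort keeps retrieval order on ties
    return scored[:top_n]


def overlap_score(query, text):
    # Placeholder scorer: fraction of query words that appear in the text.
    q_words = set(query.lower().split())
    d_words = set(text.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = [
    ("d7", "the past tense of go is went"),
    ("d1", "the past participle of eat is eaten"),
    ("d4", "the plural of mouse is mice"),
]
top2 = rerank("past participle of eat", candidates, overlap_score, top_n=2)
```

The retriever put d7 first, but the pairwise scorer moves d1 to the top. The cost is one `score_fn` call per candidate, which is what limits how deep the candidate list can be.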

Choice of reranker

For Phase 29 we use cross-encoder/ms-marco-MiniLM-L-6-v2 (~22M params, MS-MARCO trained, English). It is fast on CPU for short inputs (on the order of tens of candidates per second; measure on your own hardware).

Alternatives: BGE rerankers, Cohere rerank API, custom-trained. Out of scope.

When the reranker doesn't help

If your bi-encoder is already very accurate (high recall@k and high precision@1), the reranker's marginal gain is small. Measure first.

For our 50-doc KB with all-MiniLM-L6-v2, the retrieval recall@10 is essentially 100% (every gold doc is in top-10 of 50). The reranker rearranges the top-10. Whether the rearrangement moves the gold doc up depends on the cross-encoder quality.

Lab 03 measures: reranking improves hit-rate@1 by ~3pp; hit-rate@5 unchanged (since gold was already in top-5).
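
A small sketch of the hit-rate@k metric shows why reranking can move hit-rate@1 while leaving hit-rate@5 untouched. The function name and the two-query rankings are made up for illustration; they are not lab data.

```python
def hit_rate_at_k(ranked_lists, gold_ids, k):
    """Fraction of queries whose gold doc appears in the top-k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


gold = ["d1", "d4"]  # one gold doc per query

# Hypothetical rankings before and after reranking the same candidates.
before = [["d2", "d1", "d5"], ["d4", "d9", "d3"]]
after = [["d1", "d2", "d5"], ["d4", "d3", "d9"]]

h1_before = hit_rate_at_k(before, gold, k=1)
h1_after = hit_rate_at_k(after, gold, k=1)
h5_before = hit_rate_at_k(before, gold, k=5)
h5_after = hit_rate_at_k(after, gold, k=5)
```

Reranking only permutes the candidates it was given, so any metric with k at least as large as the candidate list cannot change; only the small-k metrics can.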

Putting it together: the Phase 29 retrieval stack

# 1. Tokenize query (for BM25)
q_tokens = tokenize(query)

# 2. Embed query (for dense)
q_emb = encoder.encode([query])  # (1, 384)
q_emb = normalize(q_emb)

# 3. Retrieve top-10 dense
dense_results = faiss_index.search(q_emb, k=10)  # [(doc_id, score), ...]

# 4. Retrieve top-10 BM25
bm25_results = bm25.search(q_tokens, k=10)  # [(doc_id, score), ...]

# 5. RRF fuse → top-10 fused
fused = rrf_fuse(dense_results, bm25_results, k=60)

# 6. Rerank top-10 with cross-encoder → top-3
# Assumes fused is a list of doc ids, best first; pair each doc text with the query string.
candidates = [(query, doc_text(d)) for d in fused[:10]]
rerank_scores = cross_encoder.predict(candidates)
top3 = sorted(zip(fused[:10], rerank_scores), key=lambda x: -x[1])[:3]

# 7. Pass top-3 to reader as context

Each step is independently testable.

Latency budget

For a target of <500ms end-to-end on CPU:

Embed query: ~50ms.
FAISS-flat search (50 docs): ~1ms.
BM25 search (50 docs): ~5ms.
RRF fuse: <1ms.
Cross-encoder rerank (10 candidates): ~200ms.
Reader (MiniGPT, 50-token output): ~100ms with KV-cache.
Total: ~360ms. Fits.
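
The same budget as a quick arithmetic check. The stage names are illustrative labels, and the "<1ms" RRF step is rounded up to 1 ms as an upper bound.

```python
# Per-stage estimates from the list above, in milliseconds.
budget_ms = {
    "embed_query": 50,
    "faiss_flat_search": 1,
    "bm25_search": 5,
    "rrf_fuse": 1,  # "<1ms" rounded up
    "cross_encoder_rerank": 200,
    "reader": 100,
}

TARGET_MS = 500
total_ms = sum(budget_ms.values())
headroom_ms = TARGET_MS - total_ms
slowest_stage = max(budget_ms, key=budget_ms.get)
```

The total comes to 357 ms, leaving 143 ms of headroom, with the cross-encoder rerank as the single largest line item.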

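For larger KBs, FAISS-HNSW keeps dense retrieval sub-linear in corpus size; the reranker and the reader, whose cost depends on candidate count and output length rather than KB size, then dominate the budget.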

Drill problems

Solutions are released at phase open in solutions/03-hybrid-and-reranking-ref.md.

Compute BM25 by hand for a corpus of 3 docs and a 2-term query. (Pick a concrete corpus.)
Two retrieval systems both rank doc D at position 1. RRF score for D? At positions 1 and 10? At positions 1 and "not in top-10"?
The cross-encoder's forward cost is O(L²) in input length L = |q| + |d|. For 100 candidates with L = 64, what's the total compute? How does it scale if L = 256?
If the bi-encoder retrieves the gold doc at position 1 already, does the cross-encoder reranker help? Construct a scenario where it does (gold not at position 1 initially).
Argue: in a domain where the embedding model has no in-domain pretraining (e.g., custom jargon), would you prefer (a) hybrid, (b) BM25-only, (c) fine-tuned dense?

One-paragraph recap

Dense retrieval captures semantic similarity; BM25 captures lexical match. Hybrid search combines them via Reciprocal Rank Fusion, which sums reciprocal ranks across systems, and consistently beats either alone. Reranking with a cross-encoder is a precision step on the top-k candidates; it costs one forward pass per candidate, so the per-query cost is high but bounded by k. The standard production stack: BM25 + dense → RRF top-10 → cross-encoder rerank top-3 → reader. For our small grammar-rule KB, hybrid + reranker is overkill but pedagogically informative: we want Borja to feel each component's contribution. Labs 02 and 03 ablate them; expect hybrid to win by ~5pp on hit-rate@5 and the reranker to add another ~3pp on hit-rate@1.

Next: theory/04-evaluation.md.