English · Español

Lab 02 — BM25 baseline + RRF hybrid¶

🇪🇸 BM25 es búsqueda léxica clásica: pondera palabras por TF-IDF, normaliza por longitud, devuelve ranking. Es el otro lado del bi-encoder. Si combinas ambos con Reciprocal Rank Fusion, el ranking conjunto suele superar a cada uno por separado. Esta lab construye los dos sistemas y prueba la hipótesis.

Objective¶

Implement BM25 over the grammar-rule KB, evaluate it standalone, then fuse with the dense retriever from Lab 01 using Reciprocal Rank Fusion (RRF). Demonstrate the hybrid beats both standalone retrievers by ≥ 5pp on hit@5.

Setup¶

Lab 01 done (dense retriever in src/minirag/retrieve.py; eval set with 30 queries).
Pure Python; no external BM25 library (we implement it for understanding).
theory/03-hybrid-search-and-reranking.md covers the math.

The BM25 formula¶

For a query \(q = \{q_1, q_2, ..., q_n\}\) and a document \(D\):

\[\text{BM25}(q, D) = \sum_{i=1}^n \text{IDF}(q_i) \cdot \frac{f(q_i, D)(k_1 + 1)}{f(q_i, D) + k_1 (1 - b + b \cdot |D| / \text{avgdl})}\]

Where:

\(f(q_i, D)\) = term frequency of \(q_i\) in \(D\).
\(|D|\) = document length in tokens.
\(\text{avgdl}\) = average document length across the corpus.
\(\text{IDF}(q_i) = \log\left( \frac{N - \text{df}(q_i) + 0.5}{\text{df}(q_i) + 0.5} + 1 \right)\) where \(N\) = corpus size, \(\text{df}(q_i)\) = document frequency.
\(k_1 = 1.5\), \(b = 0.75\) — standard defaults.

The intuition: words common in this doc but rare in the corpus contribute most; long documents are slightly penalized.

Tasks¶

Part A — Implement `src/minirag/bm25.py`¶

import math
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class BM25:
    chunk_ids: list[str]
    tokens_per_doc: list[list[str]]
    avgdl: float
    df: Counter         # term → number of docs containing it
    N: int              # corpus size
    k1: float = 1.5
    b: float = 0.75

    @classmethod
    def from_chunks(cls, chunks) -> "BM25":
        ids = []
        tokens_per_doc = []
        df = Counter()
        for c in chunks:
            ids.append(c.chunk_id)
            t = tokenize(c.title + " " + c.body)
            tokens_per_doc.append(t)
            for term in set(t):
                df[term] += 1
        avgdl = sum(len(t) for t in tokens_per_doc) / len(tokens_per_doc)
        return cls(ids, tokens_per_doc, avgdl, df, N=len(ids))

    def idf(self, term: str) -> float:
        return math.log((self.N - self.df[term] + 0.5)
                        / (self.df[term] + 0.5) + 1.0)

    def score(self, query_tokens: list[str], doc_idx: int) -> float:
        doc = self.tokens_per_doc[doc_idx]
        doc_len = len(doc)
        tf = Counter(doc)
        s = 0.0
        for q in query_tokens:
            if q not in self.df:
                continue
            num = tf[q] * (self.k1 + 1)
            den = tf[q] + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
            s += self.idf(q) * (num / den)
        return s

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        q = tokenize(query)
        scores = [(self.chunk_ids[i], self.score(q, i)) for i in range(self.N)]
        scores.sort(key=lambda x: -x[1])
        return scores[:k]


TOKEN_RE = re.compile(r"\w+", re.UNICODE)

def tokenize(text: str) -> list[str]:
    return [t.lower() for t in TOKEN_RE.findall(text)]

Notes:

No stemming for Phase 29 — the KB is small enough that exact-form matching is acceptable. Stretch goal: Spanish stemmer.
No stop-word removal for the same reason. Stretch: try Spanish stop-words.
Token regex matches alphanumerics including accented characters (re.UNICODE).

Part B — Evaluate BM25 standalone¶

Reuse Lab 01's evaluate function with BM25Retriever wrapping the BM25 class:

class BM25Retriever:
    def __init__(self, kb_path):
        from .chunk import load_chunks
        chunks = load_chunks(kb_path)
        self.bm25 = BM25.from_chunks(chunks)
    def search(self, query, k=5):
        return self.bm25.search(query, k)

Expected: hit@5 of BM25 alone is somewhere in [0.40, 0.65]. Strong on keyword-overlap queries; weak on paraphrased / cross-language queries.

Part C — Implement RRF in `src/minirag/retrieve.py`¶

def reciprocal_rank_fusion(results_a: list[tuple[str, float]],
                           results_b: list[tuple[str, float]],
                           k_rrf: int = 60,
                           top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse two ranked lists. RRF score = sum 1 / (k_rrf + rank).
    Scores from A and B are NOT used — only ranks."""
    score = {}
    for rank, (cid, _) in enumerate(results_a, start=1):
        score[cid] = score.get(cid, 0) + 1.0 / (k_rrf + rank)
    for rank, (cid, _) in enumerate(results_b, start=1):
        score[cid] = score.get(cid, 0) + 1.0 / (k_rrf + rank)
    fused = sorted(score.items(), key=lambda x: -x[1])
    return fused[:top_k]


class HybridRetriever:
    def __init__(self, dense, bm25, k_rrf: int = 60):
        self.dense = dense
        self.bm25 = bm25
        self.k_rrf = k_rrf
    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        n = max(k, 20)
        ra = self.dense.search(query, k=n)
        rb = self.bm25.search(query, k=n)
        return reciprocal_rank_fusion(ra, rb, k_rrf=self.k_rrf, top_k=k)

The k_rrf=60 default is the standard from the RRF paper (Cormack 2009). Try other values in the stretch section.

Part D — Compare three retrievers¶

Run Lab 01's evaluate on all three:

results = {
    "dense": evaluate(DenseRetriever(...), queries),
    "bm25":  evaluate(BM25Retriever(...),  queries),
    "hybrid":evaluate(HybridRetriever(...),queries),
}

Produce a comparison table:

Retriever	hit@1	hit@3	hit@5	MRR
Dense (Lab 01)	0.42	0.58	0.67	0.52
BM25	0.38	0.55	0.60	0.48
Hybrid (RRF, k=60)	0.50	0.70	0.78	0.61

Expected: hybrid beats both alone on hit@5 by ≥ 5pp. If it doesn't, debug:

Are dense and BM25 producing overlapping top-5s for most queries? If so, RRF can't add diversity.
Are dense rankings dominated by a small set of "always-top" chunks? Check the per-query CSV.
Is the k_rrf parameter too low (over-weighting top-1) or too high (everything looks the same)?

Part E — Per-language and per-difficulty breakdowns¶

Slice the eval set by:

Language (EN query / ES query / cross): does dense beat BM25 on cross-language? (Should — dense embeddings bridge languages.)
Difficulty (verbatim-keyword / paraphrased): does BM25 dominate verbatim? Does dense dominate paraphrased? Does hybrid average them gracefully?

Report this in experiments/29-bm25-hybrid/SLICED_REPORT.md.

Part F — Tests in `tests/minirag/test_bm25.py`¶

test_tokenize_accented — tokenize("ÉL trabajó") returns ["él", "trabajó"] (lowercase, accent preserved).
test_bm25_keyword_hit — given 3 docs and a query of one keyword present in 1 doc, that doc ranks first.
test_bm25_idf_rare_wins — given a query with 2 words, one rare (in 1 doc) and one common (in all 3 docs), the doc with the rare word wins.
test_rrf_fusion_combines — given two ranked lists where doc A is rank 2 in both, and doc B is rank 1 in only one, A's RRF score exceeds B's.
test_hybrid_beats_alone_on_eval — on a small synthetic eval, hybrid hit@5 ≥ max(dense, bm25) hit@5.

Deliverable¶

experiments/29-bm25-hybrid/: - REPORT.md — the 3-way comparison table + interpretation. - SLICED_REPORT.md — per-language and per-difficulty breakdowns. - metrics.json — aggregate numbers for all three retrievers. - per_query.csv — three columns of results (dense, bm25, hybrid) for each query. - manifest.json.

Acceptance¶

All 5 tests pass.
Hybrid hit@5 ≥ max(dense, bm25) hit@5 + 5pp.
The slicing analysis shows the predicted pattern: dense > BM25 on paraphrased, BM25 ≥ dense on verbatim, hybrid ≥ both on most slices.
If hybrid does NOT beat by 5pp, the report contains a debugging analysis explaining why (e.g., "BM25 and dense are highly correlated on this small KB").

Pitfalls¶

Tokenization mismatch between BM25 and the dense embedder. BM25 uses your tokenize function; the embedder uses its own tokenizer. They're supposed to differ — BM25 is lexical, dense is semantic. Don't try to unify them.
Including chunk metadata in the BM25 text. If you tokenize the JSON object (e.g., field names like "language"), every chunk has those words and BM25's IDF is destroyed. Tokenize only title + body.
RRF score being too sensitive to k_rrf. Default 60 is robust. If you go to 5, top-1 dominates. If you go to 1000, everything is equal.
Hybrid worse than alone. If you see this, either dense and BM25 are pointing to different wrong answers (combining them spreads the mass), or one retriever is contributing pure noise. Diagnose with the per-query CSV.
Spanish stop-words count as content. Words like de, el, la, que are everywhere in Spanish queries; without stop-word removal, BM25's IDF down-weights them but they still slightly contaminate the score. Acceptable at N=50.

Stretch¶

Add Spanish stemming (Snowball) to BM25's tokenizer. Measure the per-language hit-rate change.
Try different k_rrf values (5, 20, 60, 200). Plot hit@5 vs k_rrf.
Implement alternative fusion: linear combination of normalized scores. Compare to RRF.
Add a cross-encoder reranker (theory 03) on top of the hybrid's top-20. Measure the lift.

Next lab: lab/03-end-to-end-rag.md.