Lab 01 — Bi-encoder baseline: dense retrieval with hit-rate@k

Take a pretrained bi-encoder (sentence-transformers), embed the 50 chunks, embed the 30 eval queries, compute cosine similarity, and rank. Measure hit-rate@k and MRR. This is the baseline that BM25 and the hybrid retriever in Lab 02 must beat.

Objective

Implement dense retrieval over the grammar-rule KB using a pretrained bi-encoder. Build a 30-query evaluation set, compute hit-rate@{1,3,5} and MRR, and produce a baseline that Lab 02's hybrid retriever must beat.

Setup

  • Lab 00 done (data/kb/grammar-rules/chunks.jsonl with ≥ 50 chunks).
  • sentence-transformers library, pinned. Pretrained model: paraphrase-multilingual-MiniLM-L12-v2 (handles EN+ES, ~118M parameters, a download of several hundred MB, runs on CPU in ~5s per 100 queries).
  • No CUDA needed.

Tasks

Part A — Build the eval set

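data/kb/grammar-rules/eval/queries.jsonl: 30 queries, each with the expected chunk_id(s). The file is JSON Lines, so each object occupies a single line in the real file; the two examples below are pretty-printed only for readability: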

{
  "query_id": "q-001",
  "language": "en",
  "query": "How do I conjugate 'work' for she in present tense?",
  "expected_chunks": ["en-pres-3sg-regular-s-rule-001"],
  "expected_answer_pattern": "works"
}
{
  "query_id": "q-002",
  "language": "es",
  "query": "¿Cuál es el pasado de 'go' en inglés?",
  "expected_chunks": ["en-past-all-irregular-go-001"],
  "expected_answer_pattern": "went"
}

Required fields:

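  • query_id: unique identifier (for example q-001); it keys the per-query results.
  • language: language of the query text, en or es.
  • query: 1-2 sentence natural-language question. EN or ES.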
  • expected_chunks: 1-3 chunk_ids that should be retrieved. Multiple if the question is genuinely ambiguous or could draw from multiple rules.
  • expected_answer_pattern: substring that the generated answer should contain (used in Lab 03). Lab 01 ignores this field.

Coverage: ≥ 10 EN queries, ≥ 10 ES queries, ≥ 10 cross-language (EN query about ES form or vice versa). Mix of easy (verbatim-keyword match) and hard (paraphrased) queries.
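
A quick structural check on the eval file can catch missing fields or typos in chunk ids before any embedding runs. The following is a minimal sketch, not part of the lab's required code: validate_queries and its known_chunk_ids parameter are hypothetical names. It checks field presence and the 1-3 expected-chunk rule and counts queries per language. It does not detect cross-language queries, since the schema above has no field for that.

```python
import json
from collections import Counter

REQUIRED = ("query_id", "language", "query",
            "expected_chunks", "expected_answer_pattern")

def validate_queries(lines, known_chunk_ids=None):
    """Parse JSONL lines; return (queries, problems, per-language counts).

    `known_chunk_ids` is an optional set of chunk ids from the KB; when given,
    expected_chunks entries that are not in it are reported as problems.
    """
    queries, problems = [], []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        q = json.loads(line)
        missing = [f for f in REQUIRED if f not in q]
        if missing:
            problems.append(f"line {n}: missing {missing}")
            continue
        if not 1 <= len(q["expected_chunks"]) <= 3:
            problems.append(f"line {n}: expected_chunks must have 1-3 ids")
        if known_chunk_ids is not None:
            unknown = [c for c in q["expected_chunks"] if c not in known_chunk_ids]
            if unknown:
                problems.append(f"line {n}: unknown chunk ids {unknown}")
        queries.append(q)
    counts = Counter(q["language"] for q in queries)
    return queries, problems, dict(counts)
```

Passing the open file handle as lines works, because iterating a text file yields one line at a time. An empty problems list means the file is structurally sound.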

Part B — Implement src/minirag/embed.py

import numpy as np
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self, model_name: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"):
        self.model = SentenceTransformer(model_name)
        self.dim = self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: list[str]) -> np.ndarray:
        """Returns (N, dim) float32 array; L2-normalized."""
        emb = self.model.encode(texts, normalize_embeddings=True,
                                convert_to_numpy=True, show_progress_bar=False)
        return emb.astype(np.float32)

L2-normalization is essential: for unit-length vectors, cosine similarity equals the dot product, so FlatIndex can rank with a single matrix multiply and skip the per-pair norm division.
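
To make the normalization point concrete, here is a small NumPy sketch with two hand-picked vectors (illustrative values, not model output). It shows that the raw dot product differs from the cosine similarity, while the dot product of the normalized vectors matches it.

```python
import numpy as np

a = np.array([3.0, 4.0])   # norm 5
b = np.array([4.0, 3.0])   # norm 5

def cosine(u, v):
    """Textbook cosine similarity: dot product divided by both norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Normalize each vector to unit length.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cos_raw = cosine(a, b)                 # about 0.96
dot_raw = float(a @ b)                 # 24.0, scaled by both norms
dot_normalized = float(a_hat @ b_hat)  # about 0.96, same as the cosine
```

With unnormalized embeddings, a plain dot product ranks by dot_raw, which rewards large norms as well as direction.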

Part C — Implement src/minirag/index.py (flat index)

import numpy as np

class FlatIndex:
    """Brute-force cosine similarity. For N=50 chunks, this is plenty."""
    def __init__(self, embeddings: np.ndarray, chunk_ids: list[str]):
        assert embeddings.ndim == 2
        self.emb = embeddings   # (N, D), L2-normalized
        self.ids = chunk_ids

    def search(self, query_emb: np.ndarray, k: int = 5) -> list[tuple[str, float]]:
        """Returns top-k (chunk_id, score) by cosine similarity."""
        # query_emb is (D,) or (1, D)
        if query_emb.ndim == 1:
            query_emb = query_emb[None, :]
        scores = self.emb @ query_emb.T   # (N, 1)
        scores = scores.squeeze(1)
        idx = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in idx]

For 50 chunks this is microseconds — no FAISS needed at this scale. (Theory 02 covered when IVF / HNSW become relevant.)

Part D — Implement src/minirag/retrieve.py

from .embed import Embedder
from .index import FlatIndex
from .chunk import load_chunks
from pathlib import Path

class DenseRetriever:
    def __init__(self, kb_path: Path, embedder: Embedder):
        chunks = load_chunks(kb_path)
        self.chunks = {c.chunk_id: c for c in chunks}
        texts = [self._chunk_text(c) for c in chunks]
        emb = embedder.embed_texts(texts)
        self.index = FlatIndex(emb, [c.chunk_id for c in chunks])
        self.embedder = embedder

    @staticmethod
    def _chunk_text(c) -> str:
        """The text the embedder sees: title + body."""
        return c.title + "\n" + c.body

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        qe = self.embedder.embed_texts([query])[0]
        return self.index.search(qe, k=k)

Part E — Evaluation

src/minirag/eval_retrieval.py:

import numpy as np

def hit_rate_at_k(results, expected_chunks, k):
    """results: list[(chunk_id, score)] sorted by score desc.
       expected_chunks: list[str].
       Returns 1 if any expected is in top-k, else 0."""
    top_k = {r[0] for r in results[:k]}
    return int(bool(top_k & set(expected_chunks)))

def mrr(results, expected_chunks):
    """Reciprocal rank of the first expected chunk in results."""
    expected = set(expected_chunks)
    for rank, (chunk_id, _) in enumerate(results, start=1):
        if chunk_id in expected:
            return 1.0 / rank
    return 0.0

def evaluate(retriever, queries, k_values=(1, 3, 5)):
    metrics = {f"hit@{k}": [] for k in k_values}
    metrics["mrr"] = []
    per_query = []
    for q in queries:
        results = retriever.search(q["query"], k=max(k_values))
        for k in k_values:
            metrics[f"hit@{k}"].append(
                hit_rate_at_k(results, q["expected_chunks"], k))
        metrics["mrr"].append(mrr(results, q["expected_chunks"]))
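        # Record every hit@k for this query; a hard-coded "hit@5" key would
        # raise KeyError whenever k_values does not contain 5.
        per_query.append({"query_id": q["query_id"], "results": results,
                          **{f"hit@{k}": metrics[f"hit@{k}"][-1] for k in k_values},
                          "mrr": metrics["mrr"][-1]})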
    aggregated = {m: float(np.mean(v)) for m, v in metrics.items()}
    return aggregated, per_query

Part F — Run and report

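import json
from pathlib import Path
from minirag.embed import Embedder            # assumes src/ is on the import path
from minirag.eval_retrieval import evaluate
from minirag.retrieve import DenseRetriever

embedder = Embedder()
retriever = DenseRetriever(Path("data/kb/grammar-rules/chunks.jsonl"), embedder)
queries = [json.loads(l) for l in Path("data/kb/grammar-rules/eval/queries.jsonl").read_text(encoding="utf-8").splitlines() if l.strip()]
agg, per_query = evaluate(retriever, queries)
print(agg)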

Expected target: hit@5 ≥ 0.65 for dense-alone baseline. (The phase DoD asks for ≥ 0.80 after hybrid.)

experiments/29-bi-encoder-baseline/REPORT.md:

  1. Eval set: 30 queries, coverage breakdown (EN/ES/cross).
  2. Embedder: model name, dim, normalization.
  3. Aggregate metrics: hit@1, hit@3, hit@5, MRR.
  4. Per-query results (sortable CSV).
  5. Failure analysis: list every query with hit@5 == 0 (roughly 5-10 queries if hit@5 lands near the 0.65 target). Common patterns? (Spanish queries failing on English-only chunks? Paraphrased queries that share no keywords?)
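
One possible way to produce the per-query CSV from item 4 is sketched below. It assumes the per_query list has the shape returned by evaluate in Part E (query_id, results, hit@5, mrr). The helper names per_query_rows and write_per_query_csv and the column layout are illustrative choices, not a required format.

```python
import csv
import io

FIELDS = ["query_id", "rank", "chunk_id", "score", "hit@5", "mrr"]

def per_query_rows(per_query):
    """Flatten the per_query list into one row per (query, rank)."""
    rows = []
    for entry in per_query:
        for rank, (chunk_id, score) in enumerate(entry["results"], start=1):
            rows.append({
                "query_id": entry["query_id"],
                "rank": rank,
                "chunk_id": chunk_id,
                "score": round(score, 4),
                "hit@5": entry["hit@5"],
                "mrr": entry["mrr"],
            })
    return rows

def write_per_query_csv(per_query, fh):
    """Write the flattened rows, with a header, to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(per_query_rows(per_query))

# In-memory demo with one synthetic query.
demo = [{"query_id": "q-001", "results": [("b", 0.9), ("a", 0.7)],
         "hit@5": 1, "mrr": 0.5}]
buf = io.StringIO()
write_per_query_csv(demo, buf)
```

For a real run, open experiments/29-bi-encoder-baseline/per_query.csv with newline="" and encoding="utf-8" and pass that handle instead of the StringIO buffer.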

Part G — Tests in tests/minirag/test_retrieve.py

  1. test_embedder_shape: embed_texts(["hello", "world"]) returns shape (2, 384) (or whatever dim the pinned model reports).
  2. test_embedder_normalized: every row has L2 norm 1.0 ± 1e-5.
  3. test_flat_index_recall: embedding a chunk and searching with the same text retrieves it as top-1.
  4. test_hit_rate_at_k: synthetic: with expected=['a'] and results=[('b',0.9),('a',0.7),('c',0.5)], hit@1=0, hit@2=1.
  5. test_mrr: same setup: mrr = 1/2 = 0.5.
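
Tests 4 and 5 need no model, so they can be sketched in full. The two metric functions are copied from Part E only so that the block runs standalone; in the real test file they would presumably be imported from the eval_retrieval module instead.

```python
def hit_rate_at_k(results, expected_chunks, k):
    """1 if any expected chunk id appears in the top-k results, else 0."""
    top_k = {chunk_id for chunk_id, _ in results[:k]}
    return int(bool(top_k & set(expected_chunks)))

def mrr(results, expected_chunks):
    """Reciprocal rank of the first expected chunk; 0.0 if none is present."""
    expected = set(expected_chunks)
    for rank, (chunk_id, _) in enumerate(results, start=1):
        if chunk_id in expected:
            return 1.0 / rank
    return 0.0

# Synthetic ranking from the task description: 'a' sits at rank 2.
RESULTS = [("b", 0.9), ("a", 0.7), ("c", 0.5)]

def test_hit_rate_at_k():
    assert hit_rate_at_k(RESULTS, ["a"], k=1) == 0
    assert hit_rate_at_k(RESULTS, ["a"], k=2) == 1

def test_mrr():
    assert mrr(RESULTS, ["a"]) == 0.5
    assert mrr(RESULTS, ["missing"]) == 0.0
```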

Deliverable

experiments/29-bi-encoder-baseline/:

  • REPORT.md: items above.
  • metrics.json: aggregate metrics.
  • per_query.csv: per-query results.
  • failure_analysis.md: list of failing queries with hypothesized cause.
  • manifest.json: model name, eval set hash, code commit.
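
The manifest's eval set hash can be computed with the standard library. The sketch below is one possible shape: build_manifest, sha256_of_file and the key names are hypothetical. The lab only specifies that the manifest records the model name, an eval set hash and the code commit. The commit is passed in as a string here rather than read from git.

```python
import hashlib
import json

def sha256_of_file(path):
    """Hex SHA-256 of a file's bytes, read in 64 KiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(model_name, eval_set_path, code_commit):
    """Collect the three facts the deliverable asks for into one dict."""
    return {
        "model_name": model_name,
        "eval_set_path": str(eval_set_path),
        "eval_set_sha256": sha256_of_file(eval_set_path),
        "code_commit": code_commit,
    }

def write_manifest(manifest, out_path):
    """Serialize the manifest as indented JSON."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
```

Hashing the file bytes means any edit to queries.jsonl, even a whitespace change, produces a different manifest. That makes it easy to tell whether two metric files were computed on the same eval set.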

Acceptance

  • All 5 tests pass.
  • hit@5 ≥ 0.50 on the eval set (loose target; 0.65 is the comfortable bar).
  • Per-query CSV exists with rank-and-score for every query.
  • Failure analysis lists at least 3 failure patterns.

Pitfalls

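  • Embedder downloaded uncached. The first run downloads the model weights (several hundred MB). Set HF_HOME to a fixed cache directory so reruns reuse the download, and record it in the manifest together with the model name so runs are reproducible.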
  • L2-normalization missing. Without it, cosine similarity ≠ dot product, and FlatIndex.search returns wrong rankings. Always pass normalize_embeddings=True.
  • Embedding the chunk_id, not the chunk body. A common bug. Embed title + body, not the metadata.
  • Evaluation set leaks from KB. If your queries literally use the chunk titles, hit-rate will be artificially high. Paraphrase queries.
  • Spanish queries with EN embedder. Some embedders are EN-only. The recommended multilingual model handles both — confirm by sanity check (embedding "hello" and "hola" should give similar but not identical vectors).

Stretch

  • Try a larger embedder (paraphrase-multilingual-mpnet-base-v2, roughly 1GB, 768 dim) and compare. Diminishing returns at this scale.
  • Try a smaller embedder (all-MiniLM-L6-v2, EN-only). Quantify the loss on Spanish queries.
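  • Replace cosine similarity with L2 distance and verify that the rankings do not change. For L2-normalized vectors, ||a − b||² = 2 − 2·(a·b), so sorting by ascending L2 distance gives the same order as sorting by descending cosine similarity. A changed ranking means normalization is missing somewhere.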
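
The relationship between L2 distance and cosine similarity on unit vectors can be checked numerically without a model. The sketch below uses random unit vectors as stand-ins for chunk and query embeddings; the 50 × 384 shape only mirrors this lab's setup. For unit-length vectors the squared L2 distance equals 2 − 2·cos, so the two orderings should coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "chunk embeddings": 50 random vectors, L2-normalized row by row.
emb = rng.normal(size=(50, 384))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Stand-in "query embedding", also unit length.
q = rng.normal(size=384)
q /= np.linalg.norm(q)

cos_scores = emb @ q                        # higher is better
l2_dists = np.linalg.norm(emb - q, axis=1)  # lower is better

rank_by_cosine = np.argsort(-cos_scores)
rank_by_l2 = np.argsort(l2_dists)

same_order = bool(np.array_equal(rank_by_cosine, rank_by_l2))
identity_holds = bool(np.allclose(l2_dists ** 2, 2 - 2 * cos_scores))
```

Removing the two normalization steps breaks the identity, and the two orderings are then free to diverge.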

Next lab: lab/02-bm25-and-hybrid.md.