Lab 03 — End-to-end RAG: reader + CLI + faithfulness

You connect the hybrid retriever to a reader: the grammar MiniGPT (Phase 17/24) with the Phase 28 LoRA if you have it, or the base model if not. The query goes in, the chunks come out, and the reader generates the answer with citations. You measure faithfulness: how many answers are supported by real chunks rather than fabricated rules.

Objective

Build the full verb-tutor CLI: it takes a natural-language query, calls the hybrid retriever, passes the top-k chunks and the query to the MiniGPT reader, and returns an answer plus a list of cited chunk_ids. Evaluate end-to-end faithfulness on 30 queries.

Setup

  • Labs 00-02 done.
  • The grammar MiniGPT from Phase 17/24. Phase 28's LoRA adapter is optional; without it, the base model works (less polished answers, same pipeline).
  • theory/04-evaluation.md for faithfulness definition.

Tasks

Part A — The prompt template

data/kb/grammar-rules/reader_prompt.txt:

You are a bilingual grammar tutor. Use the rules below to answer the
student's question. Cite the rule by its chunk_id in square brackets,
like [chunk_id]. If the rules do not contain the answer, say so plainly
("I don't know from the provided rules") rather than guess.

Rules:
{rules_section}

Question: {query}

Answer:

Where {rules_section} is the concatenated top-k chunks formatted as:

[en-pres-3sg-regular-s-rule-001] Present simple, 3rd-person singular: add -s
  In English present simple, the 3rd-person singular form of a regular verb adds -s...

[en-past-all-irregular-go-001] Irregular past: go → went
  The verb "go" has an irregular past form "went"...

This template is frozen. Changes invalidate cross-run comparisons.
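One way to make "frozen" checkable is to fingerprint the template text and store the digest with each run, so two runs can be compared only when their digests match. The sketch below is an assumption, not part of the lab's required code: the helper names `template_fingerprint` and `check_placeholders` are hypothetical, and where you store the digest (for example in manifest.json, whose schema this lab does not specify) is up to you.

```python
import hashlib


def template_fingerprint(template_text: str) -> str:
    """Return a SHA-256 hex digest of the prompt template text."""
    # Normalize Windows line endings so the same template hashes
    # identically regardless of the OS that checked out the file.
    normalized = template_text.replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def check_placeholders(template_text: str) -> None:
    """Raise ValueError if a placeholder the pipeline fills in is missing."""
    for name in ("{rules_section}", "{query}"):
        if name not in template_text:
            raise ValueError(f"template is missing placeholder {name}")


# Shortened stand-in for reader_prompt.txt, used only for illustration.
template = "Rules:\n{rules_section}\n\nQuestion: {query}\n\nAnswer:"
check_placeholders(template)
print(template_fingerprint(template)[:12])  # short prefix for logs
```

If the digest recorded for an earlier run differs from the current one, the faithfulness numbers of the two runs should not be compared directly.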

Part B — Implement src/minirag/generate.py

from pathlib import Path
from .retrieve import HybridRetriever

class Reader:
    """Wraps the grammar MiniGPT (Phase 17/24) with a prompt-based interface.

    Phase 29 doesn't fine-tune the reader — that's Phase 28's LoRA. The reader
    here accepts a fully-formatted prompt and returns generated text.
    """
    def __init__(self, model_path: Path, max_new_tokens: int = 128):
        # Load the Phase-24 PyTorch port, or the Phase-17 NumPy model
        self.model = ...   # load
        self.tokenizer = ...
        self.max_new_tokens = max_new_tokens

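    def generate(self, prompt: str) -> str:
        # Encode the prompt; ids has shape (1, prompt_len).
        ids = self.tokenizer.encode(prompt, return_tensors="pt")
        # Greedy decoding keeps runs reproducible. temperature is omitted
        # because it has no effect when do_sample=False.
        out = self.model.generate(ids, max_new_tokens=self.max_new_tokens,
                                  do_sample=False)
        # Drop the echoed prompt tokens and decode only the continuation.
        return self.tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)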


class RAGPipeline:
    def __init__(self, retriever: HybridRetriever, reader: Reader,
                 prompt_template: str, k: int = 5):
        self.retriever = retriever
        self.reader = reader
        self.template = prompt_template
        self.k = k
        self.chunks = retriever.dense.chunks   # chunk_id → Chunk

    def answer(self, query: str) -> dict:
        results = self.retriever.search(query, k=self.k)
        rules_section = self._format_rules(results)
        prompt = self.template.format(rules_section=rules_section, query=query)
        text = self.reader.generate(prompt)
        cited = self._extract_citations(text)
        return {
            "query": query,
            "retrieved": [{"chunk_id": cid, "score": s} for cid, s in results],
            "answer": text,
            "cited_chunks": cited,
        }

    def _format_rules(self, results):
        lines = []
        for cid, _ in results:
            c = self.chunks[cid]
            lines.append(f"[{cid}] {c.title}\n  {c.body}")
        return "\n\n".join(lines)

    @staticmethod
    def _extract_citations(text: str) -> list[str]:
        import re
        return re.findall(r"\[([a-z][\w-]+)\]", text)

Part C — The CLI

src/minirag/cli.py:

import argparse
import json
from pathlib import Path

from .bm25 import BM25Retriever
from .embed import Embedder
from .generate import RAGPipeline, Reader
from .retrieve import DenseRetriever, HybridRetriever

def main():
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="cmd", required=True)  # fail loudly when no subcommand is given
    ask = sub.add_parser("ask")
    ask.add_argument("query", type=str)
    ask.add_argument("--k", type=int, default=5)
    args = parser.parse_args()

    if args.cmd == "ask":
        embedder = Embedder()
        dense = DenseRetriever(Path("data/kb/grammar-rules/chunks.jsonl"), embedder)
        bm25 = BM25Retriever(Path("data/kb/grammar-rules/chunks.jsonl"))
        retriever = HybridRetriever(dense, bm25)
        reader = Reader(Path("models/grammar-minigpt"))
        template = Path("data/kb/grammar-rules/reader_prompt.txt").read_text(encoding="utf-8")
        pipeline = RAGPipeline(retriever, reader, template, k=args.k)
        out = pipeline.answer(args.query)
        print(json.dumps(out, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    main()

Registered as verb-tutor in pyproject.toml:

[project.scripts]
verb-tutor = "minirag.cli:main"

Part D — Faithfulness evaluation

data/kb/grammar-rules/eval/faithfulness.jsonl — 30 queries, each with:

{
  "query_id": "q-001",
  "query": "How do I conjugate 'work' for she in present tense?",
  "expected_chunks": ["en-pres-3sg-regular-s-rule-001"],
  "expected_answer_pattern": "works",
  "must_cite": ["en-pres-3sg-regular-s-rule-001"]
}

Faithfulness has two sub-metrics:

  1. Citation correctness: the model cites at least one chunk from must_cite. Score 1 if yes, 0 if no.
  2. Pattern match: the model's answer contains expected_answer_pattern (case-insensitive substring). Score 1 if yes, 0 if no.

Faithfulness score = mean over queries of (citation_correct AND pattern_match).

DoD: faithfulness ≥ 0.70.

def evaluate_faithfulness(pipeline, queries):
    rows = []
    for q in queries:
        out = pipeline.answer(q["query"])
        citation_ok = any(c in q["must_cite"] for c in out["cited_chunks"])
        pattern_ok = q["expected_answer_pattern"].lower() in out["answer"].lower()
        rows.append({
            "query_id": q["query_id"],
            "answer": out["answer"],
            "cited_chunks": out["cited_chunks"],
            "citation_ok": int(citation_ok),
            "pattern_ok": int(pattern_ok),
            "faithful": int(citation_ok and pattern_ok),
        })
    return rows
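The per-query rows still need to be reduced to the aggregate numbers that go into metrics.json and that test_faithfulness_aggregator checks. A minimal sketch follows, assuming rows shaped like the output of evaluate_faithfulness above; the function name `aggregate_faithfulness` and the output key names are hypothetical choices, not something the lab prescribes.

```python
def aggregate_faithfulness(rows: list[dict]) -> dict:
    """Reduce per-query verdicts to mean citation, pattern, and faithfulness scores."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to aggregate")
    return {
        "n_queries": n,
        # Each field is 0 or 1, so the mean is the fraction of queries that passed.
        "citation_accuracy": sum(r["citation_ok"] for r in rows) / n,
        "pattern_match": sum(r["pattern_ok"] for r in rows) / n,
        "faithfulness": sum(r["faithful"] for r in rows) / n,
    }


# Hand-made verdicts for illustration only; these are not real results.
example_rows = [
    {"citation_ok": 1, "pattern_ok": 1, "faithful": 1},
    {"citation_ok": 1, "pattern_ok": 0, "faithful": 0},
    {"citation_ok": 0, "pattern_ok": 1, "faithful": 0},
    {"citation_ok": 1, "pattern_ok": 1, "faithful": 1},
]
metrics = aggregate_faithfulness(example_rows)
meets_dod = metrics["faithfulness"] >= 0.70  # DoD threshold from this lab
print(metrics, meets_dod)
```

In this toy input both sub-metrics are 0.75 while faithfulness is only 0.5, which shows why the AND combination is stricter than either sub-metric alone.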

Part E — Manual qualitative review

After the metric is computed, read the 5-10 worst answers. Categorize the failure modes:

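  • Wrong cite, right answer: reader produced the right form but cited the wrong rule. Often a retriever issue (the correct chunk never reached the top-k); if a must_cite chunk was retrieved and the reader still cited another one, it is a reader issue instead.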
  • Right cite, wrong answer: reader had the right rule but generated incorrect text. Reader issue.
  • No cite, no answer: reader said "I don't know" or refused. Could be correct behavior (KB doesn't have the rule) — verify manually.
  • Wrong cite, wrong answer: both broken. Worst case.
  • Hallucinated cite: reader cited a chunk_id that doesn't exist in the KB. Bad. Should never happen with a well-prompted reader.

Document in experiments/29-e2e/FAILURE_MODES.md.
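A first-pass bucketing of the rows can speed up the manual review, although it does not replace reading the answers. The sketch below assumes rows shaped like the output of evaluate_faithfulness and a set of all chunk_ids in the KB; the function name `classify_failure` and the category labels are hypothetical. Note one simplification: an answer with the right pattern but no citation at all lands in the "wrong cite, right answer" bucket.

```python
from collections import Counter


def classify_failure(row: dict, kb_ids: set[str]) -> str:
    """Assign one per-query row to a failure-mode bucket (or 'ok')."""
    cited = row["cited_chunks"]
    # Check fabricated ids first: they are a hard failure even when the
    # row would otherwise count as faithful.
    if any(c not in kb_ids for c in cited):
        return "hallucinated_cite"
    if row["faithful"]:
        return "ok"
    if not cited and not row["pattern_ok"]:
        return "no_cite_no_answer"  # possibly a correct refusal; verify manually
    if row["citation_ok"]:
        return "right_cite_wrong_answer"
    if row["pattern_ok"]:
        return "wrong_cite_right_answer"  # also covers right answer with no cite
    return "wrong_cite_wrong_answer"


# Toy KB and rows for illustration only.
kb_ids = {"en-pres-3sg-regular-s-rule-001", "en-past-all-irregular-go-001"}
rows = [
    {"cited_chunks": ["en-pres-3sg-regular-s-rule-001"],
     "citation_ok": 1, "pattern_ok": 1, "faithful": 1},
    {"cited_chunks": ["en-pres-3sg-make-001"],
     "citation_ok": 0, "pattern_ok": 1, "faithful": 0},
    {"cited_chunks": [], "citation_ok": 0, "pattern_ok": 0, "faithful": 0},
]
print(Counter(classify_failure(r, kb_ids) for r in rows))
```

The resulting counts can seed the failure modes table, with the manual review deciding which "no cite, no answer" rows are actually correct refusals.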

Part F — End-to-end report

experiments/29-e2e/REPORT.md:

  1. Pipeline diagram (query → retrieve → format prompt → reader → parse cites → answer).
  2. Aggregate metrics: hit@5, MRR (from retriever), citation accuracy, pattern match, faithfulness.
  3. Side-by-side: 3 queries with full retrieved chunks + generated answer + faithfulness verdict.
  4. Failure modes table (Part E).
  5. One paragraph: "what works, what doesn't, and what I'd change."
  6. What I would not change: explicitly call out 1-2 things the user might think need fixing but actually behaved correctly (e.g., "the model correctly refused to answer queries outside §A13 scope — that's working as intended").

Part G — Tests in tests/minirag/test_e2e.py

  1. test_prompt_format — given a fixture retriever returning 2 chunks, the rendered prompt contains both chunk titles and both bodies.
  2. test_extract_citations — given a text "... see [en-pres-3sg-rule-001] and [es-past-001] ...", _extract_citations returns ["en-pres-3sg-rule-001", "es-past-001"].
  3. test_faithfulness_aggregator — given known per-query verdicts, the aggregate is computed correctly.
  4. test_cli_smoke — run verb-tutor ask "test" and verify the JSON output has the four required keys (query, retrieved, answer, cited_chunks).
  5. test_no_hallucinated_citation — on the 30 eval queries, every citation in the output exists in the KB's chunk_ids. (Hard fail if not.)
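Tests 2 and 5 can be sketched without loading the model. The block below is a standalone illustration: it copies the citation regex from Part B into a module-level helper instead of importing RAGPipeline, and `hallucinated_citations` is a hypothetical helper name. In the real test file you would import from minirag.generate and feed test 5 the cited_chunks from the 30 eval queries.

```python
import re

# Same pattern as RAGPipeline._extract_citations in Part B.
CITATION_RE = re.compile(r"\[([a-z][\w-]+)\]")


def extract_citations(text: str) -> list[str]:
    """Return every bracketed id in order of appearance."""
    return CITATION_RE.findall(text)


def hallucinated_citations(cited: list[str], kb_ids: set[str]) -> list[str]:
    """Return the cited ids that do not exist in the KB."""
    return [c for c in cited if c not in kb_ids]


def test_extract_citations():
    text = "... see [en-pres-3sg-rule-001] and [es-past-001] ..."
    assert extract_citations(text) == ["en-pres-3sg-rule-001", "es-past-001"]


def test_ignores_non_id_brackets():
    # Neither a number nor a capitalized aside looks like a chunk_id.
    assert extract_citations("see [1] and [Note: aside]") == []


def test_no_hallucinated_citation():
    kb_ids = {"es-past-001"}
    cited = ["en-pres-3sg-make-001", "es-past-001"]
    # The fabricated id is reported; the real test asserts this list is empty.
    assert hallucinated_citations(cited, kb_ids) == ["en-pres-3sg-make-001"]
```

Because \w admits underscores, the literal placeholder [chunk_id] from the prompt would also be extracted if the reader echoes it, and the hallucination check would then flag it since no such chunk exists.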

Deliverable

experiments/29-e2e/:

  • REPORT.md: full pipeline report.
  • FAILURE_MODES.md: categorized failures.
  • metrics.json: aggregate numbers.
  • per_query.jsonl: full per-query records (query, retrieved, answer, cites, scores).
  • manifest.json.

learners/borja/phase-29/reflections.md — 300-500 words on what RAG bought you compared to a closed-book model trained alone, and where the pipeline shows its seams.

Acceptance

  • All 5 tests pass.
  • Faithfulness ≥ 0.70 (DoD target).
  • Zero hallucinated citations across the 30 eval queries.
  • verb-tutor ask "what's the past participle of eat?" returns a sensible answer with a real chunk cited.

Pitfalls

  • Reader hallucinating chunk_ids. If the model fabricates ids (e.g., cites [en-pres-3sg-make-001] when no such chunk exists), your prompt isn't constraining hard enough. Add the explicit instruction: "Cite chunk_ids exactly as shown. Do not invent new chunk_ids." Because the template is frozen, treat this as a new template version and re-run any runs you want to compare against.
  • Prompt too long for the model's context. Grammar MiniGPT (Phase 17) was trained on short sequences. With 5 chunks × 600 chars + query + boilerplate, you may exceed the model's positional encoding range. Drop to k=3 or use a chunk-summary mode.
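  • Citation regex too permissive. A regex like \[(.+?)\] matches any bracketed text, such as [see note 3] or [1]. Constrain it: \[([a-z][\w-]+)\] matches only ids that start with a lowercase letter and continue with word characters or hyphens. Since \w also admits uppercase letters, digits, and underscores, this is looser than strict kebab-case, so still validate extracted ids against the KB.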
  • Faithfulness measured purely by citation, not answer correctness. If the two sub-metrics were averaged, a model that cites the right chunk but produces an irrelevant answer would still score 0.5 (citation_ok=1, pattern_ok=0). The AND combination prevents that: a query scores 1 only when both hold.
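  • Generated text is deterministic but vacuous. With do_sample=False, the reader can produce empty or one-word answers. Set min_new_tokens (if your generate implementation supports it) to encourage substantive output, or sample at low temperature (T=0.3). Sampling gives up run-to-run determinism, so fix the random seed and record it.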
  • must_cite is too strict. Some queries genuinely have multiple correct citations (e.g., a question about "go" could cite either the past-form chunk or the past-participle chunk). Let must_cite hold several chunk_ids and define citation_ok as "any overlap between the cited chunks and must_cite".
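The prompt-length pitfall above can be handled by trimming the lowest-ranked chunks until the rendered prompt fits, instead of hard-coding k=3. The following is a sketch under stated assumptions: `fit_rules_to_budget` and `count_tokens_whitespace` are hypothetical names, and the whitespace counter is only a stand-in for the real tokenizer's length function, which you should pass in as `count_tokens`. The budget should be the model's context length minus max_new_tokens.

```python
from typing import Callable


def count_tokens_whitespace(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_rules_to_budget(
    rules: list[str],
    query: str,
    template: str,
    max_prompt_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> list[str]:
    """Keep the longest prefix of ranked rules whose rendered prompt fits the budget."""
    kept: list[str] = []
    for rule in rules:  # rules are assumed to be sorted best-first
        candidate = kept + [rule]
        prompt = template.format(rules_section="\n\n".join(candidate), query=query)
        if count_tokens(prompt) > max_prompt_tokens:
            break  # adding this rule would overflow, so stop here
        kept = candidate
    return kept


# Toy template and rules for illustration only.
template = "Rules:\n{rules_section}\n\nQuestion: {query}\n\nAnswer:"
rules = [
    "[a-001] one two three",
    "[b-002] four five six",
    "[c-003] seven eight nine",
]
kept = fit_rules_to_budget(rules, "what now", template, max_prompt_tokens=14)
print(len(kept))
```

Dropping from the tail preserves the retriever's ranking, so the chunks most likely to contain the answer are the ones that survive.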

Stretch

  • Add a cross-encoder reranker (theory 03) between hybrid retrieval and reader. Measure the faithfulness lift.
  • Implement HyDE (Hypothetical Document Embeddings): generate a fake answer first, embed it, retrieve with the fake. Compare to standard query embedding.
  • Add a refusal-quality metric: for queries that the KB can't answer (deliberately out-of-scope), the model should refuse. Score correct refusals.
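The refusal-quality idea can be sketched with a simple string check against the refusal phrase from the prompt template. Everything beyond that phrase is an assumption: the `answerable` field, the function names, and the two reported rates are hypothetical, and a substring test will miss refusals the reader phrases differently.

```python
# Refusal phrase taken from the reader prompt, lowercased for matching.
REFUSAL_MARKER = "i don't know from the provided rules"


def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal phrase from the prompt."""
    # Treat a typographic apostrophe the same as a straight one.
    normalized = answer.lower().replace("\u2019", "'")
    return REFUSAL_MARKER in normalized


def refusal_metrics(rows: list[dict]) -> dict:
    """rows need 'answer' (str) and 'answerable' (bool, hypothetical label)."""
    out_of_scope = [r for r in rows if not r["answerable"]]
    in_scope = [r for r in rows if r["answerable"]]
    correct = sum(is_refusal(r["answer"]) for r in out_of_scope)
    false_refusals = sum(is_refusal(r["answer"]) for r in in_scope)
    return {
        # Share of out-of-scope queries the reader correctly refused.
        "refusal_recall": correct / len(out_of_scope) if out_of_scope else None,
        # Share of answerable queries the reader wrongly refused.
        "false_refusal_rate": false_refusals / len(in_scope) if in_scope else None,
    }


# Toy rows for illustration only; these are not real model outputs.
rows = [
    {"answerable": False, "answer": "I don't know from the provided rules."},
    {"answerable": False, "answer": "The answer is 'goed'."},
    {"answerable": True, "answer": "She works. [en-pres-3sg-regular-s-rule-001]"},
    {"answerable": True, "answer": "went [en-past-all-irregular-go-001]"},
]
print(refusal_metrics(rows))
```

Reporting the false refusal rate alongside recall guards against a reader that scores well simply by refusing everything.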

End of Phase 29 labs. Time to write PHASE_29_REPORT.md and prep for Phase 30.

Next: Phase 30 — Structured Generation & Constrained Decoding.