Lab 03 — End-to-end RAG: reader + CLI + faithfulness¶
You connect the hybrid retriever to a reader: the grammar MiniGPT (Phase 17/24) with the Phase 28 LoRA if you have it, or the base model if not. The query goes in, the chunks come out, and the reader generates the answer with citations. You measure faithfulness: how many answers are grounded in real chunks rather than fabricated rules.
Objective¶
Build the full verb-tutor CLI: takes a natural-language query, calls the hybrid retriever, passes the top-k chunks + query to the MiniGPT reader, returns an answer plus a list of cited chunk_ids. Evaluate end-to-end faithfulness on 30 queries.
Setup¶
- Labs 00-02 done.
- The grammar MiniGPT from Phase 17/24. Phase 28's LoRA adapter is optional; without it, the base model works (less polished answers, same pipeline).
- theory/04-evaluation.md for the faithfulness definition.
Tasks¶
Part A — The prompt template¶
data/kb/grammar-rules/reader_prompt.txt:
You are a bilingual grammar tutor. Use the rules below to answer the
student's question. Cite the rule by its chunk_id in square brackets,
like [chunk_id]. If the rules do not contain the answer, say so plainly
("I don't know from the provided rules") rather than guess.
Rules:
{rules_section}
Question: {query}
Answer:
Where {rules_section} is the concatenated top-k chunks formatted as:
[en-pres-3sg-regular-s-rule-001] Present simple, 3rd-person singular: add -s
In English present simple, the 3rd-person singular form of a regular verb adds -s...
[en-past-all-irregular-go-001] Irregular past: go → went
The verb "go" has an irregular past form "went"...
This template is frozen. Changes invalidate cross-run comparisons.
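Since any edit to the template invalidates cross-run comparisons, one way to make accidental edits visible is to record a fingerprint of the template with each run. The sketch below is only an illustration under assumptions: the helper names `template_fingerprint` and `template_fields` are hypothetical, and the short stand-in string takes the place of the real contents of reader_prompt.txt.

```python
import hashlib
import string


def template_fingerprint(template: str) -> str:
    """Return a short, stable hash of the prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def template_fields(template: str) -> set[str]:
    """Return the placeholder names that str.format() will expect."""
    parsed = string.Formatter().parse(template)
    return {name for _, name, _, _ in parsed if name}


# Stand-in text; in the lab this would be the contents of reader_prompt.txt.
template = "Rules:\n{rules_section}\nQuestion: {query}\nAnswer:"

# The template should expose exactly the two placeholders the pipeline fills.
assert template_fields(template) == {"rules_section", "query"}

fp_before = template_fingerprint(template)
fp_after = template_fingerprint(template + " ")  # even a trailing space changes it
print(fp_before, fp_before != fp_after)
```

Storing the fingerprint next to the run's metrics (for instance in the manifest) would let you confirm later that two runs used the same template.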
Part B — Implement src/minirag/generate.py¶
import re
from pathlib import Path

from .retrieve import HybridRetriever


class Reader:
    """Wraps the grammar MiniGPT (Phase 17/24) with a prompt-based interface.

    Phase 29 doesn't fine-tune the reader; that's Phase 28's LoRA. The reader
    here accepts a fully-formatted prompt and returns generated text.
    """

    def __init__(self, model_path: Path, max_new_tokens: int = 128):
        # Load the Phase-24 PyTorch port, or the Phase-17 NumPy model
        self.model = ...  # load
        self.tokenizer = ...
        self.max_new_tokens = max_new_tokens

    def generate(self, prompt: str) -> str:
        ids = self.tokenizer.encode(prompt, return_tensors="pt")
        out = self.model.generate(ids, max_new_tokens=self.max_new_tokens,
                                  do_sample=False, temperature=1.0)
        # Decode only the newly generated tokens, not the echoed prompt
        return self.tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)


class RAGPipeline:
    def __init__(self, retriever: HybridRetriever, reader: Reader,
                 prompt_template: str, k: int = 5):
        self.retriever = retriever
        self.reader = reader
        self.template = prompt_template
        self.k = k
        self.chunks = retriever.dense.chunks  # chunk_id → Chunk

    def answer(self, query: str) -> dict:
        results = self.retriever.search(query, k=self.k)
        rules_section = self._format_rules(results)
        prompt = self.template.format(rules_section=rules_section, query=query)
        text = self.reader.generate(prompt)
        cited = self._extract_citations(text)
        return {
            "query": query,
            "retrieved": [{"chunk_id": cid, "score": s} for cid, s in results],
            "answer": text,
            "cited_chunks": cited,
        }

    def _format_rules(self, results):
        lines = []
        for cid, _ in results:
            c = self.chunks[cid]
            lines.append(f"[{cid}] {c.title}\n {c.body}")
        return "\n\n".join(lines)

    @staticmethod
    def _extract_citations(text: str) -> list[str]:
        return re.findall(r"\[([a-z][\w-]+)\]", text)
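The citation regex only checks the shape of a bracketed token, not whether the id exists. As a hedged illustration, the sketch below splits extracted tokens into ids found in the KB and ids that are not; `split_citations` and the `kb_ids` set are hypothetical names introduced here, and the pattern is copied from `_extract_citations`.

```python
import re

CITATION_RE = re.compile(r"\[([a-z][\w-]+)\]")  # same pattern as _extract_citations


def split_citations(text: str, kb_ids: set[str]) -> tuple[list[str], list[str]]:
    """Split bracketed tokens into ids that exist in the KB and ids that do not."""
    valid, unknown = [], []
    for cid in CITATION_RE.findall(text):
        (valid if cid in kb_ids else unknown).append(cid)
    return valid, unknown


# Hypothetical KB ids, taken from the examples in Part A.
kb_ids = {"en-pres-3sg-regular-s-rule-001", "en-past-all-irregular-go-001"}
text = "Add -s [en-pres-3sg-regular-s-rule-001]; compare [en-pres-3sg-make-001] [sic]."

valid, unknown = split_citations(text, kb_ids)
print(valid)    # ['en-pres-3sg-regular-s-rule-001']
print(unknown)  # ['en-pres-3sg-make-001', 'sic']
```

Note that a bracketed word such as `[sic]` also matches the pattern, which is one reason to check matches against the KB rather than rely on the regex alone.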
Part C — The CLI¶
src/minirag/cli.py:
import argparse
import json
from pathlib import Path

from .embed import Embedder
from .retrieve import DenseRetriever, HybridRetriever
from .bm25 import BM25, BM25Retriever
from .generate import Reader, RAGPipeline
from .chunk import load_chunks


def main():
    parser = argparse.ArgumentParser()
    # required=True makes a bare `verb-tutor` print usage instead of exiting silently
    sub = parser.add_subparsers(dest="cmd", required=True)
    ask = sub.add_parser("ask")
    ask.add_argument("query", type=str)
    ask.add_argument("--k", type=int, default=5)
    args = parser.parse_args()

    if args.cmd == "ask":
        embedder = Embedder()
        dense = DenseRetriever(Path("data/kb/grammar-rules/chunks.jsonl"), embedder)
        bm25 = BM25Retriever(Path("data/kb/grammar-rules/chunks.jsonl"))
        retriever = HybridRetriever(dense, bm25)
        reader = Reader(Path("models/grammar-minigpt"))
        # Read as UTF-8 explicitly: the KB and prompt are bilingual
        template = Path("data/kb/grammar-rules/reader_prompt.txt").read_text(encoding="utf-8")
        pipeline = RAGPipeline(retriever, reader, template, k=args.k)
        out = pipeline.answer(args.query)
        print(json.dumps(out, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()
Register the CLI as verb-tutor in pyproject.toml, for example as a console-script entry that points at main() in src/minirag/cli.py.
Part D — Faithfulness evaluation¶
data/kb/grammar-rules/eval/faithfulness.jsonl — 30 queries, each with:
{
  "query_id": "q-001",
  "query": "How do I conjugate 'work' for she in present tense?",
  "expected_chunks": ["en-pres-3sg-regular-s-rule-001"],
  "expected_answer_pattern": "works",
  "must_cite": ["en-pres-3sg-regular-s-rule-001"]
}
Faithfulness has two sub-metrics:
- Citation correctness: the model cites at least one chunk from must_cite. Score 1 if yes, 0 if no.
- Pattern match: the model's answer contains expected_answer_pattern (case-insensitive substring). Score 1 if yes, 0 if no.
Faithfulness score = mean over queries of (citation_correct AND pattern_match).
DoD: faithfulness ≥ 0.70.
def evaluate_faithfulness(pipeline, queries):
    rows = []
    for q in queries:
        out = pipeline.answer(q["query"])
        citation_ok = any(c in q["must_cite"] for c in out["cited_chunks"])
        pattern_ok = q["expected_answer_pattern"].lower() in out["answer"].lower()
        rows.append({
            "query_id": q["query_id"],
            "answer": out["answer"],
            "cited_chunks": out["cited_chunks"],
            "citation_ok": int(citation_ok),
            "pattern_ok": int(pattern_ok),
            "faithful": int(citation_ok and pattern_ok),
        })
    return rows
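The per-query rows still need to be reduced to the aggregate score and compared against the DoD threshold. A minimal sketch of that step, assuming rows shaped like the output of `evaluate_faithfulness`; the function name `aggregate_faithfulness` and the output key names are assumptions, not part of the lab spec.

```python
def aggregate_faithfulness(rows: list[dict], target: float = 0.70) -> dict:
    """Average the per-query 0/1 verdicts and compare against the DoD target."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to aggregate")
    faithfulness = sum(r["faithful"] for r in rows) / n
    return {
        "n_queries": n,
        "citation_accuracy": sum(r["citation_ok"] for r in rows) / n,
        "pattern_match": sum(r["pattern_ok"] for r in rows) / n,
        "faithfulness": faithfulness,
        "meets_dod": faithfulness >= target,
    }


# Four made-up verdicts, for illustration only.
rows = [
    {"citation_ok": 1, "pattern_ok": 1, "faithful": 1},
    {"citation_ok": 1, "pattern_ok": 0, "faithful": 0},
    {"citation_ok": 0, "pattern_ok": 1, "faithful": 0},
    {"citation_ok": 1, "pattern_ok": 1, "faithful": 1},
]
summary = aggregate_faithfulness(rows)
print(summary)
```

In this toy input both sub-metrics are 0.75 while faithfulness is 0.5, which shows how the AND combination can sit below either sub-metric.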
Part E — Manual qualitative review¶
After the metric is computed, read the 5-10 worst answers. Categorize the failure modes:
- Wrong cite, right answer: reader produced the right form but cited the wrong rule. Usually a retriever issue: check whether the right rule was in the top-k at all.
- Right cite, wrong answer: reader had the right rule but generated incorrect text. Reader issue.
- No cite, no answer: reader said "I don't know" or refused. Could be correct behavior (KB doesn't have the rule) — verify manually.
- Wrong cite, wrong answer: both broken. Worst case.
- Hallucinated cite: reader cited a chunk_id that doesn't exist in the KB. Bad. Should never happen with a well-prompted reader.
Document in experiments/29-e2e/FAILURE_MODES.md.
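A first pass at the categorization can be automated from the per-query rows before you read the answers by hand. The sketch below is one possible mapping under assumptions: the label strings, the `classify_failure` name, and the short chunk ids are hypothetical, and a right answer with no citation at all is folded into the "wrong cite, right answer" bucket.

```python
from collections import Counter


def classify_failure(row: dict, kb_ids: set[str]) -> str:
    """Map one per-query row to a failure-mode label from Part E."""
    cited = row["cited_chunks"]
    # Checked first so a fabricated id is flagged even if the row is otherwise fine.
    if any(cid not in kb_ids for cid in cited):
        return "hallucinated_cite"
    if row["citation_ok"] and row["pattern_ok"]:
        return "faithful"
    if not cited and not row["pattern_ok"]:
        return "no_cite_no_answer"
    if row["pattern_ok"]:
        return "wrong_cite_right_answer"  # also covers a right answer with no cite
    if row["citation_ok"]:
        return "right_cite_wrong_answer"
    return "wrong_cite_wrong_answer"


# Hypothetical ids and verdicts, for illustration only.
kb_ids = {"a-001", "b-001"}
rows = [
    {"cited_chunks": ["a-001"], "citation_ok": 1, "pattern_ok": 1},
    {"cited_chunks": ["b-001"], "citation_ok": 0, "pattern_ok": 1},
    {"cited_chunks": ["a-001"], "citation_ok": 1, "pattern_ok": 0},
    {"cited_chunks": [], "citation_ok": 0, "pattern_ok": 0},
    {"cited_chunks": ["zz-999"], "citation_ok": 0, "pattern_ok": 0},
]
labels = [classify_failure(r, kb_ids) for r in rows]
print(Counter(labels))
```

The labels are only a starting point; the "no cite, no answer" rows in particular still need the manual check described above.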
Part F — End-to-end report¶
experiments/29-e2e/REPORT.md:
- Pipeline diagram (query → retrieve → format prompt → reader → parse cites → answer).
- Aggregate metrics: hit@5, MRR (from retriever), citation accuracy, pattern match, faithfulness.
- Side-by-side: 3 queries with full retrieved chunks + generated answer + faithfulness verdict.
- Failure modes table (Part E).
- One paragraph: "what works, what doesn't, and what I'd change."
- What I would not change: explicitly call out 1-2 things the user might think need fixing but actually behaved correctly (e.g., "the model correctly refused to answer queries outside §A13 scope — that's working as intended").
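For the retriever-side numbers in the report, the usual definitions of hit@k and MRR can be computed from the retrieved ids and each query's expected_chunks. This is a sketch under assumptions: the record layout (plain lists of chunk ids) and the function names are hypothetical, and the earlier labs may define MRR over a truncated list rather than the full ranking.

```python
def hit_at_k(retrieved_ids: list[str], expected_ids: list[str], k: int = 5) -> int:
    """1 if any expected chunk appears in the top-k retrieved ids, else 0."""
    return int(any(cid in expected_ids for cid in retrieved_ids[:k]))


def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """1/rank of the first expected chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in expected_ids:
            return 1.0 / rank
    return 0.0


def retrieval_metrics(records: list[dict], k: int = 5) -> dict:
    """Average hit@k and reciprocal rank over all records."""
    n = len(records)
    hits = sum(hit_at_k(r["retrieved"], r["expected_chunks"], k) for r in records)
    rr = sum(reciprocal_rank(r["retrieved"], r["expected_chunks"]) for r in records)
    return {f"hit@{k}": hits / n, "mrr": rr / n}


# Made-up records; in the lab, "retrieved" would come from
# [d["chunk_id"] for d in out["retrieved"]].
records = [
    {"retrieved": ["a", "b", "c"], "expected_chunks": ["a"]},
    {"retrieved": ["x", "y", "b"], "expected_chunks": ["b"]},
    {"retrieved": ["x", "y"], "expected_chunks": ["q"]},
]
metrics = retrieval_metrics(records)
print(metrics)
```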
Part G — Tests in tests/minirag/test_e2e.py¶
- test_prompt_format: given a fixture retriever returning 2 chunks, the rendered prompt contains both chunk titles and both bodies.
- test_extract_citations: given a text "... see [en-pres-3sg-rule-001] and [es-past-001] ...", _extract_citations returns ["en-pres-3sg-rule-001", "es-past-001"].
- test_faithfulness_aggregator: given known per-query verdicts, the aggregate is computed correctly.
- test_cli_smoke: run verb-tutor ask "test" and verify the JSON output has the four required keys (query, retrieved, answer, cited_chunks).
- test_no_hallucinated_citation: on the 30 eval queries, every citation in the output exists in the KB's chunk_ids. (Hard fail if not.)
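Two of these tests need neither the model nor the index, so they can be sketched with plain fixtures. The example below is an illustration under assumptions: `Chunk`, `format_rules`, and `extract_citations` are local stand-ins that mirror the lab's code so the sketch runs on its own; in the real test file you would import from minirag.generate instead.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:  # hypothetical stand-in for the lab's Chunk type
    title: str
    body: str


def format_rules(results, chunks):
    """Local mirror of RAGPipeline._format_rules, so the sketch runs standalone."""
    return "\n\n".join(
        f"[{cid}] {chunks[cid].title}\n {chunks[cid].body}" for cid, _ in results
    )


def extract_citations(text):
    """Local mirror of RAGPipeline._extract_citations."""
    return re.findall(r"\[([a-z][\w-]+)\]", text)


def test_prompt_format():
    chunks = {
        "en-pres-3sg-rule-001": Chunk("Present simple", "Add -s in the 3rd person."),
        "es-past-001": Chunk("Preterite", "Regular -ar verbs take -é."),
    }
    results = [("en-pres-3sg-rule-001", 0.9), ("es-past-001", 0.4)]
    template = "Rules:\n{rules_section}\nQuestion: {query}\nAnswer:"
    prompt = template.format(rules_section=format_rules(results, chunks), query="test")
    for chunk in chunks.values():
        assert chunk.title in prompt
        assert chunk.body in prompt


def test_extract_citations():
    text = "... see [en-pres-3sg-rule-001] and [es-past-001] ..."
    assert extract_citations(text) == ["en-pres-3sg-rule-001", "es-past-001"]
```

Both functions follow pytest's naming convention, so pytest would collect them as written.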
Deliverable¶
experiments/29-e2e/:
- REPORT.md — full pipeline report.
- FAILURE_MODES.md — categorized failures.
- metrics.json — aggregate numbers.
- per_query.jsonl — full per-query records (query, retrieved, answer, cites, scores).
- manifest.json.
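Writing metrics.json and per_query.jsonl is mostly a matter of serializing what the evaluation already produced. A minimal sketch, assuming one JSON object per line for the .jsonl file; `write_run_outputs` is a hypothetical helper, the example values are made up, and the contents of manifest.json are left out because the lab does not specify them here.

```python
import json
import tempfile
from pathlib import Path


def write_run_outputs(out_dir: Path, metrics: dict, rows: list[dict]) -> None:
    """Write aggregate metrics and per-query records into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metrics.json").write_text(
        json.dumps(metrics, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    with (out_dir / "per_query.jsonl").open("w", encoding="utf-8") as fh:
        for row in rows:
            # One compact JSON object per line; keep non-ASCII text readable.
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")


# Demonstration in a temporary directory, with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "29-e2e"
    write_run_outputs(
        out_dir,
        {"faithfulness": 0.5},
        [{"query_id": "q-001", "answer": "works", "faithful": 1},
         {"query_id": "q-002", "answer": "él trabajó", "faithful": 0}],
    )
    loaded_metrics = json.loads((out_dir / "metrics.json").read_text(encoding="utf-8"))
    lines = (out_dir / "per_query.jsonl").read_text(encoding="utf-8").splitlines()

print(loaded_metrics, len(lines))
```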
learners/borja/phase-29/reflections.md — 300-500 words on what RAG bought you compared to a closed-book model trained alone, and where the pipeline shows its seams.
Acceptance¶
- All 5 tests pass.
- Faithfulness ≥ 0.70 (DoD target).
- Zero hallucinated citations across the 30 eval queries.
- verb-tutor ask "what's the past participle of eat?" returns a sensible answer with a real chunk cited.
Pitfalls¶
- Reader hallucinating chunk_ids. If the model fabricates ids (e.g., cites [en-pres-3sg-make-001] when no such chunk exists), your prompt isn't constraining hard enough. Add the explicit instruction: "Cite chunk_ids exactly as shown. Do not invent new chunk_ids." Because the template is frozen, this counts as a template change: re-run any results you want to compare.
- Prompt too long for the model's context. Grammar MiniGPT (Phase 17) was trained on short sequences. With 5 chunks × 600 chars + query + boilerplate, you may exceed the model's positional encoding range. Drop to k=3 or use a chunk-summary mode.
- Citation regex too greedy. A regex like \[(.+?)\] will match [any-arbitrary-text]. Constrain it: \[([a-z][\w-]+)\] so that only id-shaped tokens (a lowercase letter followed by word characters or hyphens) match. This still accepts a bracketed word such as [sic], so validate matches against the KB's chunk_ids.
- Faithfulness measured purely by citation, not answer correctness. If the two sub-metrics were averaged, a model that cites the right chunk but produces an irrelevant answer would still score 0.5 (citation_ok=1, pattern_ok=0). The AND combination prevents that: both must hold.
- Generated text is deterministic but vacuous. With do_sample=False, the reader can produce empty or one-word answers. Set min_new_tokens to encourage substantive output, or sample at low temperature (T=0.3); if you sample, fix the seed so runs stay comparable.
- must_cite is too strict. Some queries genuinely have multiple valid citations (e.g., a question about "go" could cite either the past-form chunk or the past-participle chunk). Allow must_cite to be a set, and define citation_ok as "any overlap with must_cite".
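For the prompt-length pitfall, one option is to shrink k until the rendered prompt fits a token budget. The sketch below assumes a budget equal to the model's context length minus max_new_tokens; `fit_k_to_budget` is a hypothetical helper, and the whitespace-based `count_tokens` default is only a stand-in for counting with the reader's real tokenizer.

```python
def fit_k_to_budget(template, query, rule_blocks, max_tokens,
                    count_tokens=lambda s: len(s.split())):
    """Return (k, prompt) for the largest k whose rendered prompt fits max_tokens."""
    for k in range(len(rule_blocks), 0, -1):
        prompt = template.format(
            rules_section="\n\n".join(rule_blocks[:k]), query=query
        )
        if count_tokens(prompt) <= max_tokens:
            return k, prompt
    raise ValueError("even a single chunk does not fit the budget")


template = "Rules:\n{rules_section}\nQuestion: {query}\nAnswer:"
# Five made-up rule blocks of 21 whitespace tokens each (id + 20 words).
rule_blocks = [f"[r-{i}] " + " ".join(["w"] * 20) for i in range(5)]

k, prompt = fit_k_to_budget(
    template, "how do I conjugate work", rule_blocks, max_tokens=80
)
print(k)  # 3: the boilerplate and query take 8 tokens, each block adds 21
```

Because the blocks are kept in retrieval order, dropping from the end discards the lowest-ranked chunks first.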
Stretch¶
- Add a cross-encoder reranker (theory 03) between hybrid retrieval and reader. Measure the faithfulness lift.
- Implement HyDE (Hypothetical Document Embeddings): generate a fake answer first, embed it, retrieve with the fake. Compare to standard query embedding.
- Add a refusal-quality metric: for queries that the KB can't answer (deliberately out-of-scope), the model should refuse. Score correct refusals.
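The refusal-quality idea can be prototyped with a simple phrase check against the refusal wording from the prompt template. This is a rough sketch under assumptions: the `in_scope` field, the function names, and the two output rates are hypothetical, and a substring match will miss paraphrased refusals.

```python
REFUSAL_PHRASE = "i don't know from the provided rules"  # wording from the prompt template


def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal wording (case-insensitive)."""
    return REFUSAL_PHRASE in answer.lower()


def refusal_scores(rows: list[dict]) -> dict:
    """rows: dicts with 'answer', 'cited_chunks', and a hypothetical 'in_scope' flag."""
    out_scope = [r for r in rows if not r["in_scope"]]
    in_scope = [r for r in rows if r["in_scope"]]
    # A correct refusal declines to answer and cites nothing.
    correct = sum(is_refusal(r["answer"]) and not r["cited_chunks"] for r in out_scope)
    # Over-refusal: declining a question the KB can answer.
    over = sum(is_refusal(r["answer"]) for r in in_scope)
    return {
        "correct_refusal_rate": correct / len(out_scope) if out_scope else None,
        "over_refusal_rate": over / len(in_scope) if in_scope else None,
    }


# Made-up rows, for illustration only.
rows = [
    {"in_scope": False, "answer": "I don't know from the provided rules.", "cited_chunks": []},
    {"in_scope": False, "answer": "The capital is Madrid.", "cited_chunks": []},
    {"in_scope": True, "answer": "works [en-pres-3sg-regular-s-rule-001]",
     "cited_chunks": ["en-pres-3sg-regular-s-rule-001"]},
    {"in_scope": True, "answer": "I don't know from the provided rules.", "cited_chunks": []},
]
scores = refusal_scores(rows)
print(scores)
```

Tracking over-refusal alongside correct refusals guards against a reader that scores well simply by declining everything.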
End of Phase 29 labs. Time to write PHASE_29_REPORT.md and prep for Phase 30.
Next: Phase 30 — Structured Generation & Constrained Decoding.