Break 00 — Skip retrieval: feed the raw query straight to the reader¶
We skip retriever.retrieve(...) and pass the raw query to the Mini-GPT reader, with no context. The model "answers" from its weights, not from the knowledge base. We compare against the baseline with retrieval to measure how far accuracy and faithfulness fall on a §A13 lookup question.
This /break exercise demonstrates the closed-book vs open-book trade-off concretely. The bug: the augmented-prompt step is bypassed. The failure mode is dramatic on a §A13 lookup task because Mini-GPT's parameters do not reliably encode the irregular-verb tables.
Anchors: theory/00-motivation.md, theory/05-rag-from-scratch-spec.md, .claude/commands/break.md.
Hypothesis¶
The learner predicts: "The §A13 grammar KB chunk for write says: 'past participle: written. past simple: wrote.' If I retrieve and inject this chunk, Mini-GPT can read off the answer. If I skip retrieval, Mini-GPT has to recall the form from the few thousand training examples — and for the 8 irregular verbs, it gets several confused with regular conjugations ('writed', 'eated'). Accuracy will drop from ~95% with retrieval to ~50% without."
The break¶
In src/minirag/pipeline.py:
 def rag_answer(query: str, store: FlatVectorStore, reader: MiniGPT, k: int = 3) -> str:
-    retriever = Retriever(store=store)
-    chunks = retriever.retrieve(query, k=k)
-    prompt = build_prompt(query, chunks)
+    # /break: skip retrieval; feed the bare query.
+    prompt = query
     return reader.generate(prompt, max_new_tokens=20)
Three lines deleted; two added (a comment and the bare assignment). The pipeline now bypasses the retriever and prompt augmenter, so nothing produced by the chunker, embedder, or store ever reaches the reader: it's a closed-book call.
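To make the two paths concrete, here is a minimal sketch of what the reader receives in each case. The function `build_prompt_sketch` and its template wording are assumptions for illustration only; the real template lives in theory/05-rag-from-scratch-spec.md and may differ.

```python
# Hypothetical stand-in for build_prompt; the real template may differ.
def build_prompt_sketch(query: str, chunks: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "What is the past participle of write?"
# Chunk text paraphrased from the hypothesis above.
chunks = ["write: past participle: written. past simple: wrote."]

open_book_prompt = build_prompt_sketch(query, chunks)  # baseline path
closed_book_prompt = query                             # /break path

print(open_book_prompt)
print("---")
print(closed_book_prompt)
```

In the open-book prompt the answer is literally on the page; in the closed-book prompt the reader has only the question.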
Predict, then run¶
The §A13 eval set has 30 lookup-style questions:
- "What is the past simple of eat?" → ate (irregular)
- "What is the past participle of write?" → written (irregular)
- "What is the third-person present of study?" → studies (regular, but -y → -ies)
Mini-GPT was trained on the §A13 corpus, which does contain these conjugations — but not exhaustively, and not in a "Q: ... A: ..." format. The closed-book model has seen wrote and written in context, but its ability to answer a direct question about them is weaker than its ability to complete a sentence containing them.
Predictions¶
- With retrieval (baseline): ~25/30 correct (83%). The 5 misses are usually phrasing-related, not factual.
- Without retrieval (broken): ~10-15/30 correct (33-50%). The model:
  - regularizes irregular verbs ("writed" instead of "wrote") about 30% of the time;
  - guesses the infinitive when asked about a tense it can't recall;
  - sometimes returns confident garbage (3rd-person of be → "bees").
- Faithfulness metric: drops from ~1.0 (every claim grounded in a retrieved chunk) to ~0.0 (no chunks were retrieved). This is the cleaner signal: even when the closed-book answer happens to be right, it has nothing to cite.
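One simple way to operationalize the faithfulness metric is sketched below, assuming it is defined as the fraction of answers whose text appears in at least one retrieved chunk. That definition, the function name `faithfulness`, and the `eat` chunk text are assumptions for illustration; the lab's actual metric may be computed differently.

```python
def faithfulness(answers: list[str], retrieved: list[list[str]]) -> float:
    """Fraction of answers found (case-insensitively) in a retrieved chunk."""
    if not answers:
        return 0.0
    grounded = 0
    for answer, chunks in zip(answers, retrieved):
        needle = answer.strip().lower()
        if needle and any(needle in chunk.lower() for chunk in chunks):
            grounded += 1
    return grounded / len(answers)


answers = ["wrote", "ate"]
with_retrieval = [
    ["write: past participle: written. past simple: wrote."],
    ["eat: past participle: eaten. past simple: ate."],  # illustrative chunk
]
without_retrieval = [[], []]  # the /break path retrieves nothing

print(faithfulness(answers, with_retrieval))     # 1.0
print(faithfulness(answers, without_retrieval))  # 0.0
```

Substring matching is deliberately crude, but it captures the point made above: with no chunks, even a correct answer scores as ungrounded.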
Write your predictions in learners/borja/phase-29/notes/breaks.md before running.
Observe¶
Run the Phase 29 end-to-end eval, comparing modes:
just exp 29-rag --mode with-retrieval
just exp 29-rag --mode no-retrieval
diff experiments/<date>-29-rag-with-retrieval/answers.jsonl \
     experiments/<date>-29-rag-no-retrieval/answers.jsonl
Diagnostics to plot:
- Bar chart: accuracy on the 30-question §A13 eval set, two bars. Expect ~83% vs ~40%.
- Confusion plot: for the 8 irregular verbs, fraction of "regularized" answers (incorrect -ed suffix on irregulars). With retrieval ≈ 0.05; without ≈ 0.30.
- Side-by-side answers on three example questions: "past simple of eat", "past participle of write", "3rd-person present of be". The closed-book answers are diagnostic.
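The two numeric diagnostics above could be computed along these lines. This is a sketch under stated assumptions: exact-match scoring, a suffix check for "regularized" answers, and toy data invented for the example. The function names are hypothetical, and how predictions are loaded from answers.jsonl is left out because its schema is not specified here.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy after trimming and lowercasing."""
    if not gold:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold))
    return hits / len(gold)


def regularization_rate(predictions: list[str], gold: list[str]) -> float:
    """Fraction of irregular-verb answers that are wrong and end in -ed."""
    if not gold:
        return 0.0
    regularized = sum(
        p.strip().lower().endswith("ed") and p.strip().lower() != g.strip().lower()
        for p, g in zip(predictions, gold)
    )
    return regularized / len(gold)


gold = ["ate", "wrote", "written", "went"]          # toy irregular forms
closed_book = ["eated", "wrote", "writed", "went"]  # made-up model outputs

print(accuracy(closed_book, gold))             # 0.5
print(regularization_rate(closed_book, gold))  # 0.5
```

Pass only the irregular-verb subset to `regularization_rate`; on regular verbs a correct answer legitimately ends in -ed.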
Symptom Borja will see¶
- Accuracy drops by ≥ 30 percentage points.
- Specific irregular verbs are systematically regularized.
- No retrieval-related logs — because retrieval was skipped entirely.
- The faithfulness metric collapses (no chunks → no grounded citations).
Hidden cause (one sentence)¶
The rag_answer function was modified to bypass the retriever and prompt augmenter, sending the bare query to Mini-GPT, which cannot reliably recall the §A13 conjugations from its weights alone.
Hint cascade¶
- Search the logs for "retrieved chunks". Are any printed? If not, where did retrieval go?
- Print the actual prompt being sent to reader.generate(...). Does it contain context chunks?
- Compare the prompt to the template in theory/05-rag-from-scratch-spec.md §"Default template". What's missing?
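For the second hint, one low-effort way to see the prompt is to wrap the generate callable so it prints each prompt before delegating. The sketch below is an assumption-laden illustration: `log_prompts` and `fake_generate` are hypothetical helpers, and the "Context:" marker is a guess at what the template contains, not something taken from the spec.

```python
from typing import Callable


def log_prompts(generate: Callable[..., str], marker: str = "Context:") -> Callable[..., str]:
    """Wrap a generate callable so every prompt is printed before the call."""
    def wrapped(prompt: str, **kwargs) -> str:
        has_context = marker in prompt  # marker is an assumed template keyword
        print(f"[prompt has_context={has_context}] {prompt!r}")
        return generate(prompt, **kwargs)
    return wrapped


# Stub standing in for reader.generate so the sketch runs on its own.
def fake_generate(prompt: str, max_new_tokens: int = 20) -> str:
    return "writed"


generate = log_prompts(fake_generate)
generate("What is the past participle of write?", max_new_tokens=20)
```

In the lab, something like `reader.generate = log_prompts(reader.generate)` might work, assuming the reader object allows attribute assignment. With the break in place, every logged prompt would show has_context=False.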
Fix diff¶
 def rag_answer(query: str, store: FlatVectorStore, reader: MiniGPT, k: int = 3) -> str:
-    prompt = query
+    retriever = Retriever(store=store)
+    chunks = retriever.retrieve(query, k=k)
+    prompt = build_prompt(query, chunks)
     return reader.generate(prompt, max_new_tokens=20)
Why this teaches the concept¶
RAG's core claim is: "your model doesn't need to know the answer; it needs to read the answer." This break makes that claim load-bearing. With retrieval, Mini-GPT (which has 103 680 params, far too few to memorize a verb table reliably) easily answers lookup questions. Without retrieval, the same model fails many of them, because the knowledge isn't reliably stored in the weights. This is the closed-book vs open-book dichotomy in theory/00-motivation.md, made empirically real.
The lesson generalizes: LLMs appear to memorize facts because their training data repeated those facts often enough, but their characteristic failure modes (hallucination, stale knowledge) are exactly the ones a retrieval layer is designed to catch. Production systems with millions of facts (medical records, legal docs, company KBs) cannot afford to rely on parametric memory; RAG is the architectural answer.
Reference¶
- Lewis et al., RAG paper (arXiv:2005.11401), Table 4 — shows the answer-accuracy delta on TriviaQA when retrieval is ablated.
- theory/00-motivation.md §"When closed-book is enough": the conditions under which the closed-book path is actually viable (small, stable, dense knowledge, the opposite of irregular-verb tables).
Next: restore rag_answer and run the full lab/03-end-to-end-rag.md to chart the retrieval-on/off frontier.