00 — Why RAG: Closed-Book vs Open-Book LLMs

RAG is an architectural choice, not a trick. The choice decides where knowledge lives: in the weights (closed-book) or in an external store (open-book). There are tradeoffs in freshness, scale, updatability, attribution, and hallucination. The intuition: put everything memorizable into a cheap index, and let the model reason over what it is given.


The closed-book picture

A pretrained language model stores its knowledge implicitly in its weights. Ask GPT-4 "what's the capital of France?", and it answers "Paris" by virtue of having seen that fact (in many forms) during pretraining. The model is closed-book: at inference time, no external knowledge is consulted.

This works well when:

  • The knowledge is dense in the training distribution.
  • The knowledge doesn't change frequently.
  • You don't need to cite a source.
  • The total knowledge set fits comfortably in the model's parameters.

It works badly when:

  • The knowledge is rare in the training distribution (long-tail facts).
  • The knowledge is fresh (post-training cutoff).
  • You need citations for verifiability or audit.
  • The total knowledge set is larger than the model can memorize (e.g., your company's 10M-document corpus).
  • You need updates without retraining (knowledge changes; you can't retrain a 7B model every week).

The open-book alternative

Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) inverts the storage strategy:

  1. Store the knowledge in a searchable external index (vector DB, BM25 corpus, etc.).
  2. At query time, retrieve the most relevant chunks.
  3. Inject the retrieved chunks into the model's context.
  4. The model generates an answer grounded in the retrieved chunks.

The model's role shifts from "remember the fact" to "reason over the given context". This is why even small LLMs can be effective with good retrieval: they don't need to know the answer themselves — they need to read it correctly and respond.
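Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not the Phase 29 implementation: the three-chunk `KB`, its chunk ids, the token-overlap scorer (a crude stand-in for BM25 or an embedder), and the prompt template are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical three-chunk knowledge base; ids and texts are made up for illustration.
KB = {
    "rule-001": "The past participle of write is written.",
    "rule-002": "The past participle of eat is eaten.",
    "rule-003": "Regular verbs form the past tense by adding -ed.",
}

def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def overlap_score(query, chunk):
    """Count tokens shared by query and chunk (a crude stand-in for a real scorer)."""
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    return sum((q & c).values())

def retrieve(query, kb, k=2):
    """Step 2: return the k chunk ids with the highest overlap score."""
    ranked = sorted(kb, key=lambda cid: overlap_score(query, kb[cid]), reverse=True)
    return ranked[:k]

def build_prompt(query, kb, chunk_ids):
    """Step 3: inject the retrieved chunks, tagged with their ids, ahead of the question."""
    context = "\n".join(f"[{cid}] {kb[cid]}" for cid in chunk_ids)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer (cite a chunk id):"

query = "What is the past participle of eat?"
top_ids = retrieve(query, KB)
prompt = build_prompt(query, KB, top_ids)
print(prompt)
```

The reader never has to know the participle itself. It only has to copy "eaten" out of the chunk tagged `rule-002`, which is why a small model can do the job.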

The two-systems perspective

Modern RAG is two systems:

  • A retriever (cheap, fast, scales to billions of documents).
  • A reader/generator (expensive, slow, scales with quality of attention).

The retriever's job: get the relevant context into the reader's window with high recall@k. The reader's job: synthesize an answer from that context.

This decoupling is the win. Each system can be improved independently. Each can use different model families (sentence-transformers for retrieval; LLaMA-7B for reading). And each is independently evaluable (hit-rate for retrieval; faithfulness for reading).
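Because the retriever is evaluated on its own, its metric needs only ranked chunk ids and gold labels, with no reader involved. A minimal sketch of hit-rate@k follows; the query ids, chunk ids, and labels are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k retrieved ids contain at least one relevant id."""
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        if set(ranked_ids[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(retrieved)

# Hypothetical retriever output (best first) and gold labels for three queries.
retrieved = {
    "q1": ["c7", "c2", "c9"],
    "q2": ["c4", "c1", "c3"],
    "q3": ["c5", "c8", "c6"],
}
relevant = {"q1": {"c2"}, "q2": {"c3"}, "q3": {"c0"}}

hr1 = hit_rate_at_k(retrieved, relevant, k=1)  # no query has a relevant chunk at rank 1
hr3 = hit_rate_at_k(retrieved, relevant, k=3)  # q1 and q2 hit, q3 misses
print(hr1, hr3)
```

Raising k can only keep or increase this number, which is the usual argument for retrieving generously and letting a later stage trim the list.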

When RAG dominates

For our Phase 29 use case — answering English-verb-grammar questions — RAG is the obvious choice for several reasons:

  1. Rare facts: MiniGPT was trained on a tiny corpus. It hasn't seen, e.g., "the past participle of write is written" thousands of times. RAG lets us inject this fact reliably.
  2. Citations: A grammar tutor that says "the past participle of write is written (rule: Irregular Past Participles, §3.2)" is much more trustworthy than one that says "the past participle of write is written" alone.
  3. Updatable: Add a new verb to the KB — no retraining needed.
  4. Auditable: When the model gives a wrong answer, you can check whether retrieval failed (wrong chunks) vs reading failed (right chunks, wrong reasoning).
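The audit in point 4 can be sketched as a small triage function. The function name, the label strings, and the substring check for correctness are assumptions made for illustration; a real evaluation would use a stricter answer match.

```python
def diagnose(answer, gold_answer, retrieved_ids, gold_chunk_ids):
    """Classify one example as correct, a retrieval failure, or a reading failure."""
    # Crude correctness check: the gold answer appears somewhere in the model's answer.
    if gold_answer.lower() in answer.lower():
        return "correct"
    # Wrong answer and the supporting chunk never reached the reader.
    if not set(retrieved_ids) & set(gold_chunk_ids):
        return "retrieval_failure"
    # Wrong answer even though the supporting chunk was in context.
    return "reading_failure"

print(diagnose("It is ate.", "eaten", ["rule-001", "rule-003"], ["rule-002"]))
print(diagnose("It is ate.", "eaten", ["rule-002", "rule-001"], ["rule-002"]))
```

Counting the two failure labels over a test set tells you which half of the system to work on next.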

When closed-book is enough

For a high-volume, hot-path task with stable knowledge — e.g., "is this English sentence grammatical?" — a fine-tuned model (Phase 28's LoRA path) may be faster and cheaper than RAG. There's no retrieval latency; no external system to manage. Lab 02 of Phase 28 already showed: LoRA fine-tuning of MiniGPT on irregular verbs can lift task accuracy without retrieval.

The choice between fine-tuning and RAG isn't either-or. Many production systems do both: fine-tune for behaviour (output format, style, calibration), retrieve for facts (the actual answer content).

For Phase 29 we use RAG alone for the verb-tutor task to keep the pipeline minimal. In Phase 32 (agent) we revisit the fine-tune + retrieve combination.

Why fine-tuning the knowledge in fails

A reasonable instinct: "instead of RAG, fine-tune MiniGPT on the grammar rules. They're tiny — 50 chunks. The model can memorize them."

Two reasons this fails at scale:

  1. It conflates knowledge with behaviour. The fine-tuning loss optimizes for reproducing the rule's text, not for citing it or for answering questions about it. A model that has memorized "the past participle of eat is eaten" can still answer "what's the past participle of eat?" with a wrong guess if the fine-tuning data never presented the fact in question-and-answer form.
  2. It doesn't update. Every change to the rules requires re-fine-tuning. RAG just updates the KB.

At MiniGPT scale these are theoretical concerns (the corpus is small). At production scale (10M docs, weekly updates) they're decisive.

The architecture we'll build

       ┌─────────────────────────────────────┐
       │              Query                  │
       │  "past participle of eat?"          │
       └─────────────────────────────────────┘
       ┌─────────────────────────────────────┐
       │   Embedder (bi-encoder)             │
       │   q_vec ∈ ℝ^384                     │
       └─────────────────────────────────────┘
       ┌─────────────────────────────────────┐
       │   Vector index (FAISS flat)         │
       │   top-k by cosine similarity        │
       │   k = 10                            │
       └─────────────────────────────────────┘
       ┌─────────────────────────────────────┐   ┌──────────────┐
       │   BM25 index                        │   │  RRF fuse    │
       │   top-k by BM25 score               │──▶│  (k = 60)    │
       │   k = 10                            │   │              │
       └─────────────────────────────────────┘   └──────────────┘
                                            ┌─────────────────────────┐
                                            │  Reranker (cross-enc)   │
                                            │  top-3 from top-10      │
                                            └─────────────────────────┘
                                            ┌─────────────────────────┐
                                            │  Reader (MiniGPT+LoRA)  │
                                            │  prompt = top-3 + query │
                                            │  → answer + citations   │
                                            └─────────────────────────┘

Each component is independently testable. Each can be ablated.
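The fusion box is a good example of a component that can be tested in isolation. Below is a minimal sketch of reciprocal rank fusion, assuming the "k = 60" in the diagram is the usual RRF smoothing constant and that ranks start at 1. The chunk ids and the alphabetical tie-break are illustrative choices.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 lists from the dense index and the BM25 index.
dense = ["c2", "c5", "c1"]
bm25 = ["c5", "c9", "c2"]
fused = rrf_fuse([dense, bm25])
print(fused)
```

RRF uses only ranks, never raw scores, so cosine similarities and BM25 scores do not need to be put on a common scale before fusing.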

A note on hallucination

A common failure of LLMs is hallucination: asserting facts that are not supported by the training data. RAG reduces hallucination not because it makes the model "more honest" but because the model has the right facts in its context, so it is less likely to invent them.

But RAG doesn't eliminate hallucination. A model can still:

  • Misread the retrieved context.
  • Add details not in the context.
  • Confidently answer when the context doesn't contain the answer (the right behaviour is "I don't know").

Faithfulness as a metric measures whether the answer is supported by the retrieved context. Lab 03 measures it.
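To make the idea concrete, here is a crude lexical proxy for faithfulness: the share of the answer's content words that also appear in the retrieved context. This is only a sketch under stated assumptions. The stopword list and function names are made up, and it is not necessarily how Lab 03 measures faithfulness; practical metrics usually rely on an entailment model or an LLM judge rather than word overlap.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "of", "is", "a", "an", "and", "to", "in"}

def content_tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def support_ratio(answer, context):
    """Fraction of the answer's content tokens that also occur in the context."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(content_tokens(context))
    supported = sum(1 for t in answer_tokens if t in context_vocab)
    return supported / len(answer_tokens)

context = "The past participle of eat is eaten."
grounded = support_ratio("The past participle of eat is eaten", context)
embellished = support_ratio("Eaten, which is also the plural form", context)
print(grounded, embellished)
```

The second answer scores low because most of its content words ("which", "also", "plural", "form") have no support in the context, which is the "adds details not in the context" failure from the list above.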

Drill problems

Solutions are released when the phase opens, in solutions/00-motivation-ref.md.

  1. Suppose you have a 7B closed-book LLM and a 7B LLM-plus-RAG with a 10M-document KB. For which tasks does each excel? Give one example per task.
  2. The bi-encoder retrieves the top-10 chunks, and the reader gets all 10 in context. What goes wrong if 7 of those 10 chunks are irrelevant? (Hint: the model can be confused by irrelevant context.)
  3. What's the difference between faithfulness ("answer is supported by the retrieved context") and accuracy ("answer is correct")? Can you have one without the other? Give a scenario for each.
  4. Argue for one of: (a) "always use RAG", (b) "always fine-tune", (c) "use both: fine-tune for behaviour, RAG for knowledge". Defend with a concrete example task.

One-paragraph recap

RAG decouples knowledge storage from reasoning: store the facts in a searchable index, retrieve relevant ones at query time, inject them into the reader model's context. The win is decomposition: retriever and reader can be improved, swapped, and evaluated independently. RAG excels when knowledge is rare, fresh, or large; closed-book LLMs win when knowledge is dense, stable, and small. Retrieval is two-stage (dense embedding for recall, then cross-encoder reranking for precision) and is followed by a reader (LLM) for synthesis. Phase 29 implements this minimally: small KB, small models, CPU only. The pedagogical reward is feeling each component's contribution to end-to-end quality.

Next: theory/01-embeddings-and-biencoders.md.