05 — Build-It-Yourself RAG: the End-to-End Spec (no langchain, no llama-index)

We do not use langchain or llama-index; they are anti-goals of this repo. This section describes each link of the RAG pipeline as a standalone Python module: chunker → embedder → vector store → retriever → augmented prompt → reader. Each link has a stable API, a measured cost, and an isolation test.

Anchors: CLAUDE.md §0.4 (no langchain), theory/00-motivation.md (closed- vs open-book), LYNX_CORTEX.md §10 (anti-goals).


Why we don't use a framework

The langchain / llama-index anti-goal isn't aesthetics. The pedagogical contract demands you see every call site:

  • which chunker split which document and why;
  • which embedding model produced which vector;
  • which similarity metric was applied;
  • what got into the prompt and in what order.

Frameworks hide all of that behind chain abstractions ("retriever | prompt | model"). For Borja's understanding, every component must be a function you wrote, importable from src/minirag/, with a docstring you can read.

The cost of writing it ourselves: ~400 lines of Python total. The win: full control over the failure modes you'll measure in Phase 32.

The pipeline as a directed graph

    ┌────────────────────────────────────────────────────────────┐
    │  Knowledge base (data/rag-kb/*.md, ~50 files)              │
    └────────────────────────────────────────────────────────────┘
                │  one-time build
                ▼
    ┌────────────────────────┐    ┌──────────────────────────────┐
    │  Chunker               │───▶│  chunks: list[Chunk]         │
    │  src/minirag/chunk.py  │    │  Chunk = {id, text, meta}    │
    └────────────────────────┘    └──────────────────────────────┘
                │
                ▼
    ┌────────────────────────┐    ┌──────────────────────────────┐
    │  Embedder              │───▶│  vectors: (N, d) np.ndarray  │
    │  src/minirag/embed.py  │    │  + chunk.id index            │
    └────────────────────────┘    └──────────────────────────────┘
                │
                ▼
    ┌────────────────────────────────────────────────────────────┐
    │  Vector store (in-mem)                                     │
    │  src/minirag/store.py                                      │
    └────────────────────────────────────────────────────────────┘
                                       │  query time
                                       ▼
         ┌──────────┐    ┌───────────────────────────┐    ┌──────────────┐
Query ──▶│ Embedder │───▶│ Retriever (cosine search) │───▶│ top-k Chunks │
         └──────────┘    └───────────────────────────┘    └──────────────┘
                                                                 │
                                                                 ▼
                                                ┌────────────────────────┐
                                                │  Prompt augmenter      │
                                                │  src/minirag/prompt.py │
                                                └────────────────────────┘
                                                            │
                                                            ▼
                                                ┌────────────────────────┐
                                                │  Reader (MiniGPT)      │
                                                │  src/minimodel/...     │
                                                └────────────────────────┘
                                                            │
                                                            ▼
                                                ┌────────────────────────┐
                                                │  Answer + cites        │
                                                └────────────────────────┘

Five files in src/minirag/. No external deps beyond numpy and the Phase-17 MiniGPT.

Component 1: Chunker

Job. Split documents into 100-300 token chunks with overlap, preserving metadata.

API (signature only — Borja implements):

from dataclasses import dataclass
from pathlib import Path
from typing import Any

@dataclass(frozen=True)
class Chunk:
    id: str            # stable hash of doc_id + offset
    text: str          # the chunk content
    doc_id: str        # path to source document
    offset: int        # token offset within doc_id
    metadata: dict[str, Any]   # e.g., {"section": "Irregular Verbs"}

def chunk_document(doc_path: Path, target_tokens: int = 150, overlap: int = 20) -> list[Chunk]:
    """Split a markdown doc into overlapping token-bounded chunks."""
    ...

Implementation notes.

  • Tokenization: use the Phase 11 BPE tokenizer (src/minitokenizer/bpe.py). Do not import an unrelated tokenizer just because RAG papers use one.
  • Boundary respect: prefer to break on \n\n or sentence boundaries within ±10% of target_tokens. Pure fixed-window chunking is the fallback.
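  • Stable IDs: chunk.id = hashlib.sha256(f"{doc_path}:{offset}".encode()).hexdigest()[:12]. The string is formatted and encoded first because sha256 takes bytes and offset is an int. Deterministic across runs.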

Test isolation. Round-trip: chunk a doc → re-assemble the chunks, dropping each chunk's duplicated overlap region → assert the result equals the original token sequence.

Cost. O(L) where L is doc length. Single-pass; runs in milliseconds for the §A13 grammar KB.
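The fixed-window fallback and the round-trip test can be sketched as below. This is a minimal illustration, not the repo's chunker: it works on an already-tokenized list, and the helper names window_spans, reassemble, and chunk_id are hypothetical. Boundary-aware splitting on \n\n or sentence ends is left out.

```python
import hashlib


def chunk_id(doc_path: str, offset: int) -> str:
    """Stable 12-hex-char id from the document path and token offset."""
    return hashlib.sha256(f"{doc_path}:{offset}".encode("utf-8")).hexdigest()[:12]


def window_spans(n_tokens: int, target_tokens: int = 150, overlap: int = 20) -> list[tuple[int, int]]:
    """Return (start, end) token spans; consecutive spans share `overlap` tokens."""
    if not 0 <= overlap < target_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < target_tokens")
    spans = []
    start = 0
    step = target_tokens - overlap
    while start < n_tokens:
        end = min(start + target_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:      # last window reached the end of the document
            break
        start += step
    return spans


def reassemble(tokens_per_chunk: list[list[str]], overlap: int) -> list[str]:
    """Undo the windowing: keep the first chunk whole, drop the overlap from the rest."""
    out = list(tokens_per_chunk[0]) if tokens_per_chunk else []
    for chunk in tokens_per_chunk[1:]:
        out.extend(chunk[overlap:])
    return out
```

With 400 tokens, target_tokens=150 and overlap=20, the spans are (0, 150), (130, 280), (260, 400), and reassembling the three slices gives back the original list.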

Component 2: Embedder

Job. Map a list of strings to a (N, d) matrix of float32 vectors.

API:

import numpy as np

class Embedder:
    """Pluggable embedder. For Phase 29 default: a hand-trained bi-encoder on §A13."""

    d: int   # output vector dim

    def embed(self, texts: list[str]) -> np.ndarray:    # (len(texts), self.d)
        ...

For Phase 29 we ship two backends:

  1. MiniBiEncoder — a small two-layer MLP that pools BPE token embeddings, trained briefly with a contrastive loss (in-batch negatives). Implements the bi-encoder pattern from theory/01-embeddings-and-biencoders.md. Default d = 64 (matches Mini-GPT's hidden dim — easier to debug end-to-end).
  2. HashingEmbedder — a 256-dim baseline that does TF-IDF hashing. No training. Used as a control: if MiniBiEncoder doesn't beat HashingEmbedder, retrieval is broken.
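For the contrastive objective behind MiniBiEncoder, a forward-pass sketch in NumPy may help. The cosine normalization and the temperature value of 0.05 are assumptions of this sketch, not values taken from the repo. The function name is hypothetical, and gradients are out of scope here.

```python
import numpy as np


def in_batch_contrastive_loss(q: np.ndarray, p: np.ndarray, temperature: float = 0.05) -> float:
    """Mean cross-entropy where row i of `q` should match row i of `p`.

    q, p: (B, d) arrays of query and positive-passage embeddings.
    Every other row of `p` in the batch serves as a negative for query i.
    """
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-9)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-9)
    logits = (q @ p.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # correct targets sit on the diagonal
```

A batch whose pairs are aligned should score a much lower loss than the same batch with the positives rotated by one row, which makes a cheap sanity check before training.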

Test isolation. Embedding stability: embedder.embed(["work"]) returns the same vector every run (seed-controlled).

Cost. O(N · d) for embedding N chunks; O(d) per query.
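A training-free control embedder could look roughly like the sketch below. It is deliberately simpler than the HashingEmbedder described above: it counts hashed term frequencies only and omits the IDF weighting, and it splits on whitespace instead of using the Phase 11 BPE tokenizer. Both simplifications are assumptions made to keep the example self-contained.

```python
import hashlib

import numpy as np


class HashingEmbedder:
    """Training-free baseline: hash each token into one of `d` buckets and count."""

    def __init__(self, d: int = 256):
        self.d = d

    def _bucket(self, token: str) -> int:
        # hashlib instead of built-in hash(): the latter is salted per process,
        # which would break the "same vector every run" stability test.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "little") % self.d

    def embed(self, texts: list[str]) -> np.ndarray:
        out = np.zeros((len(texts), self.d), dtype=np.float32)
        for row, text in enumerate(texts):
            for token in text.lower().split():   # whitespace split is a stand-in tokenizer
                out[row, self._bucket(token)] += 1.0
        return out
```

Because nothing here depends on a random seed, two separate instances return identical vectors for the same input.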

Component 3: Vector store

Job. Hold (N, d) matrix + parallel array of Chunk.ids. Support search(query_vec, k) → top-k chunks.

API:

class FlatVectorStore:
    """O(N) brute-force cosine search. Correct, slow, easy to debug.
    For Mini-GPT-scale (~50 chunks) this is faster than any tree- or graph-based index."""

    def __init__(self, embedder: Embedder):
        self.embedder = embedder
        self.vectors: np.ndarray = np.empty((0, embedder.d), dtype=np.float32)
        self.chunks: list[Chunk] = []

    def add(self, chunks: list[Chunk]) -> None: ...
    def search(self, query_text: str, k: int = 5) -> list[tuple[Chunk, float]]: ...

Implementation: cosine on normalized vectors:

def search(self, query_text: str, k: int = 5) -> list[tuple[Chunk, float]]:
    q = self.embedder.embed([query_text])[0]
    q /= np.linalg.norm(q) + 1e-9           # unit-normalize the query; 1e-9 guards zero vectors
    norms = np.linalg.norm(self.vectors, axis=1, keepdims=True) + 1e-9
    sims = (self.vectors / norms) @ q       # (N,) cosine similarities
    top_k_idx = np.argsort(-sims)[:k]       # indices of the k highest similarities
    return [(self.chunks[i], float(sims[i])) for i in top_k_idx]

Why not FAISS / HNSW? They are correct (we'll read FAISS in lab 04 of an extended setup), but for ~50 chunks the brute-force cosine scan is the right choice. HNSW, a graph-based index, pays its log-factor advantage only above ~10⁴ chunks. We add approximate indexes (tree- or graph-based) only when the scale demands it; in Phase 29 the linear scan is the right choice on Borja's CPU.

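Test isolation. Empty store returns []. Single-chunk store returns [(chunk, sim)] with sim ≈ 1.0 for an exact-match query; compare with a tolerance, since float32 arithmetic and the 1e-9 guard can leave the value marginally off 1.0.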
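The store and its two isolation tests fit in one self-contained sketch. ToyEmbedder is a hypothetical stand-in (a letter-count vector), and the store below holds plain strings where the spec holds Chunk objects. The add method is one plausible way to fill in the stub, not necessarily the repo's.

```python
import numpy as np


class ToyEmbedder:
    """Hypothetical stand-in embedder: counts letters a-z, so d = 26."""

    d = 26

    def embed(self, texts: list[str]) -> np.ndarray:
        out = np.zeros((len(texts), self.d), dtype=np.float32)
        for row, text in enumerate(texts):
            for ch in text.lower():
                if "a" <= ch <= "z":
                    out[row, ord(ch) - ord("a")] += 1.0
        return out


class FlatVectorStore:
    def __init__(self, embedder):
        self.embedder = embedder
        self.vectors = np.empty((0, embedder.d), dtype=np.float32)
        self.chunks: list[str] = []   # plain strings here; the spec stores Chunk objects

    def add(self, chunks: list[str]) -> None:
        if not chunks:
            return
        new = self.embedder.embed(chunks).astype(np.float32)
        self.vectors = np.vstack([self.vectors, new])   # keep rows parallel to self.chunks
        self.chunks.extend(chunks)

    def search(self, query_text: str, k: int = 5) -> list[tuple[str, float]]:
        if not self.chunks:
            return []
        q = self.embedder.embed([query_text])[0]
        q = q / (np.linalg.norm(q) + 1e-9)
        norms = np.linalg.norm(self.vectors, axis=1, keepdims=True) + 1e-9
        sims = (self.vectors / norms) @ q
        top = np.argsort(-sims)[:k]
        return [(self.chunks[i], float(sims[i])) for i in top]
```

Rebuilding the matrix with np.vstack on every add is quadratic if called once per chunk; at this scale that is harmless, and calling add once with the full chunk list avoids it entirely.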

Component 4: Retriever

Job. Combine vector store search with (optionally) BM25 lexical search, returning top-k chunks ranked.

API:

class Retriever:
    def __init__(self, store: FlatVectorStore, bm25: BM25Index | None = None):
        self.store = store
        self.bm25 = bm25

    def retrieve(self, query: str, k: int = 3, hybrid: bool = False) -> list[Chunk]:
        ...

Hybrid (when hybrid=True): apply Reciprocal Rank Fusion over the two rankings:

\[ \text{RRF}(c) = \sum_{r \in \{\text{dense}, \text{bm25}\}} \frac{1}{60 + \text{rank}_r(c)} \]

The constant 60 is Cormack et al.'s default. The implementation is a few lines of code; the math is trivial.
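The RRF formula translates almost directly into code. In this sketch the function name rrf_fuse is hypothetical, rankings are lists of chunk ids with the best first, ranks start at 1, and a chunk missing from one ranking simply contributes nothing from it. That last convention is an assumption.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 3, c: int = 60) -> list[str]:
    """Fuse ranked id lists with Reciprocal Rank Fusion; ranks start at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:k]
```

For dense = ["a", "b", "c"] and bm25 = ["b", "d", "a"], chunk "b" scores 1/62 + 1/61 and chunk "a" scores 1/61 + 1/63, so the fused top-3 is ["b", "a", "d"].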

Why hybrid? Dense retrieval misses exact-string matches; BM25 misses paraphrase. For §A13 grammar queries, the verbatim query "past tense of eat" is BM25-friendly, while the paraphrase "what's the simple past of the verb to eat" is dense-friendly. Production systems almost always blend.
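The lexical side can be sketched with a small Okapi BM25 scorer. The class body below is an illustration only: the interface of the repo's BM25Index is not shown in this section, so the method names score and rank, the whitespace tokenizer, and the parameter values k1 = 1.5 and b = 0.75 are all assumptions.

```python
import math
from collections import Counter


class BM25Index:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [d.lower().split() for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / max(len(self.docs), 1)
        df = Counter(t for d in self.docs for t in set(d))   # document frequency per term
        n = len(self.docs)
        # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        dl = len(self.docs[i])
        total = 0.0
        for t in query.lower().split():
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query: str, k: int = 3) -> list[int]:
        """Indices of the k best-scoring documents, best first."""
        order = sorted(range(len(self.docs)), key=lambda i: -self.score(query, i))
        return order[:k]
```

A document that shares no term with the query scores exactly 0.0, which is the failure mode on paraphrased queries that the dense backend is meant to cover.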

Test isolation. A query that contains the exact chunk content should retrieve that chunk first under both backends.

Component 5: Prompt augmenter

Job. Format retrieved chunks + the user query into a prompt the reader model can ingest.

API:

def build_prompt(query: str, chunks: list[Chunk], template: str = DEFAULT_TEMPLATE) -> str:
    """Render the prompt; include citation markers [#1], [#2] keyed to chunk ids."""

Default template (every byte intentional — this template determines what the model can cite):

You are a grammar tutor. Use only the facts in the context to answer.
If the answer is not in the context, say "I don't know — not in the rules I have."

Context:
[#1] {chunks[0].text}
[#2] {chunks[1].text}
[#3] {chunks[2].text}

Question: {query}
Answer with a one-sentence response, followed by the citation in brackets.

Constraints to track:

  • Prompt length ≤ Mini-GPT's context_length = 64 (per Phase 17). For the chunker, this means target_tokens ≤ 12 per chunk in the worst case: small chunks for our small model, and far below the chunker's default of 150, so pass target_tokens explicitly. In a production setup this constraint relaxes to "≤ context length minus the answer budget".
  • Chunk order is rank-descending; readers attend more strongly to the last chunk by recency bias. Inverting the order is a measurable lab experiment.
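The per-chunk budget is simple arithmetic, sketched below. The function name and the overhead figure are hypothetical: an overhead of 8 tokens for template plus query is picked only because, together with k = 3 and a 20-token answer budget, it reproduces the 12-token figure. The sketch also assumes generated tokens count against the same 64-token window. The real overhead should be measured by running the template through the Phase 11 tokenizer.

```python
def max_chunk_tokens(context_length: int, k: int, overhead_tokens: int, answer_budget: int) -> int:
    """Largest per-chunk token count such that prompt plus answer fit in the context.

    overhead_tokens: tokens used by the template text and the query (measure, don't guess).
    answer_budget:   tokens reserved for generation, e.g. max_new_tokens.
    """
    available = context_length - overhead_tokens - answer_budget
    return max(available // k, 0)


# (64 - 8 - 20) // 3 == 12 under the hypothetical overhead of 8 tokens.
budget = max_chunk_tokens(context_length=64, k=3, overhead_tokens=8, answer_budget=20)
```

If the measured overhead leaves no room, the function returns 0, which is a signal to shorten the template or lower k before touching the chunker.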

Test isolation. build_prompt(q, []) returns a prompt that elicits the "I don't know" response — never falls back to closed-book mode silently.
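One way to render the template for any number of chunks, including zero, is sketched here. The name render_prompt, the header and footer parameters, and the "(no context retrieved)" marker are assumptions of this sketch; the spec's build_prompt takes Chunk objects and a single template string.

```python
def render_prompt(query: str, chunk_texts: list[str], header: str, footer: str) -> str:
    """Render header, numbered context lines, the question, and the footer."""
    if chunk_texts:
        context = "\n".join(
            f"[#{i}] {text}" for i, text in enumerate(chunk_texts, start=1)
        )
    else:
        # Explicit marker so an empty retrieval is visible in the prompt
        # instead of silently degrading to closed-book mode.
        context = "(no context retrieved)"
    return f"{header}\n\nContext:\n{context}\n\nQuestion: {query}\n{footer}"
```

Numbering from the enumerate index rather than hard-coding [#1] to [#3] keeps the citation markers correct when k changes or when fewer than k chunks come back.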

Component 6: Reader

The Mini-GPT from Phase 17, optionally with the LoRA adapter from Phase 28. The reader takes the augmented prompt and emits an answer.

No new API. Reuses src/minimodel/mini_gpt.py's generate(prompt, max_new_tokens).

The reader is the only component whose learned parameters shape the generated text. The retriever's embeddings are fixed once the embedder's one-time training is done; the chunker is rule-based; the prompt template is hand-written.

End-to-end call site

# At repo top-level: src/minirag/pipeline.py
def rag_answer(query: str, store: FlatVectorStore, reader: MiniGPT, k: int = 3) -> str:
    retriever = Retriever(store=store)
    chunks = retriever.retrieve(query, k=k)
    prompt = build_prompt(query, chunks)
    return reader.generate(prompt, max_new_tokens=20)

That's the entire RAG pipeline in 4 lines once the components exist. Read each function. Step through it in a debugger. There is no magic.

Evaluation hooks (what theory/04-evaluation.md measures)

  • Retrieval recall@k: for the §A13 eval set's 30 "verb conjugation" questions, the gold chunk should appear in the top-k of retriever.retrieve(...). Recall@5 should hit 0.95+.
  • Faithfulness: the answer's claim should be supported by the retrieved chunks. Lab 03-end-to-end-rag.md checks this manually for the §A13 set.
  • Latency: end-to-end answer < 500 ms on Borja's CPU for the §A13 corpus.
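Recall@k reduces to a membership check per question. In this sketch the function name and the input shapes (one ranked id list and one gold id per question) are assumptions about how the eval set is stored.

```python
def recall_at_k(retrieved_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of questions whose gold chunk id appears in the top-k retrieved ids."""
    if len(retrieved_ids) != len(gold_ids):
        raise ValueError("need exactly one gold id per question")
    if not gold_ids:
        return 0.0
    hits = sum(gold in ranked[:k] for ranked, gold in zip(retrieved_ids, gold_ids))
    return hits / len(gold_ids)
```

With three questions where the gold chunk is ranked second, absent, and first, recall@1 is 1/3 and recall@2 is 2/3.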

What this section does not cover

  • Reranking with cross-encoders. Covered in theory/03-hybrid-search-and-reranking.md and lab 02. Cross-encoders score (query, chunk) pairs jointly — more accurate than bi-encoder, much slower. Production stacks use them as the second stage.
  • Persistent vector stores. Our FlatVectorStore is in-memory. SQLite-backed or FAISS-on-disk variants are out of Phase 29 scope (they don't add new concepts at this scale).
  • Streaming retrieval. Streaming answers as chunks are retrieved (e.g., for chat UIs) is a production-engineering pattern, not a RAG concept.

Why this composition does NOT reduce to "calling a framework"

Compare a langchain-like setup:

chain = retriever | prompt | model | parser

That's elegant but opaque. When the chain answers wrong, you don't know which step failed. You also don't know what the chunker chose, what the retriever ranked second, or what the prompt looks like in bytes.

The build-it-yourself pipeline above is the same logical chain, written so every intermediate value is a named variable. That's the design that lets lab/03-end-to-end-rag.md ask: "your retriever returned these 3 chunks — explain which one is the most relevant and which is a near-miss". You couldn't ask that of a langchain pipeline without unpacking it first.
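To make the "every intermediate value is a named variable" point concrete, the call site can return its intermediates instead of just the answer. RagTrace and rag_answer_traced are hypothetical names, and the three stages are passed in as plain callables so the sketch runs without the rest of the repo.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RagTrace:
    query: str
    retrieved: list[tuple[str, float]]   # (chunk text, similarity), best first
    prompt: str                          # exactly what the reader saw
    answer: str


def rag_answer_traced(
    query: str,
    search: Callable[[str, int], list[tuple[str, float]]],
    render: Callable[[str, list[str]], str],
    generate: Callable[[str], str],
    k: int = 3,
) -> RagTrace:
    """Run retrieve -> render -> generate and keep every intermediate value."""
    retrieved = search(query, k)
    prompt = render(query, [text for text, _ in retrieved])
    answer = generate(prompt)
    return RagTrace(query=query, retrieved=retrieved, prompt=prompt, answer=answer)
```

When an answer is wrong, trace.retrieved shows whether the gold chunk was fetched at all, and trace.prompt shows whether it survived into the reader's input.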

Citations

  • Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020. arXiv:2005.11401.
  • Cormack, Clarke, Büttcher, Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, SIGIR 2009. The constant 60 has been the default for 15 years.
  • Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering, EMNLP 2020. arXiv:2004.04906 — the bi-encoder + dense-retrieval foundation.

One-paragraph recap

The end-to-end RAG pipeline is a six-stage chain: chunker → embedder → vector store → retriever → prompt augmenter → reader. Each stage is a named function in src/minirag/, ~50-80 lines of Python, with a stable signature and a test of isolation. The FlatVectorStore does brute-force cosine because it's the right data structure at 50-chunk scale; we earn the right to use HNSW only when the scale demands it. Hybrid retrieval blends dense and BM25 via RRF; the prompt template is hand-tuned for Mini-GPT's 64-token context. The total LOC is ~400 — half what a framework wrapper would weigh, and 100× more legible.

Next: lab/00-kb-curation.md to build the §A13 KB, then walk down the rest of the pipeline.