English · Español

02 — Chunking Strategies and Vector Indexes¶

🇪🇸 Dos decisiones de implementación que la mayoría de RAGs malos pasan por alto: cómo se parten los documentos (chunking) y cómo se busca el vecino más cercano (index). Un mal chunker hace que la respuesta correcta aparezca dividida entre dos chunks. Un mal índice escala mal (O(N) cuando podría ser O(log N)).

Why chunking matters¶

A bi-encoder produces a fixed-dim vector per text. If you embed an entire 10-page document, that single vector compresses 10 pages of semantic content. The result: queries about specific parts of the document retrieve weakly — the vector is "averaged over too much".

So we chunk documents into smaller pieces before embedding. The size of the chunk is a tradeoff:

Too small (1 sentence): loses context. "The past participle is eaten" — past participle of what?
Too large (whole document): loses precision. Embedding averages too many concepts.
Just right: one logical unit per chunk. A complete rule statement, a single example, a definition.

For Phase 29's KB, "just right" is ~1-3 sentences per chunk, often corresponding to one grammar rule.

Chunking strategies¶

Fixed-size, no overlap¶

Take the document, split every N tokens. N = 256 is common.

Pro: simple, deterministic.
Con: splits across sentence boundaries; logical units torn apart.

Fixed-size, with overlap¶

Split every N tokens, but each chunk overlaps the previous by K tokens (K = 32-64 typical).

Pro: overlapping boundaries reduce the "ripped logical unit" problem.
Con: redundancy in the KB; same content appears in multiple chunks.

Sentence-boundary¶

Split on sentence boundaries (using nltk or regex). Each chunk is a sentence (or 2-3 sentences).

Pro: never breaks a sentence mid-clause.
Con: very short for terse content; very long for verbose content. Variable size.

Semantic chunking¶

Use a separate model (or heuristic) to detect "topic changes" between sentences; chunks are runs of topically-related sentences.

Pro: most aligned with how humans would chunk.
Con: more expensive to compute; depends on topic-detection quality.

Hierarchical / parent-child¶

Maintain both a fine-grained chunk (for embedding/retrieval) and a parent chunk (broader context, for the reader's prompt). Retrieve via child; pass parent to reader.

Pro: retrieval precision + reading context simultaneously.
Con: more storage; more complex pipeline.

For Phase 29 we use sentence-boundary chunking with manual rule alignment — the KB is small enough that we hand-curate chunks to align with grammar-rule units. Lab 00 covers KB construction.

What a good chunk looks like (for our KB)¶

Example: a chunk for the past participle of eat:

Rule: The past participle of the irregular verb "eat" is "eaten" (Spanish: comido). Used after auxiliary verbs have/has/had. Example: I have eaten dinner. (Spanish: He comido la cena.)

This chunk: - Is self-contained (you can understand the rule without context). - Has the lexical token "eat" + the lexical token "eaten" — BM25-friendly. - Has a worked example — pedagogically helpful for the reader to incorporate into its answer. - Is bilingual — surfaces Spanish pair per §A2.

A bad chunk:

The past participle is "eaten". Spanish: comido.

What's "eaten" the past participle of? The query "past participle of eat" embedded next to this chunk is weakly close — the literal verb "eat" doesn't appear.

The lesson: lexical anchoring inside chunks matters, especially for short-corpus RAG where the bi-encoder's general semantic understanding is bottlenecked by the embedding model's capacity.

Vector indexes: the structures¶

After chunking and embedding, you have a matrix D ∈ ℝ^{|KB| × d} of doc embeddings. A query q ∈ ℝ^d needs the top-k by similarity. Options:

Flat (exact NN)¶

Compute D @ q directly. O(|KB| × d) time. For |KB| = 50, d = 384: ~20K mults, instant.

For |KB| = 1M: 384M mults per query. Still tractable but not snappy.

For |KB| = 1B: not feasible per query.

FAISS-flat is the canonical implementation. Optimized BLAS does the dot products.

IVF (Inverted File Index)¶

Cluster the doc vectors into K clusters via k-means (offline). At query time:

Find the p clusters closest to q (O(K × d)).
Search exhaustively within those p clusters (O(|KB| × p/K × d)).

For K = sqrt(|KB|), p = 10: ~10× speedup over flat with ~1-2% recall loss.

Trades exactness for speed. Tunable.

HNSW (Hierarchical Navigable Small World)¶

A multi-layer graph where each node is a doc embedding. Edges connect "nearby" nodes. Query: start at top layer (sparse), greedy-walk to the nearest node, descend to next layer, repeat. Visit ~O(log |KB|) nodes per query.

Recall: typically 95-99% with proper tuning.
Build time: longer than flat; comparable to IVF.
Memory: ~1.5-2× the flat index size due to graph edges.
Query time: sub-linear in |KB|.

FAISS-HNSW is the standard implementation. Production systems use it for >1M-doc corpora.

LSH (Locality-Sensitive Hashing)¶

Hash-based; older technique. Less competitive than HNSW for high-dim embeddings. Mentioned for completeness; not used in modern systems.

For Phase 29: which index?¶

|KB| ≈ 50 docs, d = 384. The full dot-product matrix is 50 × 384 = 19200 entries (~76 KiB). FAISS-flat is dramatically overkill — even pure NumPy D @ q runs in microseconds.

So our production path is FAISS-flat. We implement HNSW in src/minirag/index.py for pedagogical purposes — to feel the difference at toy scale and to have a path to larger KBs.

Lab 02 measures: HNSW recall@5 on our 50-doc KB should be 100% (it's small enough that the approximate index finds everything exact). At toy scale, HNSW's only "cost" is the build time.

A FAISS-flat walkthrough¶

import faiss
import numpy as np

# Build (offline)
embeddings = encoder.encode(docs)        # shape (N, 384), float32
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
index = faiss.IndexFlatIP(384)            # inner-product index
index.add(embeddings)
# Save: faiss.write_index(index, "kb.faiss")

# Query (online)
q_emb = encoder.encode([query])           # shape (1, 384)
q_emb = q_emb / np.linalg.norm(q_emb)
scores, ids = index.search(q_emb, k=10)
# ids[0] is array of top-10 doc indices

IndexFlatIP is "inner product" — for unit-normalized vectors, equivalent to cosine. The add and search calls are both batch-friendly.

A FAISS-HNSW walkthrough¶

# Build
index = faiss.IndexHNSWFlat(384, M=32)    # M = number of edges per node
index.hnsw.efConstruction = 200             # build-time accuracy parameter
index.add(embeddings)

# Query
index.hnsw.efSearch = 50                    # query-time accuracy parameter
scores, ids = index.search(q_emb, k=10)

Tunable: M (graph density), efConstruction (build accuracy), efSearch (query accuracy). Higher = slower but more recall.

Persistence¶

faiss.write_index(index, path) + faiss.read_index(path). Plus a separate JSON-or-pickle of (doc_id → doc_text) mappings, since the index only stores vectors and integer IDs.

Convention in src/minirag/index.py: index.faiss (binary) + index_metadata.json (id → chunk_text + source + offsets).

What about embedding-and-index updates?¶

If you add a new doc to the KB:

FAISS-flat: index.add(new_embedding). Done.
FAISS-HNSW: index.add(new_embedding). The graph adjusts. (Removal is harder; needs a rebuild.)

For Phase 29, the KB is static. We rebuild on every change.

Per-doc metadata¶

A chunk in the KB carries more than text and embedding:

chunk_id: stable identifier (grammar-rules/irregular-verbs/eat-past-participle-v1).
source: the source document or category.
text: the chunk's actual text (for the reader's prompt).
embedding: the vector (in the FAISS index).
Optional: bm25_tokens (tokenized text for BM25).

src/minirag/index.py stores all of these in a dict[chunk_id, ChunkRecord] alongside the FAISS index.

Chunk size effects on retrieval quality¶

Empirically, for our KB:

1-sentence chunks: each chunk is small, but the embedding captures little context. Mid-range hit-rate.
1-rule chunks (2-3 sentences, hand-aligned): each chunk is one logical unit. Higher hit-rate.
10-rule chunks: chunks contain multiple rules; query for a specific rule may retrieve a chunk containing it but the relevant slice is buried. Lower hit-rate.

Lab 01's sensitivity check varies chunk size; we expect the elbow around "one rule per chunk".

Compression¶

For very large KBs, you might quantize the embeddings to save memory (8-bit or 4-bit). FAISS supports this via IndexIVFPQ (product quantization). For 50 docs at 384-dim fp32: 76 KiB — quantization buys nothing. Mentioned but not used.

Drill problems¶

Solutions at phase open in solutions/02-chunking-and-indexes-ref.md.

A 50-doc KB with 384-dim fp32 embeddings. Memory footprint? If we go to 8-bit embeddings, what's the new footprint? What's the precision cost (in approximate cosine error)?
HNSW has parameters M and efSearch. Argue: doubling efSearch doubles query time and increases recall toward 100%. What's the diminishing-returns regime?
Consider a chunk that contains both "the past participle of eat is eaten" and "the past participle of write is written". A query for "past participle of eat" embeds near this chunk. What's the failure mode at the reader stage? (Hint: which sentence does the model focus on?)
Sketch a hierarchical (parent-child) chunking pipeline for the grammar-rule KB. What's the child chunk? What's the parent? When does each get used?
For our 50-doc KB, would you use FAISS-flat or HNSW? Defend with timing estimates.

One-paragraph recap¶

Chunking decides what unit of text becomes a single embedding; for our small grammar-rule KB, we use 1-rule chunks (~2-3 sentences) hand-aligned. Vector indexes decide how to find top-k nearest neighbors: flat (exact, cheap for small KBs), IVF (cluster-based, sub-linear), HNSW (graph-based, sub-linear with high recall). For our 50-doc KB, FAISS-flat is sufficient and instant; HNSW is implemented for pedagogical completeness. Critical engineering: lexical anchoring inside chunks improves both BM25 and dense retrieval, especially with the small embedding model we use.

Next: theory/03-hybrid-search-and-reranking.md.