01 — Embeddings and Bi-Encoders¶
A bi-encoder is a model that maps text → vector. If two texts are semantically similar, their vectors should have a high dot product. This section derives how a bi-encoder is trained (contrastive loss), why cross-encoders are more accurate but more expensive, and when to choose each.
Bi-encoders: the basic idea¶
A bi-encoder is a neural network E: text → ℝ^d that maps any text (a sentence, a paragraph, a chunk) to a fixed-dimensional vector. The promise:
- ||E(text)||₂ is bounded (typically unit-normalized).
- Semantic similarity ≈ vector similarity. If text_A and text_B mean similar things, then E(text_A) · E(text_B) ≈ 1. If they mean unrelated things, E(text_A) · E(text_B) ≈ 0.
The retrieval pipeline becomes:
- Offline: for each KB document d, compute E(d) and store it.
- Online: for query q, compute E(q) and find the top-k documents by E(q) · E(d_i).
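The offline and online steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab's code: the random matrix stands in for real E(d) outputs, and the helper names `l2_normalize` and `top_k` are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row (or a single vector) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Return (doc_index, score) for the k highest dot products, best first."""
    scores = doc_matrix @ query_vec        # one dot product per document
    order = np.argsort(-scores)[:k]        # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Offline: stand-in document embeddings (a real system would store E(d) here).
rng = np.random.default_rng(0)
doc_matrix = l2_normalize(rng.normal(size=(50, 384)))

# Online: a stand-in query embedding, built as a slightly perturbed copy of document 7.
query_vec = l2_normalize(doc_matrix[7] + 0.01 * rng.normal(size=384))

hits = top_k(query_vec, doc_matrix, k=3)
```

Because the document matrix is computed once, the per-query cost is a single matrix-vector product plus a sort.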
The "bi-" in bi-encoder refers to two independent encodings: E(q) and E(d) are computed independently, then compared. Compare to cross-encoders below.
How bi-encoders are trained: contrastive loss¶
A bi-encoder is typically a transformer. The training objective: contrast positive pairs against negative pairs.
Given:
- Positive pair (q, d^+): query q and a known-relevant document d^+.
- Negative pairs (q, d^-_1), …, (q, d^-_k): same query with irrelevant documents.
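Loss: InfoNCE (contrastive):

$$
\mathcal{L} = -\log \frac{\exp\big(E(q) \cdot E(d^+) / \tau\big)}{\exp\big(E(q) \cdot E(d^+) / \tau\big) + \sum_{i=1}^{k} \exp\big(E(q) \cdot E(d^-_i) / \tau\big)}
$$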
τ is a temperature, typically 0.05–0.1. The loss pushes the positive pair's dot product up and the negative pairs' dot products down.
Mechanically, this is identical to softmax classification with cross-entropy: the positive pair is the "correct class", and the negative pairs are "wrong classes". The gradient flows through both E(q) and E(d), which come from the same encoder applied twice.
In-batch negatives¶
For efficiency, training uses in-batch negatives: in a batch of B (query, positive) pairs, each query treats the other B-1 queries' positive documents as its negatives. This avoids needing a separate negative-sampling step.
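A minimal NumPy sketch of InfoNCE with in-batch negatives follows. It computes the loss value only (no gradients), assumes unit-normalized inputs, and uses a function name, `info_nce_in_batch`, chosen for this example. The positives sit on the diagonal of the B × B similarity matrix, which is what makes the loss a row-wise softmax cross-entropy.

```python
import numpy as np

def info_nce_in_batch(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    """Mean InfoNCE loss where d[i] is the positive for q[i] and every d[j], j != i, is a negative.

    q, d: arrays of shape (B, dim), assumed unit-normalized.
    """
    logits = (q @ d.T) / tau                              # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

# Toy check: correctly paired embeddings should score a lower loss than shifted pairs.
rng = np.random.default_rng(0)
d = rng.normal(size=(8, 16))
d /= np.linalg.norm(d, axis=1, keepdims=True)

aligned = info_nce_in_batch(d, d)                         # each query equals its positive
misaligned = info_nce_in_batch(np.roll(d, 1, axis=0), d)  # each query matches the wrong row
```

A real training loop would compute the same quantity in an autodiff framework so the gradient can reach the encoder.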
For us (Phase 29) this is conceptual — we use a pretrained sentence-transformers/all-MiniLM-L6-v2 (22M params, English) and don't train our own bi-encoder. But understanding the loss explains why the embedding space behaves the way it does.
Why bi-encoders work for retrieval¶
Two reasons:
- Vectorizable. Once trained, E(d) for all KB documents can be precomputed once. Query time: one embedding pass + a dot-product over the KB. With an approximate nearest-neighbor index, this scales to billions of docs.
- Distillation. Many modern bi-encoders are distilled from cross-encoders or LLM-judges. The teacher provides high-quality (query, doc, label) triples; the bi-encoder learns to approximate the teacher's similarity scores in the embedding space.
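Sentence-transformers' all-MiniLM-L6-v2 is a widely used example: 22M params, built on a distilled MiniLM backbone and fine-tuned with a contrastive objective on over 1 billion sentence pairs, including MS MARCO and several QA datasets. Embedding dimension 384. CPU-fast (~1000 sentences/sec, depending on hardware).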
Cross-encoders: the heavyweight alternative¶
A cross-encoder takes a (query, doc) pair as a single joint input and produces one scalar relevance score, s(q, d) ∈ ℝ.
Mechanically: concatenate q and d with a [SEP] token, run through a transformer, take the [CLS] output, project to scalar.
- More accurate than bi-encoders. The query and doc tokens can attend to each other; the model can resolve "which doc actually answers this query" with full attention.
- Much more expensive. Per query: one transformer forward pass per candidate. For top-1000 candidates: 1000 forward passes. Nothing can be precomputed per document, because every score depends on the query.
So you can't use a cross-encoder directly on a large corpus. You use it as a reranker — first retrieve top-k with a bi-encoder, then rerank top-k with a cross-encoder.
The two-stage pipeline¶
Standard production RAG:
- Recall stage: bi-encoder retrieves top-100 from the full KB. Cheap.
- Precision stage: cross-encoder reranks the top-100. Expensive but bounded.
The bi-encoder optimizes for recall: we want the gold doc in the top-100. The cross-encoder optimizes for precision: we want the top-1 to be the gold doc.
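The two stages can be sketched as one function. This is an illustration under stated assumptions: `toy_embed` and `toy_cross_score` are hypothetical stand-ins (a bag-of-words vector and a word-overlap score), not a real bi-encoder or cross-encoder, and the three-document KB is made up for the example.

```python
import numpy as np
from typing import Callable, Sequence

def retrieve_then_rerank(
    query: str,
    query_vec: np.ndarray,
    docs: Sequence[str],
    doc_matrix: np.ndarray,
    cross_score: Callable[[str, str], float],
    recall_k: int = 100,
    final_k: int = 5,
) -> list[int]:
    """Return indices of the final_k best docs after a two-stage search."""
    # Recall stage: one matrix-vector product over the whole KB.
    coarse = doc_matrix @ query_vec
    candidates = np.argsort(-coarse)[:recall_k]
    # Precision stage: one (expensive) scorer call per surviving candidate.
    reranked = sorted(candidates, key=lambda i: cross_score(query, docs[i]), reverse=True)
    return [int(i) for i in reranked[:final_k]]

docs = ["eat ate eaten", "write wrote written", "the past participle of eat is eaten"]
vocab = sorted({w for doc in docs for w in doc.split()})

def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for E: a unit-normalized bag-of-words count vector."""
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in vocab])
    return vec / np.linalg.norm(vec)   # assumes at least one in-vocabulary word

def toy_cross_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)

doc_matrix = np.stack([toy_embed(d) for d in docs])
query = "past participle of eat"
ranking = retrieve_then_rerank(
    query, toy_embed(query), docs, doc_matrix, toy_cross_score, recall_k=2, final_k=1
)
```

The expensive scorer runs at most `recall_k` times per query, regardless of KB size, which is what keeps the precision stage bounded.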
For Phase 29's tiny KB (~50 docs), the bi-encoder alone covers the full KB at top-50. The cross-encoder reranker still helps precision at the top, but the marginal value is smaller. Lab 03 measures the difference.
Embedding the query vs embedding the doc¶
A subtle point: most bi-encoders use the same encoder for both query and doc (one shared set of weights, sometimes called a siamese setup). Sentence-transformers' all-MiniLM-L6-v2 is one of these.
Some bi-encoders use two separate encoders (called "asymmetric" bi-encoders): one tuned for short queries, one for longer documents. ColBERT (Khattab & Zaharia, 2020) uses one encoder but with multi-vector representations (per-token rather than per-sentence). Multi-vector models are typically more accurate but more memory-intensive, because they store one vector per token instead of one per text.
For Phase 29 we use the standard same-encoder approach. The KB documents are short (1-3 sentences each), comparable in length to queries.
Mean-pooling vs [CLS] vs other¶
A bi-encoder's encoder produces a sequence of token embeddings h_1, …, h_n. The final fixed-dim vector is some pooling of these:
- [CLS] pooling: use the [CLS] token's embedding. Standard in BERT-derived models.
- Mean pooling: average all token embeddings (typically excluding padding). The pretrained sentence-transformers models use this.
- Max pooling, attention pooling, etc.: less common.
Most of the sentence-transformers family, including all-MiniLM-L6-v2, uses mean pooling. We follow suit.
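The two main pooling options can be sketched in NumPy as below. This is a sketch, not the library's implementation: the array shapes and the function names `mean_pool` and `cls_pool` are assumptions for this example. The attention mask is what keeps padded positions out of the average.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, skipping padded positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Take the first position, where BERT-style models place [CLS]."""
    return token_embeddings[:, 0, :]

# One sequence of three positions; the last one is padding and must not affect the mean.
h = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled_mean = mean_pool(h, mask)   # averages only the two real tokens
pooled_cls = cls_pool(h)           # returns the first token's vector
```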
From-scratch alternative: MiniGPT mean-pool¶
For pedagogical purposes, lab 01 implements a from-scratch bi-encoder using MiniGPT's embedding table: look up each token's embedding and mean-pool the results into a single vector.
This is much worse than all-MiniLM-L6-v2:
- The MiniGPT embedding table wasn't trained for sentence similarity. It was trained for next-token prediction.
- The mean-pool ignores position and context.
- The embedding dimension is tiny (64).
But it's free, it requires no external model, and it illustrates the abstract concept of "text → vector". Lab 01 compares the two; sentence-transformers wins by ~30 percentage points hit-rate@5.
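The lab's own code is not reproduced here. As a rough sketch of the idea only, the example below assumes a hypothetical six-word vocabulary, a whitespace tokenizer, and a randomly initialized 64-dim table standing in for MiniGPT's trained one; the class name `EmbeddingTableEncoder` is likewise made up for illustration.

```python
import numpy as np

class EmbeddingTableEncoder:
    """Text -> vector by averaging rows of a token-embedding table (no attention, no positions)."""

    def __init__(self, table: np.ndarray, vocab: dict[str, int], unk_id: int = 0):
        self.table = table      # (vocab_size, dim)
        self.vocab = vocab      # token -> row index
        self.unk_id = unk_id    # row used for out-of-vocabulary tokens

    def encode(self, text: str) -> np.ndarray:
        ids = [self.vocab.get(tok, self.unk_id) for tok in text.lower().split()]
        if not ids:
            ids = [self.unk_id]                 # empty text falls back to the unknown token
        pooled = self.table[ids].mean(axis=0)   # mean-pool the looked-up rows
        return pooled / np.linalg.norm(pooled)  # unit-normalize the result

# Hypothetical vocabulary and random table, for illustration only.
rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "eat": 1, "ate": 2, "eaten": 3, "write": 4, "wrote": 5}
encoder = EmbeddingTableEncoder(rng.normal(size=(len(vocab), 64)), vocab)

a = encoder.encode("eat ate eaten")
b = encoder.encode("eaten ate eat")   # same words, different order
```

The two encodings match up to floating-point rounding, which makes the "ignores position and context" limitation concrete: this encoder is a bag of words.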
Normalization¶
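Cosine similarity divides by the vector norms, so it does not itself require normalized inputs. all-MiniLM-L6-v2 ships with a normalization layer, so its outputs are already unit-length (||E(x)||₂ = 1); for other sentence-transformers models, pass normalize_embeddings=True to encode.
For raw dot-product similarity, unit-normalization makes the dot product equal to the cosine, which is useful for FAISS-style indexes that operate on dot product. Always normalize after pooling; it is cheap.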
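A small numeric check of that equivalence, with two made-up 2-dim vectors (the helper names are chosen for this example):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm."""
    return x / np.linalg.norm(x)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity computed directly from its definition."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])   # norm 5
b = np.array([4.0, 3.0])   # norm 5

raw_dot = float(a @ b)                                # 24.0, grows with vector length
unit_dot = float(l2_normalize(a) @ l2_normalize(b))   # 24 / 25, same value as the cosine
```

After normalization, an inner-product index ranks documents exactly as cosine similarity would.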
The embedding-space geometry¶
Trained bi-encoders produce embedding spaces with these properties:
- Topical clusters: docs about the same topic cluster together.
- Within-cluster geometry: similar-sentence-structure docs are close; different-structure docs are far.
- Anisotropy: pretrained transformers tend to produce embeddings that occupy a narrow cone in ℝ^d. Contrastive training partially fixes this; whitening the embeddings afterwards is a further post-processing option.
For our Phase 29 KB, queries like "past participle of eat?" should cluster near rule chunks containing "eat" and "participle". Cross-cluster confusion is the failure mode — observed in lab 01 when the corpus contains near-duplicate rules (e.g., for write and wrote).
Caching considerations¶
The KB-doc embeddings are computed once and cached. The query embedding is computed per query — but query embedding is fast (one forward pass, ~50ms on CPU). FAISS-flat search over 50 docs is ~1ms. Total retrieval latency: <100ms. Reranker adds another ~200ms for 10 candidates.
For Phase 29's tiny KB this is fine. For 10M-doc corpora, FAISS-HNSW gets you to <50ms latency.
Why we use a pretrained model¶
Could we train a bi-encoder from scratch on the verb-grammar corpus? Yes. Should we? No:
- The corpus is tiny (~500 sentences). Not enough to train a bi-encoder from scratch.
- The contrastive-loss training requires negative examples; constructing meaningful negatives for a small domain is hard.
- all-MiniLM-L6-v2 was trained on huge datasets (over 1 billion sentence pairs). The downstream gain from in-domain fine-tuning is small for our task.
We use the pretrained model as-is. It is listed in §A8 (the framework / tooling list of the addendum).
Drill problems¶
Solutions are available when the phase opens, in solutions/01-embeddings-and-biencoders-ref.md.
- The bi-encoder produces a 384-dim embedding for each chunk. For 50 chunks, the KB embedding store occupies how many bytes (fp32)? In MiB?
- Argue why mean-pooling is better than [CLS] pooling for sentence-level tasks. (Hint: think about which tokens carry meaning.)
- Construct a pair of grammar-rule sentences that are semantically equivalent but lexically different. Their sentence similarity should be high; what would a poor bi-encoder do instead?
- Cross-encoders score (q, d) pairs by attention. Bi-encoders embed q and d independently. Argue what the cross-encoder can "see" that a bi-encoder cannot. (Hint: token-level interactions.)
- The InfoNCE loss has a temperature τ. As τ → ∞, the softmax becomes nearly uniform over positives and negatives. As τ → 0, it saturates on the highest-scoring candidate. What's the qualitative effect of τ = 0.05 vs τ = 0.5?
One-paragraph recap¶
A bi-encoder is a transformer that maps text to a fixed-dim vector with the property that semantically similar texts produce vectors with high dot product. Training uses InfoNCE contrastive loss with in-batch negatives. Once trained, the bi-encoder enables vectorized retrieval: precompute all KB-doc embeddings once, dot product against a fresh query embedding per query. Cross-encoders are more accurate but require one forward pass per (q, d) pair, so they're used only for reranking top-k from a bi-encoder. Phase 29 uses a pretrained all-MiniLM-L6-v2 for production and a from-scratch MiniGPT mean-pool for pedagogical contrast. The latter is much worse but illustrates the concept.
Next: theory/02-chunking-and-indexes.md.