01 — Embeddings and Bi-Encoders

A bi-encoder is a model that maps text → vector. If two texts are semantically similar, their vectors should have a high dot product. This section derives how a bi-encoder is trained (contrastive loss), explains why cross-encoders are more accurate but more expensive, and covers when to choose each.

Bi-encoders: the basic idea

A bi-encoder is a neural network E: text → ℝ^d that maps any text (a sentence, a paragraph, a chunk) to a fixed-dimensional vector. The promise:

||E(text)||₂ is bounded (typically unit-normalized).
Semantic similarity ≈ vector similarity. If text_A and text_B mean similar things, then E(text_A) · E(text_B) ≈ 1. If they mean unrelated things, E(text_A) · E(text_B) ≈ 0.

The retrieval pipeline becomes:

Offline: for each KB document d, compute E(d) and store it.
Online: for query q, compute E(q) and find the top-k documents by E(q) · E(d_i).

The "bi-" in bi-encoder refers to two independent encodings: E(q) and E(d) are computed independently, then compared. Compare to cross-encoders below.
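The offline/online split above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab code: `normalize` and `top_k` are hypothetical helper names, and the document and query vectors are hand-built placeholders standing in for real encoder output.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit L2 norm along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k documents with the highest dot product, best first."""
    scores = doc_matrix @ query_vec      # one dot product per stored document
    return np.argsort(-scores)[:k]

# Offline: placeholder for E(d) over a 5-document KB. These are hand-built
# vectors, not the output of a real encoder.
doc_matrix = normalize(np.eye(5, 8) + 0.1)

# Online: placeholder for E(q), built as a slightly shifted copy of document 3.
query_vec = normalize(doc_matrix[3] + 0.05)

best = top_k(query_vec, doc_matrix, k=2)
print(best[0])  # 3
```

The document matrix is computed once and reused; only the query vector and a single matrix-vector product are needed per query.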

How bi-encoders are trained: contrastive loss

A bi-encoder is typically a transformer. The training objective: contrast positive pairs against negative pairs.

Given:

Positive pair (q, d^+): query q and a known-relevant document d^+.
Negative pairs (q, d^-_1), …, (q, d^-_k): the same query with irrelevant documents.

Loss: InfoNCE (contrastive):

\[ \mathcal{L}(q, d^+, \{d^-_i\}) = - \log \frac{\exp(E(q) \cdot E(d^+) / \tau)}{\exp(E(q) \cdot E(d^+) / \tau) + \sum_i \exp(E(q) \cdot E(d^-_i) / \tau)} \]

τ is a temperature, typically 0.05–0.1. The loss pushes the positive pair's dot product up and the negative pairs' dot products down.

This is mechanically identical to softmax classification with a cross-entropy loss: the positive pair is the "correct class", and the negative pairs are the "wrong classes". The gradient flows through both E(q) and E(d), which come from the same encoder applied twice.
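To make the softmax analogy concrete, here is a small NumPy sketch of the InfoNCE formula for a single query. The function name `info_nce` and the toy 2-d vectors are illustrative assumptions; real training would compute this on encoder outputs inside an autodiff framework.

```python
import numpy as np

def info_nce(q: np.ndarray, d_pos: np.ndarray, d_negs: np.ndarray, tau: float = 0.05) -> float:
    """InfoNCE loss for one query, one positive, and k negatives (unit vectors)."""
    # Logits: the positive score first, then the k negative scores.
    logits = np.concatenate(([q @ d_pos], d_negs @ q)) / tau
    # Numerically stable log-softmax; the "correct class" is index 0.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])

q = np.array([1.0, 0.0])
d_pos = np.array([1.0, 0.0])
d_negs = np.array([[0.0, 1.0], [-1.0, 0.0]])

# Positive scores highest: the loss is close to 0.
loss_good = info_nce(q, d_pos, d_negs)
# Swap roles so a "negative" outranks the "positive": the loss is large.
loss_bad = info_nce(q, d_negs[0], np.stack([d_pos, d_negs[1]]))
# All scores equal: the loss is log(k + 1), here log(3).
loss_tied = info_nce(q, q, np.stack([q, q]))
```

The tied case is a useful sanity check: when the model cannot tell the positive from the k negatives, the loss equals log(k + 1), the cross-entropy of a uniform guess.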

In-batch negatives

For efficiency, training uses in-batch negatives: in a batch of B (query, positive) pairs, each query treats the positives of the other B-1 queries as its negatives. This avoids needing a separate negative-sampling step.
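In matrix form, in-batch negatives turn the loss into a row-wise softmax over a B×B score matrix whose diagonal holds the positives. The sketch below assumes unit-norm rows and uses the hypothetical name `in_batch_info_nce`; the identity-matrix inputs are toy data chosen so the result is easy to check by hand.

```python
import numpy as np

def in_batch_info_nce(Q: np.ndarray, D: np.ndarray, tau: float = 0.05) -> float:
    """Mean InfoNCE over a batch where D[i] is the positive for Q[i].

    Q, D: arrays of shape (B, dim) with unit-norm rows.
    """
    logits = (Q @ D.T) / tau                      # (B, B): row i scores query i against every doc
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for query i is on the diagonal; the other B-1 entries are negatives.
    return float(-np.mean(np.diag(log_probs)))

Q = np.eye(4)
aligned = in_batch_info_nce(Q, np.eye(4))                        # each query matches its own doc
misaligned = in_batch_info_nce(Q, np.roll(np.eye(4), 1, axis=0))  # each query matches another query's doc
```

One matrix product yields B positives and B·(B-1) negatives, which is why larger batches give a stronger training signal at little extra cost.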

For us (Phase 29) this is conceptual — we use a pretrained sentence-transformers/all-MiniLM-L6-v2 (22M params, English) and don't train our own bi-encoder. But understanding the loss explains why the embedding space behaves the way it does.

Why bi-encoders work for retrieval

Two reasons:

Vectorizable. Once trained, E(d) for all KB documents can be precomputed once. Query time: one embedding pass + a dot-product over the KB. Scales to billions of docs.
Distillation. Modern bi-encoders are distilled from cross-encoders or LLM-judges. The teacher provides high-quality (query, doc, label) triples; the bi-encoder learns to approximate the teacher's similarity scores in the embedding space.

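Sentence-transformers' all-MiniLM-L6-v2 is the canonical example: 22M params, embedding dimension 384. It starts from a distilled MiniLM checkpoint and is fine-tuned with a contrastive objective on a large collection of sentence pairs, including MS MARCO and several QA datasets. CPU-fast (~1000 sentences/sec).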

Cross-encoders: the heavyweight alternative

A cross-encoder takes a (query, doc) pair and produces a single relevance score:

\[ C: (q, d) \to \mathbb{R} \]

Mechanically: concatenate q and d with a [SEP] token, run through a transformer, take the [CLS] output, project to scalar.

More accurate than bi-encoders. The query and doc tokens can attend to each other; the model can resolve "which doc actually answers this query" with full attention.
Much more expensive. Per query: one transformer forward per candidate. For top-1000 candidates: 1000 forward passes. Nothing can be precomputed per doc, because every score depends on the query.

So you can't use a cross-encoder directly on a large corpus. You use it as a reranker — first retrieve top-k with a bi-encoder, then rerank top-k with a cross-encoder.

The two-stage pipeline

Standard production RAG:

Recall stage: bi-encoder retrieves top-100 from the full KB. Cheap.
Precision stage: cross-encoder reranks the top-100. Expensive but bounded.

The bi-encoder optimizes for recall: we want the gold doc in the top-100. The cross-encoder optimizes for precision: we want the top-1 to be the gold doc.
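The recall-then-precision flow can be sketched as follows. Everything here is illustrative: `retrieve_then_rerank` is a hypothetical helper, the vectors are hand-built, and `overlap_score` is a crude word-overlap stand-in for a real cross-encoder, used only so the example runs without a model.

```python
import numpy as np
from typing import Callable, Sequence

def retrieve_then_rerank(
    query: str,
    query_vec: np.ndarray,
    docs: Sequence[str],
    doc_matrix: np.ndarray,
    cross_score: Callable[[str, str], float],
    recall_k: int = 100,
    final_k: int = 5,
) -> list[int]:
    """Two-stage retrieval: dot-product recall, then pairwise reranking."""
    # Stage 1 (recall): cheap dot products over the whole KB.
    scores = doc_matrix @ query_vec
    candidates = np.argsort(-scores)[:recall_k]
    # Stage 2 (precision): one expensive call per candidate, bounded by recall_k.
    reranked = sorted(candidates, key=lambda i: cross_score(query, docs[i]), reverse=True)
    return [int(i) for i in reranked[:final_k]]

def overlap_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

docs = ["eat ate eaten", "write wrote written", "go went gone"]
doc_matrix = np.eye(3)                    # placeholder document embeddings
query_vec = np.array([0.6, 0.7, 0.1])     # placeholder query embedding; stage 1 prefers doc 1

result = retrieve_then_rerank(
    "past participle of eat", query_vec, docs, doc_matrix,
    overlap_score, recall_k=2, final_k=1,
)
print(result)  # [0]
```

In this toy case the first stage ranks the wrong document on top but keeps the right one in the candidate set, and the second stage promotes it. That is the division of labor: the first stage only has to avoid losing the gold document.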

For Phase 29's tiny KB (~50 docs), the bi-encoder alone covers the full KB at top-50. The cross-encoder reranker still helps precision at the top, but the marginal value is smaller. Lab 03 measures how much.

Embedding the query vs embedding the doc

A subtle point: most bi-encoders use the same encoder for both query and doc (shared weights, sometimes called a siamese setup). Sentence-transformers' all-MiniLM-L6-v2 is one of these.

Some bi-encoders use two separate encoders (called "asymmetric" bi-encoders): one tuned for short queries, one for longer documents. ColBERT (Khattab & Zaharia, 2020) takes a different route: one encoder, but multi-vector representations (per-token rather than per-sentence). These variants can be more accurate but cost more memory.

For Phase 29 we use the standard same-encoder approach. The KB documents are short (1-3 sentences each), comparable in length to queries.

Mean-pooling vs `[CLS]` vs other

A bi-encoder's encoder produces a sequence of token embeddings h_1, …, h_n. The final fixed-dim vector is some pooling of these:

[CLS] pooling: use the [CLS] token's embedding. Standard in BERT-derived models.
Mean pooling: average all token embeddings (typically excluding padding). The pretrained sentence-transformers models use this.
Max pooling, attention pooling, etc.: less common.

Since all-MiniLM-L6-v2 uses mean pooling, so do we.
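The two main pooling choices differ by only a few lines. The sketch below assumes token embeddings of shape (batch, tokens, dim) and a 0/1 attention mask marking padding; the function names and the toy numbers are illustrative.

```python
import numpy as np

def cls_pool(token_embs: np.ndarray) -> np.ndarray:
    """Use the first token's embedding (the [CLS] position) as the sentence vector."""
    return token_embs[:, 0, :]

def mean_pool(token_embs: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(token_embs.dtype)   # (B, n, 1)
    summed = (token_embs * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)               # guard against empty rows
    return summed / counts

# One sentence, three token slots, the last of which is padding.
token_embs = np.array([[[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled_mean = mean_pool(token_embs, attention_mask)   # [[2.0, 3.0]]
pooled_cls = cls_pool(token_embs)                     # [[1.0, 1.0]]
```

The padded slot holds a deliberately large value to show why the mask matters: an unmasked average would be dominated by padding.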

From-scratch alternative: MiniGPT mean-pool

For pedagogical purposes, lab 01 implements a from-scratch bi-encoder using MiniGPT's embedding table:

\[ E_{\text{MiniGPT}}(\text{text}) = \frac{1}{|tokens|} \sum_t \text{embed}(t) \]

This is much worse than all-MiniLM-L6-v2:

The MiniGPT embedding table wasn't trained for sentence similarity. It was trained for next-token prediction.
The mean-pool ignores position and context.
The embedding dimension is tiny (64).

But it's free, it requires no external model, and it illustrates the abstract concept of "text → vector". Lab 01 compares the two; sentence-transformers wins by ~30 percentage points in hit-rate@5.
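The mean-pool formula above amounts to a table lookup followed by an average. The sketch below is not the lab 01 implementation: `embed_text` is a hypothetical name, the token ids are arbitrary, and the random table merely stands in for MiniGPT's trained embedding table.

```python
import numpy as np

def embed_text(token_ids: list[int], embed_table: np.ndarray) -> np.ndarray:
    """Mean of the embedding-table rows for the given token ids."""
    return embed_table[token_ids].mean(axis=0)

rng = np.random.default_rng(0)
embed_table = rng.normal(size=(100, 64))   # stand-in table: vocab size 100, dim 64

a = embed_text([5, 17, 42], embed_table)
b = embed_text([42, 5, 17], embed_table)   # same tokens, different order
print(np.allclose(a, b))  # True
```

The two vectors match because averaging discards order, which is the "ignores position and context" weakness in concrete form: any reordering of the same tokens gets the same embedding.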

Normalization

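Cosine similarity is the dot product of unit-normalized vectors, so when embeddings satisfy ||E(x)||₂ = 1, dot product and cosine coincide. The all-MiniLM-L6-v2 pipeline ends with a normalization layer, so its outputs are already unit-norm; for sentence-transformers models that lack one, pass normalize_embeddings=True to encode().

This matters for FAISS-style indexes that operate on raw dot product: with unit-normalized vectors, inner-product search is cosine search. Always normalize after pooling; it is cheap.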
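A quick numeric check of the dot-product/cosine equivalence, using two hand-picked vectors. The helper names `l2_normalize` and `cosine` are illustrative.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors to unit L2 norm along the last axis."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)    # eps guards against all-zero vectors

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two vectors of any length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

raw_dot = float(a @ b)                               # 24.0, depends on vector length
cos_ab = cosine(a, b)                                # 24 / 25 = 0.96
unit_dot = float(l2_normalize(a) @ l2_normalize(b))  # also 0.96
```

The raw dot product mixes direction with magnitude; after normalization only direction is left, so ranking by dot product and ranking by cosine give the same order.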

The embedding-space geometry

Trained bi-encoders produce embedding spaces with these properties:

Topical clusters: docs about the same topic cluster together.
Within-cluster geometry: similar-sentence-structure docs are close; different-structure docs are far.
Anisotropy: pretrained transformers tend to produce embeddings that occupy a narrow cone in ℝ^d. Contrastive training partially fixes this; whitening the embeddings after the fact is another common remedy.

For our Phase 29 KB, queries like "past participle of eat?" should cluster near rule chunks containing "eat" and "participle". Cross-cluster confusion is the failure mode — observed in lab 01 when the corpus contains near-duplicate rules (e.g., for write and wrote).
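One rough way to see the anisotropy mentioned above is the average cosine similarity between all pairs of embeddings: near 0 for a well-spread space, near 1 for a narrow cone. The sketch below uses synthetic vectors with a large shared offset to mimic the cone, and mean-centering as the simplest correction (the first step of whitening). The data and the `mean_pairwise_cosine` helper are assumptions for illustration, not measurements from the Phase 29 KB.

```python
import numpy as np

def mean_pairwise_cosine(X: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows in X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    off_diag_sum = sims.sum() - np.trace(sims)   # drop the self-similarities
    return float(off_diag_sum / (n * (n - 1)))

rng = np.random.default_rng(0)
# Synthetic "narrow cone": a large shared offset plus small per-vector variation.
X = 5.0 + rng.normal(size=(200, 16))

before = mean_pairwise_cosine(X)                   # close to 1: everything points the same way
after = mean_pairwise_cosine(X - X.mean(axis=0))   # close to 0 once the shared offset is removed
```

When the average is high, unrelated texts already score well against each other, which compresses the gap between relevant and irrelevant documents.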

Caching considerations

The KB-doc embeddings are computed once and cached. The query embedding is computed per query — but query embedding is fast (one forward pass, ~50ms on CPU). FAISS-flat search over 50 docs is ~1ms. Total retrieval latency: <100ms. Reranker adds another ~200ms for 10 candidates.

For Phase 29's tiny KB this is fine. For 10M-doc corpora, FAISS-HNSW gets you to <50ms latency.

Why we use a pretrained model

Could we train a bi-encoder from scratch on the verb-grammar corpus? Yes. Should we? No:

The corpus is tiny (~500 sentences). Not enough to train a bi-encoder from scratch.
The contrastive-loss training requires negative examples; constructing meaningful negatives for a small domain is hard.
all-MiniLM-L6-v2 was trained on huge datasets. The downstream gain from in-domain fine-tuning is small for our task.

We use the pretrained model as-is. This choice is listed in §A8 (the framework / tooling list of the addendum).

Drill problems

Solutions at phase open in solutions/01-embeddings-and-biencoders-ref.md.

The bi-encoder produces a 384-dim embedding for each chunk. For 50 chunks, the KB embedding store occupies how many bytes (fp32)? In MiB?
Argue why mean-pooling is better than [CLS] pooling for sentence-level tasks. (Hint: think about which tokens carry meaning.)
Construct a pair of grammar-rule sentences that are semantically equivalent but lexically different. Their similarity should be high; what would a poor bi-encoder do instead?
Cross-encoders score (q, d) pairs by attention. Bi-encoders embed q and d independently. Argue what the cross-encoder can "see" that a bi-encoder cannot. (Hint: token-level interactions.)
The InfoNCE loss has a temperature τ. As τ → 0, the softmax sharpens and the loss is dominated by the hardest negative. As τ → ∞, the softmax flattens toward uniform and the loss approaches log(k+1) regardless of the scores. What's the qualitative effect of τ = 0.05 vs τ = 0.5?

One-paragraph recap

A bi-encoder is a transformer that maps text to a fixed-dim vector with the property that semantically similar texts produce vectors with high dot product. Training uses InfoNCE contrastive loss with in-batch negatives. Once trained, the bi-encoder enables vectorized retrieval: precompute all KB-doc embeddings once, dot product against a fresh query embedding per query. Cross-encoders are more accurate but require one forward pass per (q, d) pair, so they're used only for reranking top-k from a bi-encoder. Phase 29 uses a pretrained all-MiniLM-L6-v2 for production and a from-scratch MiniGPT mean-pool for pedagogical contrast. The latter is much worse but illustrates the concept.

Next: theory/02-chunking-and-indexes.md.