Break 00 — Tie input/output embeddings to different vocab sizes

We break the tie between the input embedding (V_in = 512) and the output LM head (V_out = 256). Phase 17 will "tie" these two matrices (W_out = E.T), but if their sizes differ, the tie fails at runtime with a shape mismatch. It is a classic bug that shows up when merging tokenizers.

Anchors: LYNX_CORTEX.md §4 / PHASE 13; theory §01 embedding as lookup; Phase 17 §03 tied embeddings; .claude/commands/break.md.


The break

In src/minimodel/nn/embedding.py (where the §A13 embedding lives) and src/minimodel/nn/lm_head.py:

# embedding.py
class Embedding(Module):
    def __init__(self, vocab_size: int, d_model: int) -> None:
        super().__init__()
        self.vocab_size = vocab_size
        self.weight = Parameter(np.random.uniform(...,(vocab_size, d_model)))
    ...

# lm_head.py — the "tied" version:
class LMHead(Module):
    def __init__(self, embedding: Embedding, vocab_size_out: int) -> None:
        # BUG: vocab_size_out is wrong — should be the same as embedding.vocab_size.
        super().__init__()
        self.embedding = embedding
        self.vocab_size_out = vocab_size_out  # e.g., 256 instead of 512

    def forward(self, h: Tensor) -> Tensor:
        # Untied case would be: logits = h @ W_lm.T where W_lm is (V_out, d).
        # Tied case: logits = h @ self.embedding.weight.T  → shape (B, T, V_in).
        # But the rest of the pipeline expects (B, T, V_out).
        logits = h @ self.embedding.weight.transpose((1, 0))
        return logits  # shape (B, T, V_in) ≠ (B, T, V_out) — but who is checking?

The bug is the mismatch between embedding.vocab_size = 512 and lm_head.vocab_size_out = 256. If you tie them naively, the logits have shape (B, T, 512) while the cross-entropy expects (B, T, 256).
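The shape arithmetic can be checked in isolation. The following is a minimal NumPy sketch, not the course's `Embedding` or `LMHead` code: the batch, sequence, and model sizes, the initialization range, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np

# Illustrative sizes (assumed); only V_in and V_out come from this break.
B, T, d_model = 2, 3, 8
V_in, V_out = 512, 256

rng = np.random.default_rng(0)
E = rng.uniform(-0.1, 0.1, size=(V_in, d_model))  # embedding matrix, shape (V_in, d)
h = rng.normal(size=(B, T, d_model))              # hidden states, shape (B, T, d)

# Tied head: (B, T, d) @ (d, V_in) -> (B, T, V_in)
logits = h @ E.T

print(logits.shape)                # (2, 3, 512)
print(logits.shape[-1] == V_out)   # False: the last axis is V_in, not V_out
```

The matrix product itself succeeds; the disagreement only surfaces when something downstream compares the last axis against V_out.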

Predict, then run

Predictions

  1. Runtime crash at the cross-entropy call:
    ValueError: shape mismatch — logits (B, T, 512) vs targets (B, T) with 256 classes
    
  2. If you silence the check (e.g., naive index lookup against vocab_size_out), the model trains but its first 256 vocab slots become a chimera — partially the embedding's first 256 rows, partially the LM head's (256, d_model) matrix.
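  3. Loss curve: starts near log(256) ≈ 5.55 (the uniform-guess baseline for 256 classes) and stays there or grows. If the softmax instead runs over all 512 logit columns, the uniform baseline is log(512) ≈ 6.24.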

Write predictions in learners/borja/phase-13/notes/breaks.md before running.

Observe

just exp 13-train-cbow --tag broken-tied-mismatch

Diagnostics:

  1. Look at the traceback from the first forward pass. The shape mismatch should be loud.
  2. If the mismatch is silenced (a related bug), check logits.shape == (B, T, V_out) explicitly.
  3. The loss should be log(V_out) — verify with np.log(V_out) ≈ 5.55 for V_out = 256.
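Diagnostics 2 and 3 can be sketched as two small helpers. Both function names are hypothetical and not part of the course codebase; the baseline helper assumes an exactly uniform prediction, which real random initialization only approximates.

```python
import numpy as np

def check_logits_shape(logits, batch, seq_len, vocab_out):
    """Raise if logits are not (batch, seq_len, vocab_out). Hypothetical helper."""
    expected = (batch, seq_len, vocab_out)
    if logits.shape != expected:
        raise ValueError(f"logits shape {logits.shape} != expected {expected}")

def uniform_baseline_loss(num_classes):
    """Cross-entropy of a perfectly uniform prediction over num_classes."""
    logits = np.zeros(num_classes)                        # all classes equally likely
    log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
    return float(-log_probs[0])                           # same value for any target

print(round(uniform_baseline_loss(256), 2))  # 5.55
print(round(uniform_baseline_loss(512), 2))  # 6.24

try:
    check_logits_shape(np.zeros((2, 3, 512)), batch=2, seq_len=3, vocab_out=256)
except ValueError as err:
    print(err)
```

Comparing the first logged loss against both baselines is one way to tell which softmax width the loss function actually saw.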

Symptom Borja will see

  • ValueError or RuntimeError at the cross-entropy call.
  • Traceback points to either embedding.weight.T or to the loss function's shape check.
  • If silenced: loss starts at log(V_out) and stays there.

Hidden cause (one sentence)

The LM head reuses the embedding matrix (tied) but the embedding's vocab size is 512 while the LM head's expected output size is 256 — shape mismatch in the tied path.

Hint cascade

  1. What is the shape of embedding.weight, and what shape does h @ embedding.weight.T produce?
  2. What shape does the cross-entropy loss expect from logits?
  3. Compare embedding.vocab_size to the value passed to LMHead(vocab_size_out=...).

Fix diff

class LMHead(Module):
    def __init__(self, embedding: Embedding) -> None:
        super().__init__()
        self.embedding = embedding
        self.vocab_size_out = embedding.vocab_size   # take from the source

(In Phase 17's mini-GPT we'll see this codified as model.lm_head = TiedHead(model.token_emb).)
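One way such a tied head could look is sketched below. This is an assumption-laden illustration, not the Phase 17 implementation: the `Embedding` class here is a plain NumPy stand-in for the course's `Module`-based one, and the `TiedHead` body, its property, and its error message are hypothetical.

```python
import numpy as np

class Embedding:
    """Minimal NumPy stand-in for the course's Embedding (hypothetical)."""
    def __init__(self, vocab_size, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab_size = vocab_size
        self.weight = rng.uniform(-0.1, 0.1, size=(vocab_size, d_model))

class TiedHead:
    """Output head that reuses the embedding matrix and takes no vocab argument."""
    def __init__(self, embedding):
        self.embedding = embedding

    @property
    def vocab_size_out(self):
        # Read from the live weight so the size cannot drift from the embedding.
        return self.embedding.weight.shape[0]

    def forward(self, h):
        d_model = self.embedding.weight.shape[1]
        if h.shape[-1] != d_model:
            raise ValueError(f"h has last dim {h.shape[-1]}, expected {d_model}")
        # (B, T, d) @ (d, V) -> (B, T, V)
        return h @ self.embedding.weight.T

emb = Embedding(vocab_size=512, d_model=16)
head = TiedHead(emb)
logits = head.forward(np.zeros((2, 3, 16)))
print(logits.shape, head.vocab_size_out)  # (2, 3, 512) 512
```

Because the constructor has no `vocab_size_out` parameter, there is no second number that can fall out of sync with the embedding.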

Why this teaches the concept

Weight tying (Press & Wolf 2017, "Using the Output Embedding to Improve Language Models", and Inan et al. 2017) is a standard transformer trick that reuses the embedding matrix as the output projection: W_lm = E, so logits = h @ E.T. This removes one of the two (V, d_model) matrices, halving the embedding-plus-head parameter count, and improves perplexity by ~5%. It requires V_in == V_out. The bug here, passing inconsistent vocab sizes, is the most common way the tying breaks in practice (a refactor changes the tokenizer's vocab but not the LM head's). Phase 17's mini-GPT lab will use tied embeddings; this break ensures Borja has the shape-discipline reflex before Phase 17.
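The parameter saving is simple arithmetic, sketched here for the embedding and head only. The function name is hypothetical and d_model = 64 is an assumed value, not one taken from the course; the rest of the model's parameters are unaffected by tying.

```python
def embedding_and_head_params(vocab_size, d_model, tied):
    """Count parameters in the embedding plus the LM head (bias-free)."""
    embedding = vocab_size * d_model
    lm_head = 0 if tied else vocab_size * d_model  # tied head adds no new weights
    return embedding + lm_head

V, d_model = 512, 64  # d_model is an assumed illustrative size

untied = embedding_and_head_params(V, d_model, tied=False)
tied = embedding_and_head_params(V, d_model, tied=True)
print(untied, tied)  # 65536 32768
```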


Next: Phase 14's /break on the LSTM gate sigmoid.