Break 00 — Tie input/output embeddings to different vocab sizes¶
We break the tie between the input embedding (V_in = 512) and the output LM head (V_out = 256). Phase 17 will "tie" these two matrices (W_out = E.T), but if their sizes differ, it fails at runtime with a shape mismatch. This is a classic bug that shows up when merging tokenizers.
Anchors: LYNX_CORTEX.md §4 / PHASE 13; theory §01 embedding as lookup; Phase 17 §03 tied embeddings; .claude/commands/break.md.
The break¶
In src/minimodel/nn/embedding.py (where the §A13 embedding lives) and src/minimodel/nn/lm_head.py:
# embedding.py
class Embedding(Module):
    def __init__(self, vocab_size: int, d_model: int) -> None:
        super().__init__()
        self.vocab_size = vocab_size
        self.weight = Parameter(np.random.uniform(..., (vocab_size, d_model)))
    ...

# lm_head.py, the "tied" version:
class LMHead(Module):
    def __init__(self, embedding: Embedding, vocab_size_out: int) -> None:
        # BUG: vocab_size_out is wrong; it should be the same as embedding.vocab_size.
        super().__init__()
        self.embedding = embedding
        self.vocab_size_out = vocab_size_out  # e.g., 256 instead of 512

    def forward(self, h: Tensor) -> Tensor:
        # Untied case would be: logits = h @ W_lm.T where W_lm is (V_out, d).
        # Tied case: logits = h @ self.embedding.weight.T → shape (B, T, V_in).
        # But the rest of the pipeline expects (B, T, V_out).
        logits = h @ self.embedding.weight.transpose((1, 0))
        return logits  # shape (B, T, V_in) ≠ (B, T, V_out), but who is checking?
The bug is the mismatch between embedding.vocab_size = 512 and lm_head.vocab_size_out = 256. If you tie them naively, the logits have shape (B, T, 512) while the cross-entropy expects (B, T, 256).
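The shape arithmetic can be reproduced outside the course code. The sketch below uses plain NumPy arrays as stand-ins for `embedding.weight` and the hidden states; the batch size, sequence length, `d_model`, and initialization range are illustrative assumptions, not values from the repository.

```python
import numpy as np

B, T, d_model = 2, 4, 8          # illustrative batch, sequence, and model sizes (assumed)
V_in, V_out = 512, 256           # vocab sizes from the break

rng = np.random.default_rng(0)
E = rng.uniform(-0.1, 0.1, size=(V_in, d_model))   # stand-in for embedding.weight
h = rng.standard_normal((B, T, d_model))           # stand-in for hidden states

# Tied projection: (B, T, d_model) @ (d_model, V_in) -> (B, T, V_in)
logits = h @ E.T
print(logits.shape)                    # (2, 4, 512)
print(logits.shape == (B, T, V_out))   # False
```

The last axis of the logits always follows the number of rows in the embedding matrix, so `vocab_size_out` has no influence on the tied output.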
Predict, then run¶
Predictions¶
- Runtime crash at the cross-entropy call (shape mismatch).
- If you silence the check (e.g., a naive index lookup against vocab_size_out), the model trains but its first 256 vocab slots become a chimera: partly the embedding's first 256 rows, partly the LM head's (256, d_model) matrix.
- Loss curve: starts at log(256) ≈ 5.55 (random baseline for 256 classes) and stays there or grows.
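The random-baseline figure in the last prediction can be checked numerically. This is a minimal sketch with a hand-written `cross_entropy` helper (hypothetical, not the course's loss function) applied to uniform logits, which model a network with no preference among classes.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood; logits (N, V), targets (N,) integer ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

V_out = 256
targets = np.random.default_rng(0).integers(0, V_out, size=32)
uniform_logits = np.zeros((32, V_out))     # every class equally likely
loss = cross_entropy(uniform_logits, targets)
print(round(float(loss), 2))               # 5.55
```

If a loss instead normalized over all 512 tied columns, the same calculation would give log(512) ≈ 6.24, which is one way to tell which vocabulary size the loss is actually seeing.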
Write predictions in learners/borja/phase-13/notes/breaks.md before running.
Observe¶
Diagnostics:
- Look at the traceback at the first forward pass. The shape mismatch should be loud.
- If the mismatch is silenced (a related bug), check logits.shape == (B, T, V_out) explicitly.
- The initial loss should be about log(V_out); verify with np.log(V_out) ≈ 5.55 for V_out = 256.
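The explicit shape check from the diagnostics can be written as a small guard. The helper name `check_logits_shape` and its error message are hypothetical; the sketch only assumes that logits expose a NumPy-style `.shape` tuple.

```python
import numpy as np

def check_logits_shape(logits, batch, seq_len, vocab_out):
    """Raise ValueError if logits are not (batch, seq_len, vocab_out)."""
    expected = (batch, seq_len, vocab_out)
    if logits.shape != expected:
        raise ValueError(f"logits shape {logits.shape} != expected {expected}")

B, T, V_in, V_out = 2, 4, 512, 256
tied_logits = np.zeros((B, T, V_in))    # same shape as the buggy tied head's output

try:
    check_logits_shape(tied_logits, B, T, V_out)
    message = "ok"
except ValueError as err:
    message = str(err)

print(message)   # logits shape (2, 4, 512) != expected (2, 4, 256)
```

Placing a guard like this right before the loss call turns a silent mismatch into the loud failure the break is meant to produce.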
Symptom Borja will see¶
- ValueError or RuntimeError at the cross-entropy call.
- Traceback points to either embedding.weight.T or to the loss function's shape check.
- If silenced: loss starts at log(V_out) and stays there.
Hidden cause (one sentence)¶
The LM head reuses the embedding matrix (tied) but the embedding's vocab size is 512 while the LM head's expected output size is 256 — shape mismatch in the tied path.
Hint cascade¶
- What is the shape of embedding.weight, and what shape does h @ embedding.weight.T produce?
- What shape does the cross-entropy loss expect from logits?
- Compare embedding.vocab_size to the value passed to LMHead(vocab_size_out=...).
Fix diff¶
class LMHead(Module):
    def __init__(self, embedding: Embedding) -> None:
        super().__init__()
        self.embedding = embedding
        self.vocab_size_out = embedding.vocab_size  # take from the source
(In Phase 17's mini-GPT we'll see this codified as model.lm_head = TiedHead(model.token_emb).)
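To see the fix end to end, here is a self-contained NumPy sketch. The two classes are simplified stand-ins that only borrow the names `Embedding` and `TiedHead`; they are not the repository's `Module`-based implementations, and the initialization range and `d_model` are assumptions.

```python
import numpy as np

class Embedding:
    """NumPy stand-in for the course's Embedding module (hypothetical simplification)."""
    def __init__(self, vocab_size, d_model, seed=0):
        self.vocab_size = vocab_size
        rng = np.random.default_rng(seed)
        self.weight = rng.uniform(-0.1, 0.1, size=(vocab_size, d_model))

    def forward(self, token_ids):
        return self.weight[token_ids]            # lookup: (B, T) -> (B, T, d_model)

class TiedHead:
    """Output head that reuses the embedding matrix; takes no separate vocab argument."""
    def __init__(self, embedding):
        self.embedding = embedding
        self.vocab_size_out = embedding.vocab_size   # single source of truth

    def forward(self, h):
        return h @ self.embedding.weight.T       # (B, T, d_model) -> (B, T, vocab_size)

emb = Embedding(vocab_size=512, d_model=8)
head = TiedHead(emb)
token_ids = np.array([[1, 5, 9], [0, 511, 42]])
logits = head.forward(emb.forward(token_ids))
print(logits.shape)   # (2, 3, 512)
```

Because the head has no `vocab_size_out` parameter of its own, there is no second number that can drift out of sync with the embedding.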
Why this teaches the concept¶
Weight tying (Press & Wolf 2017, "Using the Output Embedding to Improve Language Models", and Inan et al. 2017) is a standard language-model trick, widely used in transformers, that reuses the embedding matrix as the output projection (W_out = E.T). It removes the separate (V, d_model) output matrix, which halves the embedding-plus-head parameter count (not the whole model's), and tends to improve perplexity (by roughly 5%). It requires V_in == V_out. The bug here, passing inconsistent vocab sizes, is a common way tying breaks in practice: a refactor changes the tokenizer's vocab but not the LM head's. Phase 17's mini-GPT lab will use tied embeddings; this break ensures Borja has the shape-discipline reflex before then.
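The parameter saving is simple arithmetic. In this sketch V = 512 comes from the break, while d_model = 64 is an illustrative assumption.

```python
V, d_model = 512, 64                 # d_model = 64 is an illustrative assumption

untied = V * d_model + V * d_model   # embedding (V, d_model) plus a separate head (V, d_model)
tied = V * d_model                   # one shared matrix serves both roles
print(untied, tied)                  # 65536 32768
```

The saving applies only to the embedding and head; the parameters in the layers between them are unchanged.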
Next: Phase 14's /break on the LSTM gate sigmoid.