03 - Tied embeddings and the LM head
The language-model head reuses the same matrix as the input embedding. It is a one-character trick in the code that saves \(|V| \cdot d_\text{model}\) parameters and places input and output in a single shared space. We do not save much memory here, since the corpus is small, but we learn the principle.
The LM head
After the final LayerNorm, the residual stream \(h\) sits in \(\mathbb{R}^{T \times d_\text{model}}\). To turn this into a probability distribution over vocabulary tokens at each position, we apply a linear projection to vocab size, then a softmax:
\[
\text{logits} = h \, W_\text{LM}^\top, \qquad p = \text{softmax}(\text{logits})
\]
where \(W_\text{LM} \in \mathbb{R}^{|V| \times d_\text{model}}\). Naively, this is a new learnable matrix with \(|V| \cdot d_\text{model}\) parameters.
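But notice: the input embedding \(E \in \mathbb{R}^{|V| \times d_\text{model}}\) has the same shape. Indexing a row, \(E[w]\), embeds a token id as a vector; multiplying by the transpose, \(h E^\top\), scores a residual vector against every token's row, which "unembeds" it back toward a token id. The two operations are not exact inverses, but they are loosely inverse: a residual that resembles \(E[w]\) scores high for token \(w\).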
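The embed/unembed round trip can be checked numerically. The sketch below is illustrative only: it assumes unit-norm rows (not something Mini-GPT enforces) so that each row's dot product with itself is guaranteed to beat its dot product with any other row, and the token ids are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 64, 64          # Mini-GPT sizes quoted on this page

# Demo assumption: unit-norm rows, so E[w] . E[w] = 1 exceeds every E[w] . E[v].
E = rng.normal(size=(vocab_size, d_model))
E /= np.linalg.norm(E, axis=1, keepdims=True)

tokens = np.array([3, 1, 4, 1, 5])    # arbitrary token ids
h = E[tokens]                         # embed:   (T,) -> (T, d_model)
scores = h @ E.T                      # unembed: (T, d_model) -> (T, vocab_size)
recovered = scores.argmax(axis=-1)    # best-matching row for each position

print(h.shape, scores.shape)          # (5, 64) (5, 64)
print(recovered)                      # [3 1 4 1 5]
```

With the small Gaussian init used in practice the rows are not normalized, so the round trip is likely but not guaranteed to recover every token.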
Weight tying
Tied embeddings (Press & Wolf 2017, "Using the Output Embedding to Improve Language Models"; concurrent in Inan et al. 2017): set \(W_\text{LM} = E\). The input embedding matrix is the output projection matrix. The forward becomes:
\[
\text{logits} = h \, E^\top
\]
That's it: same matrix, used twice. Pattern:
    class MiniGPT:
        def __init__(self, vocab_size, d_model, ...):
            self.E = Parameter(np.random.randn(vocab_size, d_model) * 0.02)
            # ... blocks, LNs, etc.
            # NO self.W_LM. The LM head reuses self.E.

        def forward(self, tokens):
            h = self.E[tokens]          # (T, d_model): token embedding lookup
            for block in self.blocks:
                h = block(h)
            h = self.ln_final(h)
            logits = h @ self.E.T       # (T, vocab_size): LM head, tied
            return logits
Two benefits
1. Parameter savings
Tying eliminates one \(|V| \cdot d_\text{model}\) matrix. For Mini-GPT (\(|V| = 64, d_\text{model} = 64\)), that's 4,096 params, which is small. But for GPT-2 (\(|V| = 50257, d_\text{model} = 768\)), it's about 38.6M params, roughly 31% of the 124M total. At LLaMA-2 scale (\(|V| = 32000, d_\text{model} = 4096\)), the matrix holds 131M params. The savings are substantial at real scale.
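The arithmetic is easy to reproduce. This small sketch multiplies out the three configurations quoted above; the helper name `head_params` is made up for illustration.

```python
def head_params(vocab_size: int, d_model: int) -> int:
    """Size of one |V| x d_model matrix, i.e. the parameters tying removes."""
    return vocab_size * d_model

# (vocab_size, d_model) pairs as quoted in the text above.
configs = {
    "Mini-GPT": (64, 64),
    "GPT-2": (50257, 768),
    "LLaMA-2 scale": (32000, 4096),
}

for name, (v, d) in configs.items():
    print(f"{name:>14}: {head_params(v, d):>12,d} params")

# Share of GPT-2's 124M total that one such matrix represents.
gpt2_share = head_params(50257, 768) / 124e6
print(f"GPT-2 share: {gpt2_share:.1%}")
```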
For Mini-GPT, the savings are pedagogical, not practical. We tie anyway because:
- It teaches the principle.
- It couples the two roles of the matrix: an update that improves the input embedding of a token also changes how that token is scored at the output.
- It is what GPT-2 and many other transformers do, although some larger models (LLaMA-2, for example) keep the head untied.
2. Conceptual symmetry
Tying says: "the same notion of what a token is (the embedding) determines what a hidden state means about a token (the unembedding)." There is one vocabulary space, used twice. This is conceptually satisfying and tends to work well empirically: independently trained input and output embeddings often learn related structure anyway. Tying simply imposes that prior.
Tying also simplifies a common interpretability tool: the unembed direction for token \(w\) is the row \(E[w]\), so you can ask "which residual directions point at \(E[w]\)?" This is the "logit lens" technique (nostalgebraist 2020). The lens works with an untied head too; with tying, the directions it reads out are the same ones used to embed the input.
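A toy version of the lens, as a sketch under loud assumptions: the "layers" here are synthetic residuals that drift from one embedding row toward another, the rows are unit-normalized for the demo, and the final LayerNorm that a real lens would apply is omitted. The function name `logit_lens` and the token ids are illustrative.

```python
import numpy as np

def logit_lens(h, E):
    """Score one residual vector h (d_model,) against every row of E; return the top token id."""
    return int(np.argmax(E @ h))

rng = np.random.default_rng(0)
E = rng.normal(size=(64, 64))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit rows: demo assumption only

a, b = 10, 20                                   # arbitrary token ids
# Synthetic "layers": the residual moves from E[a] toward E[b].
residuals = [(1 - t) * E[a] + t * E[b] for t in np.linspace(0.0, 1.0, 5)]
lens_reads = [logit_lens(h, E) for h in residuals]

# The first read is token a and the last is token b; the middle reads show the handover.
print(lens_reads)
```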
The full Mini-GPT forward, with tied head
    tokens (T,) → E[tokens]    shape (T, d_model)
                      ↓
                  block_0      shape (T, d_model)
                      ↓
                  block_1      shape (T, d_model)
                      ↓
                  LN_final     shape (T, d_model)
                      ↓
                  @ E.T        shape (T, vocab_size)
                      ↓
                  logits → (softmax in loss; not part of model proper)
Why no softmax inside MiniGPT.forward?
The forward returns logits, not probabilities. The softmax happens:
- Inside the loss function (Phase 18), fused with the loss for numerical stability (the cross_entropy_from_logits trick from Phase 05).
- Inside sampling (Phase 21), possibly with temperature scaling.
Decoupling forward from softmax means:
- Numerical stability: the loss can use the log-sum-exp trick instead of computing log(softmax(...)) directly.
- Flexibility: at inference, you can apply temperature, top-k, or top-p to the same logits without rerunning the forward pass.
Mini-GPT's .forward() returns logits: (T, V). Anything downstream chooses what to do with them.
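To see why the loss wants raw logits, here is a minimal sketch of a log-sum-exp cross-entropy next to the naive log(softmax) route. It is not the Phase 05 implementation; the function names and the extreme logit values are chosen only to make the overflow visible.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax: subtract the row max before exponentiating (log-sum-exp)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_sketch(logits, targets):
    """Mean negative log-likelihood of integer targets (T,) under logits (T, V)."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

logits = np.array([[1000.0, 0.0, -1000.0],    # deliberately extreme row
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])

stable_loss = cross_entropy_sketch(logits, targets)

# Naive route: exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    naive_loss = -np.log(probs[np.arange(len(targets)), targets]).mean()

print(stable_loss)   # finite: equals log(3) / 2
print(naive_loss)    # nan
```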
A subtle point: gradient flow with tying
When you tie \(W_\text{LM} = E\), the gradient \(\partial \mathcal{L} / \partial E\) has two contributions: one from the input embedding lookup, one from the output projection. Autograd handles this automatically if you wire \(E\) as a single Parameter referenced twice — exactly what the code above does. If you accidentally make two copies (e.g., self.W_LM = E.copy()), you've untied them and lost the property.
Phase 18 inspects the gradient flowing into the tied \(E\) as a sanity check: the two contributions should sum into a single update for the shared matrix.
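The two contributions can be written out by hand and checked against finite differences. This is a sketch under simplifying assumptions, not Mini-GPT's autograd: there are no blocks or LayerNorm between the lookup and the head, the loss is a stand-in (a fixed weighting `G` of the logits), and the sizes are tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                                  # tiny sizes for the check
E = rng.normal(0.0, 0.02, size=(V, d))
tokens = np.array([2, 0, 2, 5])              # token 2 repeats on purpose
G = rng.normal(size=(len(tokens), V))        # stand-in for dL/dlogits

def loss(E):
    h = E[tokens]                            # lookup
    logits = h @ E.T                         # tied head
    return float((logits * G).sum())

h = E[tokens]
grad_out = G.T @ h                           # contribution through the LM head
grad_in = np.zeros_like(E)
np.add.at(grad_in, tokens, G @ E)            # contribution through the lookup (scatter-add)
grad_tied = grad_in + grad_out               # what a single shared Parameter accumulates

# Central finite differences over every entry of E.
eps = 1e-6
grad_fd = np.zeros_like(E)
for i in range(V):
    for j in range(d):
        Ep, Em = E.copy(), E.copy()
        Ep[i, j] += eps
        Em[i, j] -= eps
        grad_fd[i, j] = (loss(Ep) - loss(Em)) / (2 * eps)

print(np.allclose(grad_tied, grad_fd, atol=1e-6))   # True
print(np.allclose(grad_out, grad_fd, atol=1e-6))    # False: the head term alone is incomplete
```

`np.add.at` is used instead of `grad_in[tokens] += ...` because repeated token ids must accumulate rather than overwrite each other.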
Initialization
For tied embeddings, initialize \(E\) once with the embedding-style init (small Gaussian, typically \(\mathcal{N}(0, 0.02^2)\)). The output projection inherits this init — no separate init needed. This is the GPT-2 default.
For untied LM heads, you'd typically initialize the head with the same scale, since it's effectively another linear layer.
The final softmax: where it lives at inference
In Phase 21 (sampling), the model's logits \(z\) become probabilities via:
\[
p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
\]
where \(\tau\) is the temperature (\(\tau = 1\) is "raw," \(\tau < 1\) sharpens, \(\tau > 1\) flattens). Phase 21 covers temperature, top-k, and top-p (nucleus) sampling. None of this is part of Phase 17. Phase 17's job ends at logits.
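As a preview of how downstream code might consume the logits, the sketch below divides them by a temperature before a standard max-shifted softmax. The function name and the example logits are illustrative; Phase 21's actual sampler may differ.

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau; tau < 1 sharpens, tau > 1 flattens."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.0])            # made-up logits for three tokens
p_sharp = softmax_with_temperature(logits, tau=0.5)
p_raw = softmax_with_temperature(logits, tau=1.0)
p_flat = softmax_with_temperature(logits, tau=2.0)

for tau, p in ((0.5, p_sharp), (1.0, p_raw), (2.0, p_flat)):
    print(tau, p.round(3))
```

The same logits are reused for every temperature, which is the flexibility that returning logits from the forward pass buys.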
What this file does NOT cover
- Sampling strategies. Phase 21.
- The cross-entropy loss using logits. Phase 18 (with the numerical-stability trick from Phase 05).
- Untied LM heads. Mentioned for completeness; not used.