
02: The FFN sublayer and why GELU

If attention is communication between tokens, the FFN is computation within a token. Two linear layers, one nonlinearity, an expansion factor of \(4\times\). That ratio is not arbitrary: it is where most of the parameters live.

The FFN: exact form

The position-wise feed-forward network is a two-layer MLP applied independently to each token position (no cross-position mixing):

\[\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2\]

Shapes:

\(x \in \mathbb{R}^{d_\text{model}}\) (the residual stream at one position)
\(W_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}\), \(b_1 \in \mathbb{R}^{d_\text{ff}}\) — the "up-projection"
\(W_2 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}\), \(b_2 \in \mathbb{R}^{d_\text{model}}\) — the "down-projection"
Output is back in \(\mathbb{R}^{d_\text{model}}\), ready to be added to the residual stream

Three operations: up-project, nonlinearity, down-project. That's it.
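As a rough illustration of the formula and shapes above, here is a minimal NumPy sketch. It assumes the Mini-GPT sizes (\(d_\text{model} = 64\), \(d_\text{ff} = 256\)) and random placeholder weights; the function name `ffn` and the 0.02 initialisation scale are illustrative choices, not something this phase prescribes. The code uses the row-vector layout `x @ W1.T`, so a whole `(seq_len, d_model)` stack of positions goes through in one call.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # x: (..., d_model); W1: (d_ff, d_model); W2: (d_model, d_ff)
    h = gelu(x @ W1.T + b1)   # up-project, then nonlinearity -> (..., d_ff)
    return h @ W2.T + b2      # down-project -> (..., d_model)

d_model, d_ff, seq_len = 64, 256, 10
rng = np.random.default_rng(0)

# Placeholder weights (assumed init scale, for illustration only)
W1 = rng.normal(0.0, 0.02, size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(0.0, 0.02, size=(d_model, d_ff))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)
print(y.shape)
```

Each row of `y` depends only on the matching row of `x`, which is what "position-wise" means in practice.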

The \(4\times\) expansion ratio

Canonical choice: \(d_\text{ff} = 4 \cdot d_\text{model}\). For our Mini-GPT: \(d_\text{model} = 64, d_\text{ff} = 256\).

Where does this number come from? Vaswani et al. 2017 used it without strong justification (\(d_\text{model} = 512\), \(d_\text{ff} = 2048\)), and later models such as GPT-3 and PaLM kept \(4\times\) as their default. The intuition: the FFN is the "compute heavy" part of the block, so giving it more capacity than the residual width pays off. FFN parameter count grows linearly with \(d_\text{ff}\), so going to \(8\times\) or \(16\times\) doubles or quadruples the FFN's parameters for what is usually a marginal gain, while \(2\times\) tends to underfit. The community settled on \(4\times\).

In LLaMA-style models, the SwiGLU variant uses a hidden width of \(\sim 2.67 \times d_\text{model}\) (that is, \(\tfrac{2}{3} \cdot 4\, d_\text{model}\)) and adds a third, gating projection. Three matrices at that width cost about \(8\, d_\text{model}^2\) weights, the same as the two matrices of a vanilla GELU FFN at \(4\times\). Phase 17 uses the vanilla GELU form; SwiGLU is a one-page footnote we can add later if needed.
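The parameter-matching argument can be checked with a few lines of arithmetic. This is only a sketch: the helper names are made up for illustration, biases are ignored, and \(d_\text{model} = 768\) is a hypothetical width chosen so that \(\tfrac{8}{3} d_\text{model}\) is an integer (it is not the Mini-GPT config).

```python
def gelu_ffn_weights(d_model, d_ff):
    # Two matrices: up-projection (d_ff x d_model) and down-projection (d_model x d_ff)
    return 2 * d_model * d_ff

def swiglu_ffn_weights(d_model, d_hidden):
    # Three matrices of the same size: up-projection, gate projection, down-projection
    return 3 * d_model * d_hidden

d_model = 768                              # hypothetical width, divisible by 3
vanilla = gelu_ffn_weights(d_model, 4 * d_model)
gated = swiglu_ffn_weights(d_model, (8 * d_model) // 3)

print(vanilla, gated)                      # both equal 8 * d_model**2
```

When \(d_\text{model}\) is not divisible by 3, real implementations typically round the hidden width to a convenient multiple, so the match is approximate rather than exact.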

FFN parameter count

For one FFN layer:

\[|\theta_\text{FFN}| = d_\text{ff} \cdot d_\text{model} + d_\text{ff} + d_\text{model} \cdot d_\text{ff} + d_\text{model} = 2 d_\text{model} d_\text{ff} + d_\text{ff} + d_\text{model}\]

For our config:

\[|\theta_\text{FFN}| = 2 \cdot 64 \cdot 256 + 256 + 64 = 32{,}768 + 256 + 64 = 33{,}088\]

Compare to one block's attention parameters (Q, K, V, output projections, each \(d_\text{model} \times d_\text{model}\), no bias by convention):

\[|\theta_\text{attn}| = 4 \cdot d_\text{model}^2 = 4 \cdot 64^2 = 16{,}384\]

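So the FFN holds roughly 2× attention's parameter count, or about two-thirds of each block's weights. The 2:1 ratio is just \(8\, d_\text{model}^2\) versus \(4\, d_\text{model}^2\), so it holds for any model that pairs standard multi-head attention with a \(4\times\)-equivalent FFN (GPT-2, GPT-3, the original LLaMA, etc.). When you hear "transformer parameters are mostly FFN," now you know what they mean.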
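The counts above are easy to reproduce. The sketch below uses illustrative helper names and follows the same conventions as the text: biases in the FFN, none in attention.

```python
def ffn_params(d_model, d_ff):
    # W1 + b1 + W2 + b2
    return d_ff * d_model + d_ff + d_model * d_ff + d_model

def attn_params(d_model):
    # Q, K, V and output projections, each d_model x d_model, no bias
    return 4 * d_model * d_model

d_model, d_ff = 64, 256
n_ffn = ffn_params(d_model, d_ff)
n_attn = attn_params(d_model)
ratio = n_ffn / n_attn

print(n_ffn, n_attn, round(ratio, 2))  # 33088 16384 2.02
```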

GELU: the Gaussian Error Linear Unit

Introduced by Hendrycks & Gimpel (2016). Defined as:

\[\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]\]

where \(\Phi\) is the standard Gaussian CDF. Intuitively: "multiply \(x\) by the probability that a \(\mathcal{N}(0, 1)\) sample is less than \(x\)." So GELU is a smooth, probabilistically motivated gate — a soft version of ReLU.

The approximate form, used in production:

\[\text{GELU}(x) \approx 0.5 x \left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715 \, x^3)\right]\right)\]

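The two forms differ by less than about \(10^{-3}\) in absolute value for any input, which is negligible for training. The approximate form avoids erf (only tanh is needed), which makes it cheaper wherever erf is slow or unavailable, and it is the form GPT-2 used. Use the approximate form. In PyTorch, F.gelu(x, approximate='tanh') selects it (the default, approximate='none', is the exact erf form); our NumPy default is the approximate form.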

import numpy as np

def gelu(x):  # tanh approximation; x may be a scalar or a NumPy array
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
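To get a feel for how close the two forms are, the following sketch measures the largest absolute gap on a grid. The function names and the grid range \([-6, 6]\) are arbitrary choices for illustration; the exact form uses `math.erf` from the standard library, wrapped with `np.vectorize` so it accepts arrays.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi written via erf
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-6.0, 6.0, 2401)
max_abs_err = float(np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs))))
print(max_abs_err)
```

The gap is well below typical activation magnitudes, which is why the two forms are treated as interchangeable in practice.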

Why GELU and not ReLU?

ReLU has a hard kink at 0: its gradient is exactly zero for every \(x < 0\), which can cause "dead neurons" that stop updating. GELU is smooth everywhere, and its gradient is nonzero for negative inputs as well (small, and vanishing only at the single point near \(x \approx -0.75\) where the function reaches its minimum). Empirically, GELU tends to beat ReLU on language modelling by a small margin (roughly 1-2% perplexity).
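The gradient difference is easy to see numerically. The sketch below estimates derivatives with a central difference at a few negative inputs; the helper `central_diff`, the step size, and the sample points are illustrative assumptions, and the tanh form of GELU is used.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def relu(x):
    return np.maximum(0.0, x)

def central_diff(f, x, eps=1e-5):
    # Numerical derivative estimate: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

xs = np.array([-3.0, -1.0, -0.1])      # all strictly negative
relu_grad = central_diff(relu, xs)     # exactly zero everywhere here
gelu_grad = central_diff(gelu, xs)     # small but nonzero

print(relu_grad)
print(gelu_grad)
```

A ReLU unit whose pre-activation stays negative receives no learning signal at all, whereas the GELU unit still gets a small gradient.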

A short pedigree:

ReLU (Glorot et al. 2011) — zero for \(x < 0\), identity for \(x > 0\). Cheap and simple.
GELU (Hendrycks & Gimpel 2016) — smooth, probabilistic. Default in BERT, GPT-2, GPT-3.
Swish / SiLU (Ramachandran et al. 2017) — \(x \cdot \sigma(x)\). Very similar to GELU in practice.
SwiGLU (Shazeer 2020), gated variant: \(\text{FFN}_\text{SwiGLU}(x) = W_2 \left[(W_1 x) \odot \text{Swish}(W_3 x)\right]\), written in the same column-vector convention as the FFN formula above. Default in LLaMA, PaLM.

Phase 17 uses GELU because it's what GPT-2 used and that's the cleanest mental model. SwiGLU is a parameter-cost-equivalent upgrade you'd do in a real project; here it would just be a swap.

Why FFN exists: the deeper reason

A common question: "Attention can mix arbitrary information across tokens. Why do we need FFN too?"

Two angles:

Linearity. With the attention weights held fixed, attention outputs a convex mixture of value vectors, which is a linear map of its inputs. The softmax makes those weights input-dependent, so attention is not strictly linear, but everything it writes is still a weighted average of linearly projected inputs. Stacking such layers without an FFN adds little nonlinear capacity at each position. To get real nonlinear function approximation, you need the FFN's GELU.
Pointwise vs across-position. Attention's expressive power is in mixing positions. FFN's expressive power is in transforming individual positions. They're orthogonal. A transformer block needs both because language understanding needs both: "what is the role of token 5 in context?" (attention) and "given token 5 is in this role, what should I compute about it?" (FFN).
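Both angles can be illustrated with a small NumPy experiment. This sketch does not model attention itself. It only shows (1) that two stacked linear maps without a nonlinearity collapse into one matrix, while inserting GELU breaks the collapse, and (2) that the FFN treats positions independently, so shuffling positions before or after gives the same result. The sizes, random weights, and variable names are assumptions for illustration, and biases are omitted.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 64, 256, 5
W1 = rng.normal(size=(d_ff, d_model)) * 0.1
W2 = rng.normal(size=(d_model, d_ff)) * 0.1
X = rng.normal(size=(seq_len, d_model))   # one row per position

# (1) Linearity: without GELU, two layers equal a single d_model x d_model matrix
two_layers = (X @ W1.T) @ W2.T
one_layer = X @ (W2 @ W1).T
collapses = np.allclose(two_layers, one_layer)

# With GELU in between, the result is no longer that single linear map
with_gelu = gelu(X @ W1.T) @ W2.T
differs = not np.allclose(with_gelu, one_layer)

# (2) Position-wise: permuting positions commutes with the FFN
perm = rng.permutation(seq_len)
equivariant = np.allclose(with_gelu[perm], gelu(X[perm] @ W1.T) @ W2.T)

print(collapses, differs, equivariant)
```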

Interpretability work (Geva et al. 2021, "Transformer Feed-Forward Layers Are Key-Value Memories") frames the FFN as a key-value memory: each row of \(W_1\) is a key (which input pattern triggers this neuron), and the matching column of \(W_2\) is the value (what to write to the residual stream when this neuron fires). This is a useful perspective for later phases.
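The key-value reading is the same computation written as a sum over hidden neurons: \(\text{FFN}(x) = b_2 + \sum_i \text{GELU}(k_i \cdot x + b_{1,i})\, v_i\), where \(k_i\) is row \(i\) of \(W_1\) and \(v_i\) is column \(i\) of \(W_2\). The sketch below checks that equivalence with random placeholder weights; the variable names are illustrative, and the explicit loop is there for clarity, not speed.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
W1 = rng.normal(size=(d_ff, d_model)) * 0.1
b1 = rng.normal(size=d_ff) * 0.1
W2 = rng.normal(size=(d_model, d_ff)) * 0.1
b2 = rng.normal(size=d_model) * 0.1
x = rng.normal(size=d_model)              # residual stream at one position

# Standard matrix form
out_matrix = W2 @ gelu(W1 @ x + b1) + b2

# Key-value form: neuron i matches x against key W1[i], then writes value W2[:, i]
activations = gelu(W1 @ x + b1)           # how strongly each key fires
out_memory = b2.copy()
for i in range(d_ff):
    out_memory += activations[i] * W2[:, i]

same = np.allclose(out_matrix, out_memory)
print(same)
```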

What this file does NOT cover

The FFN's backward pass. Implicit from the autograd built in Phase 8; Phase 18 will exercise it.
SwiGLU implementation. Footnote upgrade path; not in Phase 17 scope.
FFN at very wide scales (LLaMA-class). Same math; bigger numbers. Out of scope.

Next: 03-tied-embeddings-and-lm-head.md