03 - Three engineered failures: anatomy of what each looks like

Three intentional bugs. Each has a specific visual signature on the dashboard. Learning them here, while calm, means you will recognize them under pressure in Phase 26 or 28.

The pedagogical setup

experiments/19-break-it/ contains three sub-experiments. Each takes the Phase-18 healthy config and corrupts it in exactly one way. Borja:

Runs all three. Each produces a dashboard.html.
Examines the dashboards. For each, writes a diagnosis in borja-diagnoses.md before peeking at which break was applied (the directory names 01, 02, 03 are deliberately uninformative).
Reveals the breaks (solutions/03-three-failures-ref.md lists which is which).
Scores 3/3, 2/3, or 1/3.

This page documents what each break should look like, so the lab is interpretable. Read it after writing the diagnoses, not before.

Failure 1: Bad initialization (Xavier × 100)

The corruption

Multiply all nn.Linear and nn.Embedding initial weights by 100×. Mathematically: instead of \(W \sim \mathcal{N}(0, \sigma_\text{Xavier}^2)\), use \(W \sim \mathcal{N}(0, (100 \sigma_\text{Xavier})^2)\).
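The effect of the scaling on a single linear layer can be sketched in plain Python. This is a minimal illustration, not the lab's code: the helper names (`xavier_std`, `init_weights`, `linear`, `rms`), the layer size, and the seeds are all assumptions made for the example.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot normal: Var(W) = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, scale=1.0, seed=0):
    # scale=1.0 is the healthy init; scale=100.0 mimics the corruption.
    rng = random.Random(seed)
    std = scale * xavier_std(fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def linear(x, W):
    # y_j = sum_i x_i * W[i][j]
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

fan_in, fan_out = 64, 64                      # hypothetical layer size
x_rng = random.Random(1)
x = [x_rng.gauss(0.0, 1.0) for _ in range(fan_in)]  # unit-scale input

healthy = rms(linear(x, init_weights(fan_in, fan_out, scale=1.0)))
broken = rms(linear(x, init_weights(fan_in, fan_out, scale=100.0)))
ratio = broken / healthy  # the output magnitude inherits the 100x factor
```

Because the layer is linear, the 100× factor on the weights passes straight through to the output, which is the quantity the activation panel tracks.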

What happens internally

Forward pass: with input magnitude \(\|x\| \sim O(1)\) and \(W\) inflated 100×, the first layer's output has magnitude \(\sim 100\). Each layer in the two-layer stack normalizes its own input, but normalization only rescales what a sublayer reads; it does not shrink the un-normalized residual stream that every sublayer keeps adding to, so magnitudes compound. By the LM head, pre-softmax logits are in the range \(\pm 10^3\). The softmax of such logits is effectively one-hot. Whenever the mass lands on the wrong token, the target probability underflows to 0, the cross-entropy term becomes \(-\log(0) = +\infty\), and the backward pass turns that infinity into NaN.
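The saturation step can be reproduced with a few floats. The sketch below is an assumption-laden illustration: the function names and logit values are made up, and it contrasts a "softmax first, log second" loss with a fused log-softmax, which stays finite on the same inputs.

```python
import math

def softmax_then_log_loss(logits, target):
    # Softmax first (max-subtracted so exp() cannot overflow), log second.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_target = exps[target] / sum(exps)
    # The target probability can underflow to exactly 0.0, i.e. log(0).
    return math.inf if p_target == 0.0 else -math.log(p_target)

def log_softmax_stable(logits):
    # Fused form: never materializes the underflowed probability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

healthy_logits = [2.0, -1.0, 0.5]
broken_logits = [1000.0, -1000.0, 500.0]  # roughly the +/- 10^3 range

healthy_loss = softmax_then_log_loss(healthy_logits, target=1)
broken_loss = softmax_then_log_loss(broken_logits, target=1)  # infinite
stable_loss = -log_softmax_stable(broken_logits)[1]           # large but finite
nan_example = broken_loss - broken_loss                       # inf - inf is NaN
```

Even where a fused log-softmax keeps the loss finite, a per-token loss in the thousands still produces the enormous gradients the dashboard shows.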

Time to NaN: typically 5-30 steps.

Dashboard signature

Panel 4 (activations) is the first to scream. Layer 0's activation magnitude is 50-200× the healthy baseline from step 0. The final LN output is in the hundreds. Bright red, immediate.
Panel 3 (grad norm pre-clip) follows. By the chain rule, each weight gradient scales with the layer's input activations and with the inflated weights downstream, so the pre-clip norm hits \(10^4\) to \(10^6\) within 5 steps. Post-clip pegs at the clip threshold (1.0).
Panel 1 (loss) goes to NaN within 5-30 steps. After that point, every panel shows NaN; the run is effectively dead.
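The relationship between the pre-clip and post-clip readings in Panel 3 can be sketched with a global-norm clip. This is a minimal stand-in written for illustration; the function name, the tiny epsilon, and the example gradients are assumptions, and only the threshold of 1.0 comes from the text above.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    # Pre-clip norm: what the dashboard reports as the raw gradient size.
    pre_clip = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds the threshold.
    scale = min(1.0, max_norm / (pre_clip + 1e-12))
    clipped = [g * scale for g in grads]
    post_clip = math.sqrt(sum(g * g for g in clipped))
    return clipped, pre_clip, post_clip

# Exploded gradients: pre-clip is huge, post-clip pegs at the threshold.
_, pre_bad, post_bad = clip_by_global_norm([3.0e4, -4.0e4])

# Healthy gradients: below the threshold, so clipping leaves them alone.
_, pre_ok, post_ok = clip_by_global_norm([0.3, 0.4])
```

A post-clip curve that sits flat at exactly the threshold for many consecutive steps is therefore a sign that the pre-clip curve is the one worth reading.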

Diagnostic hint

If panel 4 shows the layer-0 activation magnitude already far above healthy at step 0 (before any training has happened), the init is wrong. Healthy init produces activations of the same order at step 0 as at step 100; the very first forward pass already encodes whether init is sane.

Why this is a real-world bug

Real-world cases: someone copy-pastes a Kaiming init that assumes ReLU into a stack that uses GELU and forgets to retune the variance. Or somebody implements Xavier with 1/fan_in instead of 2/(fan_in + fan_out). The 100× exaggeration here is to make the lesson unambiguous; the real-world versions are 2-5× and produce a subtler version of the same dashboard.
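To get a feel for how large the Xavier mix-up is, the two variance formulas can be compared directly. The layer shapes below are hypothetical; the formulas are the ones named in the paragraph above.

```python
import math

def xavier_var(fan_in, fan_out):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)
    return 2.0 / (fan_in + fan_out)

def fan_in_only_var(fan_in):
    # The mistaken variant: Var(W) = 1 / fan_in
    return 1.0 / fan_in

# Hypothetical MLP up-projection with 4x expansion.
fan_in, fan_out = 64, 256
var_ratio = fan_in_only_var(fan_in) / xavier_var(fan_in, fan_out)
std_ratio = math.sqrt(var_ratio)

# On a square layer the two formulas agree, so the bug is invisible there.
square_ratio = fan_in_only_var(64) / xavier_var(64, 64)
```

The ratio works out to (fan_in + fan_out) / (2 · fan_in), so the error only appears on non-square layers, which is part of why it is easy to miss.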

Failure 2: No warmup (warmup_steps = 0)

The corruption

Set warmup_steps = 0 in the schedule. The first optimizer step at global_step = 0 then uses LR = lr_max directly (or whatever cosine_schedule(0, T, 0, lr_max, lr_min) evaluates to, usually lr_max).
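One plausible shape for such a schedule is sketched below. The argument order follows the call quoted above, but the body is an assumption (linear warmup from 0, then cosine decay), and the numeric values are hypothetical; the curriculum's own `cosine_schedule` may differ.

```python
import math

def cosine_schedule(step, total_steps, warmup_steps, lr_max, lr_min):
    # Linear warmup from 0 to lr_max, then cosine decay down to lr_min.
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

T, lr_max, lr_min = 1000, 3e-4, 3e-5  # hypothetical values

lr_no_warmup = cosine_schedule(0, T, 0, lr_max, lr_min)      # corrupted config
lr_with_warmup = cosine_schedule(0, T, 100, lr_max, lr_min)  # healthy config
lr_at_end = cosine_schedule(T, T, 100, lr_max, lr_min)
```

With `warmup_steps = 0` the warmup branch is never taken, so step 0 lands at the top of the cosine curve; that is the value Panel 2 shows in the broken run.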

What happens internally

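At step 0, AdamW's moment buffers \(m\) and \(v\) are initialized to zero. After the first gradient \(g_0\) (taken at random-init parameters) they hold \((1-\beta_1)\,g_0\) and \((1-\beta_2)\,g_0^2\). Bias correction divides by \((1-\beta_1)\) and \((1-\beta_2)\) (the latter is \(1 - 0.95\) here), giving \(\hat m_0 = g_0\) and \(\hat v_0 = g_0^2\). So \(\hat v_0\) is a one-sample estimate of the gradient's second moment, and a very noisy one. The update step:

\[\theta_1 = \theta_0 - \eta_\text{max} \cdot \frac{\hat m_0}{\sqrt{\hat v_0} + \epsilon}\]

Since \(\hat m_0 / \sqrt{\hat v_0} \approx \operatorname{sign}(g_0)\), every parameter moves by roughly the full \(\eta_\text{max}\), no matter how small or unreliable its gradient is. Wherever that single sample underestimates the true second moment, the divisor is too small and the step is too large. The model jumps to a bad region of parameter space.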
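A scalar version of the first AdamW update makes the size of that step concrete. This is a sketch under stated assumptions: \(\beta_2 = 0.95\) is taken from the text, while \(\beta_1 = 0.9\), \(\epsilon = 10^{-8}\), the learning rate, and the function name are illustrative, and weight decay is left out.

```python
import math

def adamw_first_update(g, lr, beta1=0.9, beta2=0.95, eps=1e-8):
    # Both moment buffers start at zero, so after one gradient:
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    # Bias correction for the first step.
    m_hat = m / (1 - beta1)   # equals g
    v_hat = v / (1 - beta2)   # equals g**2, a one-sample estimate
    return lr * m_hat / (math.sqrt(v_hat) + eps)

lr_max = 3e-4  # hypothetical peak learning rate

step_tiny_grad = adamw_first_update(1e-4, lr_max)  # a barely-informative gradient
step_huge_grad = adamw_first_update(50.0, lr_max)  # a very large gradient
```

Both calls return a step of almost exactly `lr_max`: on the first update the gradient's magnitude cancels out, so only the learning rate limits how far every parameter moves. Warmup keeps that learning rate small until the second-moment estimate has averaged over more than one sample.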

Typically, the loss drops slightly (the first step did improve things), spikes at step 5-20 (the bad region's gradients are wild), then recovers slowly over ~200 steps.

Dashboard signature

Panel 3 (grad norm pre-clip) spikes at step 1-5. Often 10-50× the healthy peak. Post-clip pegs at 1.0 for a few steps.
Panel 1 (loss curve) shows a V-shape early. Loss drops, spikes, drops, normalizes. The spike is small in absolute terms but visible on the log axis.
Panel 2 (LR) is the smoking gun. It starts at lr_max instead of 0. If you don't think to look at panel 2, you might attribute the issue to bad data or bad init; panel 2 makes the diagnosis immediate.
Panels 4, 5, 6 show transient excursions but settle into healthy regions by step 200.

Diagnostic hint

The combination of "Panel 3 spike at step 1-5" + "Panel 2 starts at lr_max" is uniquely indicative of missing warmup. Bad-init produces persistent magnitude excursions in Panel 4; broken-mask produces no spike in Panel 3 at all (the gradients are healthy for the easy task).

Why this is a real-world bug

Real-world cases: someone uses a schedule library that has no warmup by default. Or warmup_steps is set in the config, but the schedule is built before that value is read, so it silently keeps its default of 0. The dashboard's Panel 2 catches this at a glance.

Failure 3: Broken causal mask (mask = ones, not tril)

The corruption

Replace the lower-triangular causal mask with an all-true mask:

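import numpy as np

# L is the sequence length.
# Healthy: lower-triangular, position i attends only to positions j <= i.
causal_mask = np.tril(np.ones((L, L), dtype=bool))

# Broken: all True, every position attends to every position, future included.
causal_mask = np.ones((L, L), dtype=bool)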

Each token can now attend to future tokens. Training becomes an easier task: predict \(y_l\) from \(x_0, x_1, \ldots, x_{L-1}\), an input that already includes \(x_{l+1} = y_l\).
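A cheap guard against this class of bug is a test that asserts no query position can see a later key position. The sketch below uses plain nested lists rather than NumPy so it runs without dependencies; the function names are made up for the example.

```python
def causal_mask(L):
    # mask[i][j] is True when query position i may attend to key position j.
    return [[j <= i for j in range(L)] for i in range(L)]

def all_true_mask(L):
    # The corrupted mask: every position sees every other position.
    return [[True] * L for _ in range(L)]

def leaks_future(mask):
    # True if any position i is allowed to attend to some j > i.
    n = len(mask)
    return any(mask[i][j] for i in range(n) for j in range(i + 1, n))

healthy_leaks = leaks_future(causal_mask(4))   # expected: no leak
broken_leaks = leaks_future(all_true_mask(4))  # expected: leak
```

A check like this runs in microseconds and flags the bug before any training step, which is far cheaper than reading it off the loss curve.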

What happens internally

At training time, the model "cheats" — at position \(l\) the input itself contains \(x_{l+1} = y_l\). The model learns to copy from the future, which is trivial. Training loss drops to near 0 within 100-200 steps.

At validation time, the same broken mask is in use. Val loss also looks low. The mask is broken in both train and val passes — so the bug is consistent, just wrong.

But the model has not learned to predict from the past. It has learned an identity mapping. When deployed for generation (Phase 21), it generates garbage: at inference time the future tokens don't exist, so the effective mask is materially different, and the model has no idea what to do.

Dashboard signature (subtle!)

Panel 1 (loss curves) shows train approaching 0 fast. Within 100 steps, training loss is below 0.1. If your model has ~103k params and the corpus has hundreds of examples, the model can't legitimately memorize the corpus that fast — unless it's solving a trivial task.
Val loss tracks train loss closely (both approaching 0). This is the suspicious part. Real overfitting shows train ≪ val; the broken mask shows train ≈ val, both very small. The model isn't overfitting to the training set; it's solving an easier task that happens to apply equally well to val.
Panel 6 (dead heads) often shows high dead-head counts. Why: when the mask allows future-attention, the whole task can be solved by "attend to position +1, copy." Only one head needs to do this; the others are redundant and effectively die.
Panels 2, 3, 4, 5 look mostly normal. This is the trap: four of six panels look fine, and Panel 1 looks better than fine.

Diagnostic hint

The signature is panel 1 + panel 6 together: loss too low, dead-head count abnormally high. Either alone is ambiguous: a very low loss says "either the model is too good, or it's solving the wrong task," and a high dead-head count says "many heads are redundant." Together they give the broken-mask diagnosis.

The other heuristic: if loss seems too good to be true, it is. Estimate the per-token (conditional) entropy of the corpus. Validation loss can't sit meaningfully below that entropy unless something is wrong.
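A rough reference number is easy to compute, with a caveat. The sketch below measures the unigram entropy of a token stream in nats. That ignores context, so a working model can legitimately score below it; the true floor is the conditional entropy, which is harder to estimate. The `looks_too_good` helper and its `factor` of 0.1 are arbitrary assumptions, meant only as a coarse alarm.

```python
import math
from collections import Counter

def unigram_entropy_nats(tokens):
    # H = -sum_t p(t) * ln p(t), using empirical token frequencies.
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def looks_too_good(val_loss, reference_entropy, factor=0.1):
    # Coarse alarm: loss far below a context-free reference deserves a second look.
    return val_loss < factor * reference_entropy

tokens = list("abracadabra")      # hypothetical toy corpus, character tokens
h = unigram_entropy_nats(tokens)
alarm = looks_too_good(0.05, h)   # a validation loss of 0.05 nats
```

An alarm here is a prompt to inspect the mask and the data pipeline, not a diagnosis on its own.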

Why this is a real-world bug

Real-world cases: someone enabling KV-cache logic that assumes a different mask convention. Or an attention implementation where the mask is applied at the wrong place (additive -inf vs multiplicative 0). This is one of the most common attention bugs in published code; the curriculum makes it a Phase-19 lesson because diagnosing it later, in Phase 21 or beyond, is much more expensive.

How the three break dashboards differ at a glance

| Panel | Healthy | Failure 1 (init) | Failure 2 (warmup) | Failure 3 (mask) |
| --- | --- | --- | --- | --- |
| 1 Loss | Smooth descent | NaN by step 30 | V-shape at step 5 | Approaches 0 (suspiciously) |
| 2 LR | Warmup then cosine | Doesn't matter (NaN'd) | Starts at lr_max | Healthy |
| 3 Grad norm | Stable, sometimes clipped | Explodes immediately | Spikes at step 1-5 | Healthy |
| 4 Activations | Stable | Already huge at step 0 | Transient spikes | Healthy |
| 5 Spectral | Stable | Explodes | Brief excursion | Healthy |
| 6 Dead | ~5% | Effectively all (post-NaN) | Healthy after recovery | High in attention |

The diagnostic algorithm is: which panel screams first, and what shape is the scream.
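That decision order can be written down as a small triage function. Everything here is a sketch: the metric names, the function signature, and every threshold except the "training loss below 0.1" figure are assumptions chosen for illustration, not values from the lab.

```python
def diagnose(act_ratio_step0, lr_step0, lr_max, train_loss, val_loss, dead_head_frac):
    # Thresholds are illustrative assumptions, not calibrated values.
    # 1. Panel 4: activations far above the healthy baseline before any training.
    if act_ratio_step0 > 10.0:
        return "bad init"
    # 2. Panel 2: learning rate already at its peak on the very first step.
    if lr_step0 >= 0.99 * lr_max:
        return "no warmup"
    # 3. Panels 1 + 6: train and val both near zero, and many heads dead.
    if train_loss < 0.1 and val_loss < 0.1 and dead_head_frac > 0.25:
        return "broken mask"
    return "no known signature"

run_a = diagnose(120.0, 0.0, 3e-4, float("nan"), float("nan"), 1.0)
run_b = diagnose(1.1, 3e-4, 3e-4, 2.4, 2.5, 0.05)
run_c = diagnose(0.9, 0.0, 3e-4, 0.04, 0.05, 0.5)
run_d = diagnose(1.0, 0.0, 3e-4, 2.1, 2.3, 0.05)
```

The checks run in the order the panels "scream": the init test comes first because a NaN'd run makes every later comparison meaningless.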

One-paragraph recap

Three engineered breaks each produce a unique dashboard signature. Bad init: Panel 4 already shows enormous activation magnitudes at step 0; loss NaNs within 30 steps. No warmup: Panel 2 starts at lr_max instead of 0, Panel 3 spikes at step 1-5, loss recovers. Broken mask: Panels 1 and 6 together — loss is suspiciously low and dead-head count is unusually high. Internalize these three signatures and they protect you across all the future training phases of the curriculum.

Next: lab/00-instrument-hooks.md.