03 - Three engineered failures: anatomy of what each looks like
Three intentional bugs. Each has a specific visual signature on the dashboard. Learning them here, with no pressure, means that in Phase 26 or 28 you will recognize them in the heat of the moment.
The pedagogical setup
experiments/19-break-it/ contains three sub-experiments. Each takes the Phase-18 healthy config and corrupts it in exactly one way. Borja:
- Runs all three. Each produces a dashboard.html.
- Examines the dashboards. For each, writes a diagnosis in borja-diagnoses.md before peeking at which break was applied (the directory names 01, 02, 03 are deliberately uninformative).
- Reveals the breaks (solutions/03-three-failures-ref.md lists which is which).
- Scores 3/3, 2/3, or 1/3.
This page documents what each break should look like, so the lab is interpretable. Read it after writing the diagnoses, not before.
Failure 1: Bad initialization (Xavier × 100)
The corruption
Multiply the initial weights of every nn.Linear and nn.Embedding by 100. Mathematically: instead of \(W \sim \mathcal{N}(0, \sigma_\text{Xavier}^2)\), use \(W \sim \mathcal{N}(0, (100 \sigma_\text{Xavier})^2)\).
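A minimal sketch of this corruption in plain Python, assuming Xavier-normal init with standard deviation sqrt(2 / (fan_in + fan_out)). The helper names (xavier_std, init_linear, matvec, rms) and the layer sizes are illustrative and not taken from the experiment code. It shows that, for a fixed input, inflating the weights by 100 inflates the layer output by the same factor.

```python
import math
import random


def xavier_std(fan_in: int, fan_out: int) -> float:
    """Std of Xavier/Glorot normal init: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_linear(fan_in: int, fan_out: int, scale: float = 1.0, seed: int = 0):
    """Return a fan_out x fan_in weight matrix; scale=100 mimics the break."""
    rng = random.Random(seed)
    std = scale * xavier_std(fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def matvec(W, x):
    """Plain matrix-vector product, one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(a * a for a in v) / len(v))


# Hypothetical sizes; the input has RMS magnitude of order 1.
fan_in, fan_out = 64, 64
rng_x = random.Random(1)
x = [rng_x.gauss(0.0, 1.0) for _ in range(fan_in)]

healthy = rms(matvec(init_linear(fan_in, fan_out, scale=1.0), x))
broken = rms(matvec(init_linear(fan_in, fan_out, scale=100.0), x))
ratio = broken / healthy
print(f"healthy RMS = {healthy:.3f}, broken RMS = {broken:.3f}, ratio = {ratio:.1f}")
```

Because a linear layer is linear in its weights, the ratio is 100 up to floating-point rounding for any input, so the inflation is visible on the very first forward pass.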
What happens internally
Forward pass: with input magnitude \(\|x\| \sim O(1)\) and \(W\) inflated 100×, the first layer's output has magnitude \(\sim 100\). Each layer in the two-layer stack applies normalization, but normalization only rescales what a sublayer reads; it does not bound the un-normalized residual stream, so magnitudes compound through the stack. By the LM head, pre-softmax logits are in the range \(\pm 10^3\). Softmax of such logits is effectively one-hot: whenever the correct token is not the argmax, its probability underflows to 0, the cross-entropy evaluates \(\log(0) = -\infty\), and the infinite loss turns the gradients into NaN.
Time to NaN: typically 5-30 steps.
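The saturation can be checked in a few lines of plain Python. This is a sketch with made-up logits in the ±10^3 range, not output from the experiment; the function names are illustrative. It compares a probability-space loss, which hits log(0), with a log-space loss, which stays finite but is still enormous.

```python
import math


def softmax(logits):
    """Softmax with max-subtraction, so exp() itself never overflows."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stable_cross_entropy(logits, target):
    """-log softmax(logits)[target], computed entirely in log space."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


logits = [1000.0, -1000.0, 0.0]  # illustrative logits in the +/- 10^3 range
target = 1                       # the correct token is not the argmax

probs = softmax(logits)
p_target = probs[target]         # exp(-2000) underflows to exactly 0.0
# math.log(0.0) raises in Python, so the infinite loss is spelled out here.
naive_loss = math.inf if p_target == 0.0 else -math.log(p_target)
stable_loss = stable_cross_entropy(logits, target)
print(probs, naive_loss, stable_loss)
```

Either way the run is lost: the probability-space loss is infinite, and even the log-space loss of about 2000 nats produces gradients far beyond anything the clip threshold was tuned for.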
Dashboard signature
- Panel 4 (activations) is the first to scream. Layer 0's activation magnitude is 50-200× the healthy baseline from step 0. The final LN output is in the hundreds. Bright red, immediate.
- Panel 3 (grad norm pre-clip) follows. By the chain rule, each weight gradient is the product of the layer's input activations and the back-propagated signal, and the oversized weights inflate both, so the gradient norm grows roughly with the square of the activation scale; pre-clip norm hits \(10^4\) to \(10^6\) within 5 steps. Post-clip pegs at the clip threshold (1.0).
- Panel 1 (loss) goes to NaN within 10-30 steps. After that point, every panel shows NaN; the run effectively dies.
Diagnostic hint
If panel 4 shows the layer-0 activation magnitude already far above healthy at step 0 (before any training has happened), the init is wrong. Healthy init produces activations of the same order at step 0 as at step 100; the very first forward pass already encodes whether init is sane.
Why this is a real-world bug
Real-world cases: someone copy-pastes a Kaiming init that assumes ReLU into a stack that uses GELU and forgets to retune the variance. Or somebody implements Xavier with 1/fan_in instead of 2/(fan_in + fan_out). The 100× exaggeration here is to make the lesson unambiguous; the real-world versions are 2-5× and produce a subtler version of the same dashboard.
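As a worked check of the second case, the sketch below compares the two variance formulas for a hypothetical layer shape (fan_in = 64, fan_out = 256, chosen only for illustration, for example an MLP up-projection). The function names are illustrative.

```python
def xavier_var(fan_in: int, fan_out: int) -> float:
    """Xavier/Glorot variance: 2 / (fan_in + fan_out)."""
    return 2.0 / (fan_in + fan_out)


def mistaken_var(fan_in: int) -> float:
    """The mistaken variant from the text: 1 / fan_in."""
    return 1.0 / fan_in


fan_in, fan_out = 64, 256  # hypothetical layer shape
var_ratio = mistaken_var(fan_in) / xavier_var(fan_in, fan_out)
std_ratio = var_ratio ** 0.5
print(f"variance inflated {var_ratio:.2f}x, std inflated {std_ratio:.2f}x")
```

The ratio is (fan_in + fan_out) / (2 · fan_in), so the two formulas agree for square layers and the mistake only bites where fan_out differs from fan_in. That is part of why this version of the bug is subtle.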
Failure 2: No warmup (warmup_steps = 0)
The corruption
Set warmup_steps = 0 in the schedule. The first optimizer step at global_step = 0 uses LR = lr_max directly (or whatever cosine_schedule(0, T, 0, lr_max, lr_min) evaluates to, usually lr_max).
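For reference, here is a minimal sketch of a warmup-plus-cosine schedule. The argument order mirrors the call quoted above, but the body is an assumption about what such a function typically does (linear warmup from 0, then cosine decay), not the curriculum's actual implementation. The numeric values are illustrative.

```python
import math


def cosine_schedule(step, total_steps, warmup_steps, lr_max, lr_min):
    """Linear warmup from 0 to lr_max, then cosine decay down to lr_min.

    Assumed signature: (step, T, warmup_steps, lr_max, lr_min).
    """
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


T, lr_max, lr_min = 1000, 3e-4, 3e-5  # illustrative values only
healthy_lr0 = cosine_schedule(0, T, 100, lr_max, lr_min)  # with warmup
broken_lr0 = cosine_schedule(0, T, 0, lr_max, lr_min)     # warmup_steps = 0
final_lr = cosine_schedule(T, T, 0, lr_max, lr_min)
print(healthy_lr0, broken_lr0, final_lr)
```

With warmup_steps = 0 the warmup branch is never taken, so the cosine starts at its peak and the very first update runs at lr_max.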
What happens internally
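At step 0, AdamW's moment buffers are zero-initialized (\(m = v = 0\)). The first gradient \(g_0\) from random-init parameters is a single noisy sample, so the bias-corrected second moment \(\hat v_0\) is a one-sample estimate of the gradient scale and can be badly off. The update step (weight-decay term omitted):
\[
\theta_1 = \theta_0 - \eta \, \frac{\hat m_0}{\sqrt{\hat v_0} + \epsilon}, \qquad \hat m_0 = \frac{(1-\beta_1)\, g_0}{1-\beta_1} = g_0, \qquad \hat v_0 = \frac{(1-\beta_2)\, g_0^2}{1-\beta_2} = g_0^2
\]
With \(\beta_2 = 0.95\), the bias correction divides by \((1 - 0.95)\), which exactly cancels the zero initialization and leaves \(\hat v_0 = g_0^2\). Wherever that single sample underestimates the typical squared gradient, the divisor is too small and the step is too large. In fact every parameter moves by about \(\eta\) on this first step, however small or noisy its gradient. With \(\eta\) equal to lr_max and no warmup to damp it, the model jumps to a bad region of parameter space.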
Typically, the loss drops slightly (the step did improve), spikes hard at step 5-20 (the bad region's gradients are wild), then recovers slowly over ~200 steps.
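The size of that first step can be checked directly. This is a minimal sketch of one bias-corrected Adam update for a single scalar parameter, with weight decay omitted. β2 = 0.95 follows the text; β1 = 0.9, ε = 1e-8 and the learning rate are assumed, illustrative values. The function name is hypothetical.

```python
import math


def adam_first_step(g, lr, beta1=0.9, beta2=0.95, eps=1e-8):
    """Size of the very first Adam update for one parameter (no weight decay)."""
    m = (1.0 - beta1) * g           # moment buffers start at zero
    v = (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** 1)  # bias correction at t = 1
    v_hat = v / (1.0 - beta2 ** 1)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


lr_max = 3e-4  # illustrative peak learning rate
steps = {g: adam_first_step(g, lr_max) for g in (1e-3, 1.0, 10.0)}
for g, step in steps.items():
    print(f"gradient {g:>6}: first update = {step:.6e}")

# With warmup, the schedule returns 0 at step 0 and the update vanishes.
warm_step = adam_first_step(1.0, lr=0.0)
```

The update is about lr_max for gradients spanning four orders of magnitude. On the first step Adam carries no information about gradient scale, so the learning rate alone sets how far every parameter moves, and warmup is what keeps that move small.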
Dashboard signature
- Panel 3 (grad norm pre-clip) spikes at step 1-5. Often 10-50× the healthy peak. Post-clip pegs at 1.0 for a few steps.
- Panel 1 (loss curve) shows a V-shape early. Loss drops, spikes, drops, normalizes. The spike is small in absolute terms but visible on the log axis.
- Panel 2 (LR) is the smoking gun. It starts at lr_max instead of 0. If you don't think to look at panel 2, you might attribute the issue to bad data or bad init; panel 2 makes the diagnosis immediate.
- Panels 4, 5, 6 show transient excursions but settle into healthy regions by step 200.
Diagnostic hint
The combination of "Panel 3 spike at step 1-5" + "Panel 2 starts at lr_max" is uniquely indicative of missing warmup. Bad-init produces persistent magnitude excursions in Panel 4; broken-mask produces no spike in Panel 3 at all (the gradients are healthy for the easy task).
Why this is a real-world bug
Real-world cases: someone using a schedule library that doesn't have warmup by default. Or warmup_steps is configured but the variable is read after the optimizer is built (so it uses the default = 0). The dashboard's Panel 2 catches this in one glance.
Failure 3: Broken causal mask (mask = ones, not tril)
The corruption
Replace the lower-triangular causal mask with an all-true mask:
# Healthy:
causal_mask = np.tril(np.ones((L, L), dtype=bool))
# Broken:
causal_mask = np.ones((L, L), dtype=bool)
Each token can now attend to future tokens. Training becomes an easier task: predict \(y_l\) from \(x_0, x_1, \ldots, x_L\) (including \(x_{l+1} = y_l\) in the input).
What happens internally
At training time, the model "cheats" — at position \(l\) the input itself contains \(x_{l+1} = y_l\). The model learns to copy from the future, which is trivial. Training loss drops to near 0 within 100-200 steps.
At validation time, the same broken mask is in use. Val loss also looks low. The mask is broken in both train and val passes — so the bug is consistent, just wrong.
But: the model has not learned to predict from the past. It has learned identity-mapping. When deployed for generation (Phase 21), it generates garbage — at inference time, the future tokens don't exist, the mask is materially different, and the model has no idea what to do.
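The leak can be made concrete with a toy single-head attention in plain Python. This is a minimal sketch, not the experiment's attention code: scores are all zero (so attention is uniform over the allowed positions), values are scalars, and the helper names are illustrative. Changing only the last token changes the output at position 0 under the all-true mask and leaves it untouched under the causal mask.

```python
import math


def causal_mask(L):
    """Lower-triangular mask: position i may attend to j <= i."""
    return [[j <= i for j in range(L)] for i in range(L)]


def full_mask(L):
    """The broken mask: every position may attend everywhere."""
    return [[True] * L for _ in range(L)]


def masked_softmax(scores, allowed):
    """Softmax where disallowed positions are set to -inf (weight 0)."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def attend(values, mask):
    """Uniform-score attention: each output is the mean of allowed values."""
    L = len(values)
    out = []
    for i in range(L):
        weights = masked_softmax([0.0] * L, mask[i])
        out.append(sum(w * v for w, v in zip(weights, values)))
    return out


original = [1.0, 2.0, 3.0, 4.0]
perturbed = [1.0, 2.0, 3.0, 99.0]  # only the last (future) token changes

L = len(original)
causal_delta = attend(perturbed, causal_mask(L))[0] - attend(original, causal_mask(L))[0]
leaky_delta = attend(perturbed, full_mask(L))[0] - attend(original, full_mask(L))[0]
print(f"causal: {causal_delta}, broken mask: {leaky_delta}")
```

The same perturb-a-future-token check can in principle be run against a real model's outputs: if the prediction at an early position moves when only later tokens change, information is flowing backwards through the mask.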
Dashboard signature (subtle!)
- Panel 1 (loss curves) shows train approaching 0 fast. Within 100 steps, training loss is below 0.1. If your model has ~103k params and the corpus has hundreds of examples, the model can't legitimately memorize the corpus that fast — unless it's solving a trivial task.
- Val loss tracks train loss closely (both approaching 0). This is the suspicious part. Real overfitting shows train ≪ val; the broken mask shows train ≈ val, both very small. The model isn't overfitting to the training set; it's solving an easier task that happens to apply equally well to val.
- Panel 6 (dead heads) often shows high dead-head counts. Why: when the mask allows future-attention, the task can be solved trivially by "attend to position +1, copy." Only one head needs to do this; the others are redundant and effectively die.
- Panels 2, 3, 4, 5 look mostly normal. This is the trap — five of six panels look fine.
Diagnostic hint
The signature is panel 1 + panel 6 together: loss too low, dead-head count abnormally high. Either alone is ambiguous. Together, a loss that says "either the model is too good, or it's solving the wrong task" and a head count that says "many heads are redundant" point to the broken mask.
The other heuristic: if loss seems too good to be true, it is. Estimate the per-token entropy of the corpus. Validation loss can't legitimately go below that entropy, so a loss well under it means information is leaking into the input.
Why this is a real-world bug
Real-world cases: someone enabling KV-cache logic that assumes a different mask convention. Or an attention implementation where the mask is applied at the wrong place (additive -inf vs multiplicative 0). This is one of the most common attention bugs in published code; the curriculum makes it a Phase-19 lesson because diagnosing it later, in Phase 21 or beyond, is much more expensive.
How the three break dashboards differ at a glance
| Panel | Healthy | Failure 1 (init) | Failure 2 (warmup) | Failure 3 (mask) |
|---|---|---|---|---|
| 1 Loss | Smooth descent | NaN by step 30 | V-shape at step 5 | Approaches 0 (suspiciously) |
| 2 LR | Warmup then cosine | Doesn't matter (NaN'd) | Starts at lr_max | Healthy |
| 3 Grad norm | Stable, sometimes clipped | Explodes immediately | Spikes at step 1-5 | Healthy |
| 4 Activations | Stable | Already huge at step 0 | Transient spikes | Healthy |
| 5 Spectral | Stable | Explodes | Brief excursion | Healthy |
| 6 Dead | ~5% | Effectively all (post-NaN) | Healthy after recovery | High in attention |
The diagnostic algorithm is: which panel screams first, and what shape is the scream.
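As a rough illustration, the table can be turned into a small rule-based check. This is only a sketch: the summary keys, the thresholds, and the three example summaries are all hypothetical, loosely derived from the ranges quoted on this page, and a real dashboard would need its own calibration.

```python
def diagnose(summary):
    """Map a few dashboard summary numbers to one of the three breaks.

    All keys and thresholds are hypothetical, loosely based on the table above.
    """
    # Failure 1: activations already far above baseline before any training.
    if summary["act_ratio_step0"] >= 10.0:
        return "bad init"
    # Failure 2: LR starts at (or near) its peak and gradients spike early.
    starts_hot = summary["lr_step0"] >= 0.5 * summary["lr_max"]
    if starts_hot and summary["grad_spike_ratio"] >= 5.0:
        return "no warmup"
    # Failure 3: loss too good to be true on train and val, many dead heads.
    low_loss = summary["train_loss"] < 0.1 and summary["val_loss"] < 0.1
    if low_loss and summary["dead_head_frac"] >= 0.25:
        return "broken mask"
    return "no known signature"


# Made-up summaries, one per break, for illustration only.
base = {"act_ratio_step0": 1.0, "lr_step0": 0.0, "lr_max": 3e-4,
        "grad_spike_ratio": 1.0, "train_loss": 2.0, "val_loss": 2.1,
        "dead_head_frac": 0.05}
run_a = {**base, "act_ratio_step0": 120.0}
run_b = {**base, "lr_step0": 3e-4, "grad_spike_ratio": 20.0}
run_c = {**base, "train_loss": 0.03, "val_loss": 0.04, "dead_head_frac": 0.5}

results = [diagnose(r) for r in (base, run_a, run_b, run_c)]
print(results)
```

The checks run in the order the failures show up: step-0 activations first, then the early LR and gradient spike, then the end-of-run loss and dead-head counts. This mirrors the "which panel screams first" rule.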
One-paragraph recap
Three engineered breaks each produce a unique dashboard signature. Bad init: Panel 4 already shows enormous activation magnitudes at step 0; loss NaNs within 30 steps. No warmup: Panel 2 starts at lr_max instead of 0, Panel 3 spikes at step 1-5, loss recovers. Broken mask: Panels 1 and 6 together — loss is suspiciously low and dead-head count is unusually high. Internalize these three signatures and they protect you across all the future training phases of the curriculum.
Next: lab/00-instrument-hooks.md.