Break — train without grad clipping; reproduce a loss spike on purpose¶
We turn off gradient clipping and let a rare token from the §A13 corpus (tten, from written) produce a spike when it shows up concentrated in one batch. This is the synthetic version of the post-mortem in theory file 04; this time Borja causes the spike instead of observing it.
Symptom Borja will see¶
Two runs:
- Run A (control): grad-clip threshold = 1.0, default batching, seed 42.
- Run B (break): grad-clip threshold = \(\infty\) (effectively no clipping; in code: clip = float("inf")), same seed, same batching.
By step ~312 of Run B, the loss panel will show a vertical spike from ~2.3 to ~12+, and either:
- (60% probable) recover slowly over 100-200 steps, settling 0.5-1.0 worse than Run A's loss curve, with a permanently-elevated gradient norm baseline;
- (40% probable) diverge to NaN within 5 steps and never recover.
The grad-norm panel will show a single isolated spike of 30-80× the baseline at that step.
The break, mechanically¶
In experiments/19-break-no-clip/config.yaml:
Or in code: in src/minitrain/loop.py, change
to
That's it. The whole break is removing one safety net.
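To see why an infinite threshold amounts to no clipping, here is a minimal pure-Python sketch of global-norm clipping. The helper names (`global_norm`, `clip_by_global_norm`) and the toy gradient values are assumptions made for illustration; this is not the code in src/minitrain/loop.py.

```python
import math

def global_norm(grads):
    """Joint L2 norm over a list of gradient vectors."""
    return math.sqrt(sum(g * g for vec in grads for g in vec))

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their joint L2 norm is at most max_norm.

    Hypothetical helper for illustration only.
    """
    total_norm = global_norm(grads)
    # The scale factor is capped at 1, so small gradients pass through unchanged.
    # With max_norm = inf the ratio is inf and the cap always wins: no clipping.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads]

# Toy values: one "embedding row" with a large gradient, one ordinary vector.
grads = [[30.0, 40.0], [0.3, 0.4]]

clipped_a = clip_by_global_norm(grads, max_norm=1.0)           # Run A
clipped_b = clip_by_global_norm(grads, max_norm=float("inf"))  # Run B

print(f"pre-clip norm:       {global_norm(grads):.2f}")
print(f"post-clip, clip=1.0: {global_norm(clipped_a):.2f}")
print(f"post-clip, clip=inf: {global_norm(clipped_b):.2f}")
```

With a finite threshold the whole gradient is rescaled, so the direction is kept and only the step length shrinks. With an infinite threshold the post-clip norm equals the pre-clip norm.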
Why this teaches the concept¶
At §A13 scale, the BPE tokenizer (Phase 11) produces a token vocabulary where the verb write and its conjugations (writes, wrote, written, writing) split into multi-token sequences. The token tten (from written) is rare — it appears about 5 times in the 240-sentence training set, only inside conjugations of write.
When stochastic batching happens to put 3 sentences containing written into the same batch of 8, the rare token's embedding row receives a gradient signal proportional to 3 instances of "this row was wrong by ~\(\ln V\) nats". The single-row gradient has Frobenius norm \(\sim 50\), and the global gradient norm is dominated by this one row.
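How often should such a concentration happen? A rough estimate is a hypergeometric tail. The sketch below assumes the roughly 5 occurrences of tten sit in 5 distinct sentences and that each batch of 8 is drawn uniformly without replacement from the 240 sentences. Both are assumptions, and `prob_at_least` is a hypothetical helper.

```python
from math import comb

def prob_at_least(k_min, n_total=240, n_rare=5, batch_size=8):
    """P(a uniformly drawn batch holds at least k_min rare sentences)."""
    total = comb(n_total, batch_size)
    hits = sum(
        comb(n_rare, k) * comb(n_total - n_rare, batch_size - k)
        for k in range(k_min, min(n_rare, batch_size) + 1)
    )
    return hits / total

p = prob_at_least(3)
print(f"P(>=3 'written' sentences in one batch) = {p:.2e}")
```

Under these assumptions the per-batch probability is small, so reproducing the same concentrated batch at the same step depends on keeping the seed and the batching fixed.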
Without clipping: the optimizer takes a giant step on the tten embedding row (and a smaller-but-still-large step on the other parameters, because AdamW's per-parameter second-moment estimates still reflect the smaller gradients of earlier batches). The model moves to a part of parameter space where:
- The tten embedding is overshot: gradients on the next batch over-correct, and the row oscillates.
- The moment estimates \(v_t\) now contain a spike that takes \(\sim 1/(1 - \beta_2) = 20\) steps to fade.
- Other parameters have been updated with lr · m̂ / √v̂, where v̂ is smaller than it should be for this batch (it was updated with last batch's smaller g²), so their updates are too aggressive too.
Result: the spike is not a single-step event but a multi-step destabilization. The "recovery" the loss curve shows is actually the optimizer slowly re-calibrating its moments after a corruption.
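The slow fade of the second-moment estimate can be sketched with a scalar gradient stream. The sketch assumes \(\beta_2 = 0.95\) (the value implied by \(1/(1 - \beta_2) = 20\)), a baseline gradient of 0.6 and a single spike of 50. `second_moment_trace` is a hypothetical helper, not part of the training loop.

```python
def second_moment_trace(grads, beta2=0.95):
    """Bias-corrected Adam second-moment estimate for a scalar gradient stream."""
    v = 0.0
    trace = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
        trace.append(v / (1.0 - beta2 ** t))
    return trace

baseline, spike = 0.6, 50.0  # assumed values, taken from the panels described above
grads = [baseline] * 300 + [spike] + [baseline] * 200
v_hat = second_moment_trace(grads)

spike_idx = 300           # index of the spike in the list
steady = baseline ** 2    # value v_hat settles at for a constant gradient
for k in (0, 20, 60, 120):
    excess = v_hat[spike_idx + k] - steady
    print(f"{k:3d} steps after spike: excess v_hat = {excess:.3f}")
```

The excess shrinks by roughly a factor of e every 20 steps. Because the spike's g² is thousands of times the baseline, getting back near the baseline takes several of those time constants in this toy setting, not just one.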
This is the §A13-scale version of the failure mode Chowdhery et al. (2022) describe for PaLM. Same shape, smaller numbers.
Diagnostic ladder Borja should walk¶
- First check: the loss panel. The spike is at step 312, sharp and unmissable.
- Second check: the grad-norm panel. Pre-clip norm at step 312 is ~50, baseline ~0.6. Post-clip norm is... also ~50 (because there is no clip). This is the smoking gun.
- Third check: the batch composition log at step 312 (Phase 19's instrumentation includes this). It shows 3 sentences containing the verb write in past-participle form.
- Fourth check: the per-token loss histogram at step 312. There's a heavy right tail with mass concentrated on the tten token.
- Diagnosis: rare token + concentrated batch + no clip = a single-batch trigger that destabilizes the optimizer for many steps.
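The second check on the ladder can also be automated. The sketch below scans logged pre-clip and post-clip norms for steps where the norm jumps far above its recent median and the post-clip value still exceeds the expected threshold. The log layout (two plain lists indexed by step), the function name, and the `ratio` and `window` defaults are assumptions, not Phase 19's actual instrumentation.

```python
from statistics import median

def find_unclipped_spikes(pre_clip, post_clip, clip_threshold, ratio=10.0, window=50):
    """Return (step, pre, post) for spikes that the clip failed to contain."""
    flagged = []
    for step, (pre, post) in enumerate(zip(pre_clip, post_clip)):
        history = pre_clip[max(0, step - window):step]
        if not history:
            continue  # nothing to compare against at step 0
        is_spike = pre > ratio * median(history)
        # Small tolerance so a correctly clipped norm of ~threshold is not flagged.
        clip_ineffective = post > clip_threshold * 1.01
        if is_spike and clip_ineffective:
            flagged.append((step, pre, post))
    return flagged

# Synthetic log: baseline 0.6, one spike of 50 at step 312.
pre = [0.6] * 400
pre[312] = 50.0
post_run_b = list(pre)                      # no clipping: post equals pre
post_run_a = [min(p, 1.0) for p in pre]     # clip at 1.0

print(find_unclipped_spikes(pre, post_run_b, clip_threshold=1.0))
print(find_unclipped_spikes(pre, post_run_a, clip_threshold=1.0))
```

On this synthetic log the unclipped run is flagged at step 312 and the clipped run produces an empty list.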
Reproducer¶
# Control
seed=42 grad_clip=1.0 just phase-19-train
# Break
seed=42 grad_clip=inf just phase-19-train
# Compare
just phase-19-compare experiments/19-control experiments/19-break-no-clip
Hint cascade¶
- (Mild) "Look at the grad-norm panel near the loss spike. Does anything in that panel hint at the cause?"
- (Medium) "What is the post-clip grad norm at step 312? What does it tell you about the clip threshold?"
- (Direct) "The clip is disabled. With a rare token concentrated in one batch, what is the single-step impact on the optimizer's moment estimates?"
Fix¶
Restore grad_clip = 1.0. Or, to teach a complementary lesson, set grad_clip = 0.5 and observe that the tighter threshold sits just below the rolling-mean grad norm (0.6), so most steps are scaled down only slightly (by about 0.5/0.6 ≈ 0.83), while the spike at step 312 is contained.
Either fix demonstrates: gradient clipping is the cheap defense against this failure mode. The deeper fix — stratified batching to prevent rare-token concentration — is the correct defense, but requires a data-loader change.
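One possible shape for the data-loader change is sketched below: deal the rare-token sentences into distinct batches first, then fill each batch with ordinary sentences. The function name, the cap of one rare sentence per batch, and the example sentence ids are all hypothetical. This is not the Phase 19 loader.

```python
import random

def stratified_batches(n_sentences, rare_ids, batch_size, max_rare_per_batch=1, seed=0):
    """Build one epoch of batches with at most max_rare_per_batch rare sentences each."""
    rng = random.Random(seed)
    rare_set = set(rare_ids)
    rare = sorted(rare_set)
    common = [i for i in range(n_sentences) if i not in rare_set]
    rng.shuffle(rare)
    rng.shuffle(common)

    n_batches = -(-n_sentences // batch_size)  # ceiling division
    if len(rare) > n_batches * max_rare_per_batch:
        raise ValueError("too many rare sentences for the requested cap")

    batches = [[] for _ in range(n_batches)]
    # Each batch index appears max_rare_per_batch times, so sampling slots
    # without replacement enforces the cap.
    slots = [b for b in range(n_batches) for _ in range(max_rare_per_batch)]
    for idx, b in zip(rare, rng.sample(slots, len(rare))):
        batches[b].append(idx)

    # Fill every batch up to batch_size with ordinary sentences.
    remaining = iter(common)
    for batch in batches:
        while len(batch) < batch_size:
            nxt = next(remaining, None)
            if nxt is None:
                break
            batch.append(nxt)
        rng.shuffle(batch)
    return batches

rare_ids = [3, 17, 88, 150, 201]  # hypothetical ids of sentences containing "written"
batches = stratified_batches(240, rare_ids, batch_size=8)
worst = max(sum(i in set(rare_ids) for i in b) for b in batches)
print(f"{len(batches)} batches, max rare sentences in any batch: {worst}")
```

The trade-off is that batches are no longer uniform random draws, so this changes the sampling distribution slightly in exchange for removing the concentration risk.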
What this break is NOT¶
- Not a numerical-overflow break (we're in fp32 throughout).
- Not an init break (the model starts healthy, the spike happens at step 312, not step 0).
- Not an LR schedule break (LR is smooth cosine).
It is a defenses-removed break, and it teaches that grad-clip is not optional — it's the cheap insurance that lets the optimizer survive a probabilistic concentration of long-tail tokens in a single batch.
Cross-refs¶
- theory/04-loss-spike-postmortem-template.md: the worked example matches this break.
- stability-check.md §2: the spike-detection decision tree.
- Phase 18 theory/02-optimizer-and-schedule.md: the gradient-clipping math.