Break — train without grad clipping; reproduce a loss spike on purpose¶

We turn off gradient clipping and let a rare token from the §A13 corpus (tten, from written) produce a spike when it shows up concentrated in a single batch. This is the synthetic version of the post-mortem in theory file 04: this time Borja causes the spike instead of observing it.

Symptom Borja will see¶

Two runs:

Run A (control): grad-clip threshold = 1.0, default batching, seed 42.
Run B (break): grad-clip threshold = \(\infty\) (effectively no clipping; in code: clip = float("inf")), same seed, same batching.

By step ~312 of Run B, the loss panel will show a vertical spike from ~2.3 to ~12+, and either:

(60% probable) recover slowly over 100-200 steps, settling 0.5-1.0 above Run A's loss curve, with a permanently elevated gradient-norm baseline;
(40% probable) diverge to NaN within 5 steps and never recover.

The grad-norm panel will show a single isolated spike of 30-80× the baseline at that step.

The break, mechanically¶

In experiments/19-break-no-clip/config.yaml:

# Run B (break)
optimizer:
  name: adamw
  weight_decay: 0.1
  grad_clip: null   # was 1.0 in Run A

Or in code: in src/minitrain/loop.py, change

clip_factor = min(1.0, self.clip / (g_norm + 1e-12))

to

clip_factor = 1.0   # the break

That's it. The whole break is removing one safety net.
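
To see what that one line does numerically, here is a minimal, framework-free sketch of the clip factor, evaluated at the approximate grad norms quoted on this page (baseline ~0.6, spike ~50). Wrapping the expression in a function named clip_factor is an illustration only, not the structure of loop.py.

```python
def clip_factor(g_norm: float, clip: float) -> float:
    """Scale applied to every gradient so the global norm is at most `clip`."""
    return min(1.0, clip / (g_norm + 1e-12))

baseline, spike = 0.6, 50.0  # approximate grad norms quoted in this write-up

for clip in (1.0, float("inf")):
    print(
        f"clip={clip}: baseline x{clip_factor(baseline, clip):.3f}, "
        f"spike x{clip_factor(spike, clip):.3f}"
    )
```

With clip = 1.0 the spike step is scaled by about 0.02 while baseline steps pass through unchanged. With clip = inf both factors are 1.0, which is equivalent to hard-coding clip_factor = 1.0.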

Why this teaches the concept¶

At §A13 scale, the BPE tokenizer (Phase 11) produces a vocabulary in which the verb write and its conjugations (writes, wrote, written, writing) split into multi-token sequences. The token tten (from written) is rare: it appears about 5 times in the 240-sentence training set, and only inside written.

When stochastic batching happens to put 3 sentences containing written into the same batch of 8, the rare token's embedding row receives a gradient signal proportional to 3 instances of "this row was wrong by ~\(\ln V\) nats". The single-row gradient has Frobenius norm \(\sim 50\), and the global gradient norm is dominated by this one row.
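
How unlikely is such a batch? A rough estimate is the hypergeometric tail below. It assumes that each of the ~5 occurrences of tten sits in a different sentence and that every batch is a uniform sample of 8 sentences without replacement from the 240. Both are simplifying assumptions, not confirmed properties of the data loader.

```python
from math import comb

def prob_at_least(k_min, n_total=240, n_rare=5, batch_size=8):
    """P(batch holds >= k_min rare sentences), uniform sampling without replacement."""
    total = comb(n_total, batch_size)
    favourable = sum(
        comb(n_rare, k) * comb(n_total - n_rare, batch_size - k)
        for k in range(k_min, min(n_rare, batch_size) + 1)
    )
    return favourable / total

p_batch = prob_at_least(3)

# Chance that at least one of the first `steps` batches is that concentrated,
# treating batches as independent draws (an approximation within an epoch).
steps = 312
p_by_step = 1 - (1 - p_batch) ** steps

print(f"per batch: {p_batch:.2e}, within {steps} steps: {p_by_step:.2%}")
```

The per-batch probability is small, which is why the spike behaves like a rare event tied to a particular seed rather than something every run hits immediately.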

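Without clipping, the optimizer takes a giant step on the tten embedding row and a smaller but still large step on the other parameters whose gradients jump on this batch. AdamW's moment estimates are per-parameter rather than global, but each affected parameter is still normalized by a second-moment estimate built from earlier, smaller gradients. The model moves to a part of parameter space where: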

The tten embedding is overshot — gradients on the next batch over-correct, oscillating.
The second-moment estimates \(v_t\) now contain a spike that decays with a time constant of \(\sim 1/(1 - \beta_2) = 20\) steps (that is, \(\beta_2 = 0.95\)), so it takes several multiples of 20 steps to fade completely.
Other parameters have been updated with lr · m̂ / √v̂ where v̂ is smaller than this batch's gradients warrant (it still mostly reflects earlier batches' smaller g²), so their updates are too aggressive as well.

Result: the spike is not a single-step event but a multi-step destabilization. The "recovery" the loss curve shows is actually the optimizer slowly re-calibrating its moments after a corruption.
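
The slow fade of the second moment can be checked with a scalar sketch. It assumes \(\beta_2 = 0.95\) (the value implied by the 20-step figure above), uses the approximate norms quoted on this page (baseline 0.6, spike 50), and ignores bias correction and the first moment. It illustrates the decay; it does not reproduce the training loop.

```python
BETA2 = 0.95                 # assumed; gives 1 / (1 - BETA2) = 20
BASELINE, SPIKE = 0.6, 50.0  # approximate grad norms from this write-up

def steps_until_recovered(tolerance=2.0):
    """Count steps after one spike until v drops below tolerance * steady-state v."""
    v_steady = BASELINE ** 2  # fixed point of v when g is constant
    # Apply the single spike step to an otherwise settled estimate.
    v = BETA2 * v_steady + (1 - BETA2) * SPIKE ** 2
    steps = 0
    while v >= tolerance * v_steady:
        v = BETA2 * v + (1 - BETA2) * BASELINE ** 2
        steps += 1
    return steps

print(steps_until_recovered())
```

Under these assumptions the excess in v shrinks by a factor of 0.95 per step, so a spike roughly 80× the baseline needs several time constants, not one, before the estimate returns to its normal scale.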

This is the §A13-scale version of the failure mode Chowdhery et al. (2022) describe for PaLM. Same shape, smaller numbers.

Diagnostic ladder Borja should walk¶

First check: the loss panel. The spike is at step 312, sharp and unmissable.
Second check: the grad-norm panel. Pre-clip norm at step 312 is ~50, baseline ~0.6. Post-clip norm is... also ~50 (because there is no clip). This is the smoking gun.
Third check: the batch composition log at step 312 (Phase 19's instrumentation includes this). It shows 3 sentences containing the verb write in past-participle form.
Fourth check: the per-token loss histogram at step 312. There's a heavy right tail with mass concentrated on the tten token.
Diagnosis: rare token + concentrated batch + no clip = one bad step that triggers a multi-step destabilization.
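
The grad-norm check can be automated. The sketch below flags any step whose grad norm exceeds a multiple of the rolling median of the preceding steps. The function name, window, and ratio are illustrative choices, not part of Phase 19's instrumentation, and the input series is synthetic.

```python
from collections import deque
from statistics import median

def find_spikes(norms, window=50, ratio=10.0):
    """Return step indices whose norm exceeds ratio x the median of the previous `window` steps."""
    history = deque(maxlen=window)
    spikes = []
    for step, norm in enumerate(norms):
        # Only judge a step once a full window of history is available.
        if len(history) == window and norm > ratio * median(history):
            spikes.append(step)
        history.append(norm)
    return spikes

# Synthetic series shaped like Run B: flat baseline with one spike.
norms = [0.6] * 400
norms[312] = 50.0
print(find_spikes(norms))  # [312]
```

A median is used instead of a mean so that the spike itself does not inflate the baseline for the steps that follow it.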

Reproducer¶

# Control
seed=42 grad_clip=1.0 just phase-19-train

# Break
seed=42 grad_clip=inf just phase-19-train

# Compare
just phase-19-compare experiments/19-control experiments/19-break-no-clip

Hint cascade¶

(Mild) "Look at the grad-norm panel near the loss spike. Does anything in that panel hint at the cause?"
(Medium) "What is the post-clip grad norm at step 312? What does it tell you about the clip threshold?"
(Direct) "The clip is disabled. With a rare token concentrated in one batch, what is the single-step impact on the optimizer's moment estimates?"

Fix¶

Restore grad_clip = 1.0. Or, to teach a complementary lesson, set grad_clip = 0.5 and observe that the tighter threshold sits just under the rolling-mean grad norm (0.6), so most steps are scaled down slightly (by about 0.5/0.6 ≈ 0.83), while the spike at step 312 is contained.

Either fix demonstrates: gradient clipping is the cheap defense against this failure mode. The deeper fix — stratified batching to prevent rare-token concentration — is the correct defense, but requires a data-loader change.
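
A minimal sketch of what stratified batching could look like, assuming sentences are plain strings and a caller-supplied predicate marks the rare ones. The function stratified_batches and its parameters are hypothetical and are not part of the minitrain data loader.

```python
import math
import random

def stratified_batches(sentences, is_rare, batch_size, seed=0):
    """Build one epoch of batches, spreading rare sentences across batches."""
    rng = random.Random(seed)
    rare = [s for s in sentences if is_rare(s)]
    common = [s for s in sentences if not is_rare(s)]
    rng.shuffle(common)

    n_batches = math.ceil(len(sentences) / batch_size)
    batches = [[] for _ in range(n_batches)]

    # Deal rare sentences round-robin over a shuffled batch order, so no batch
    # receives a second rare sentence until every batch already has one.
    order = list(range(n_batches))
    rng.shuffle(order)
    for i, sentence in enumerate(rare):
        batches[order[i % n_batches]].append(sentence)

    # Fill the remaining slots with common sentences.
    remaining = iter(common)
    for batch in batches:
        while len(batch) < batch_size:
            sentence = next(remaining, None)
            if sentence is None:
                break
            batch.append(sentence)
        rng.shuffle(batch)  # hide where in the batch the rare sentence landed
    return batches

# Toy corpus with the same shape as the numbers on this page: 240 sentences, 5 rare.
sentences = [f"filler {i}" for i in range(235)] + [f"written {i}" for i in range(5)]
batches = stratified_batches(sentences, lambda s: "written" in s, batch_size=8)
print(max(sum("written" in s for s in b) for b in batches))  # 1
```

The trade-off is that batches are no longer independent uniform samples, which is why this counts as a data-loader change and not a one-line config fix.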

What this break is NOT¶

Not a numerical-overflow break (we're in fp32 throughout).
Not an init break (the model starts healthy, the spike happens at step 312, not step 0).
Not an LR schedule break (LR is smooth cosine).

It is a defenses-removed break, and it teaches that grad-clip is not optional — it's the cheap insurance that lets the optimizer survive a probabilistic concentration of long-tail tokens in a single batch.

Cross-refs¶

theory/04-loss-spike-postmortem-template.md — the worked example matches this break.
stability-check.md §2 — the spike-detection decision tree.
Phase 18 theory/02-optimizer-and-schedule.md — the gradient-clipping math.