Break — train without grad clipping; reproduce a loss spike on purpose

We turn off gradient clipping and let a rare token from the §A13 corpus (tten, from written) produce a spike when it appears concentrated in a single batch. This is the synthetic version of the post-mortem in theory file 04: this time Borja causes the spike instead of observing it.


Symptom Borja will see

Two runs:

  • Run A (control): grad-clip threshold = 1.0, default batching, seed 42.
  • Run B (break): grad-clip threshold = \(\infty\) (effectively no clipping; in code: clip = float("inf")), same seed, same batching.

Around step 312 of Run B, the loss panel will show a vertical spike from ~2.3 to ~12+. After the spike, the run will either:

  • (60% probable) recover slowly over 100-200 steps, settling at a loss 0.5-1.0 higher than Run A's curve, with a permanently elevated gradient-norm baseline;
  • (40% probable) diverge to NaN within 5 steps and never recover.

The grad-norm panel will show a single isolated spike of 30-80× the baseline at that step.

The break, mechanically

In experiments/19-break-no-clip/config.yaml:

# Run B (break)
optimizer:
  name: adamw
  weight_decay: 0.1
  grad_clip: null   # was 1.0 in Run A

Or in code: in src/minitrain/loop.py, change

clip_factor = min(1.0, self.clip / (g_norm + 1e-12))

to

clip_factor = 1.0   # the break

That's it. The whole break is removing one safety net.
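To make the two branches concrete, here is a minimal sketch of the clip-factor computation. The helper name `clip_factor` and the handling of `None` (for the YAML `null`) are assumptions for illustration, not the actual contents of loop.py; only the `min(1.0, clip / (g_norm + 1e-12))` expression is taken from the snippet above. The norms 0.6 and 50 are the baseline and spike values quoted elsewhere in this document.

```python
import math

def clip_factor(g_norm, clip):
    """Scale factor applied to every gradient under global-norm clipping.

    Hypothetical helper: `clip=None` (YAML null) and `clip=inf` are both
    treated as "no clipping", which is an assumption about the config loader.
    """
    if clip is None or math.isinf(clip):
        return 1.0  # the break: gradients pass through unscaled
    return min(1.0, clip / (g_norm + 1e-12))

# Compare a baseline step (norm ~0.6) with the spike step (norm ~50).
for g_norm in (0.6, 50.0):
    with_clip = clip_factor(g_norm, 1.0)
    without_clip = clip_factor(g_norm, float("inf"))
    print(f"g_norm={g_norm}: Run A factor={with_clip:.3f}, Run B factor={without_clip:.3f}")
```

In Run A the spike step is scaled down by roughly a factor of 50; in Run B it is applied at full size.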

Why this teaches the concept

At §A13 scale, the BPE tokenizer (Phase 11) produces a token vocabulary where the verb write and its conjugations (writes, wrote, written, writing) split into multi-token sequences. The token tten (from written) is rare — it appears about 5 times in the 240-sentence training set, only inside conjugations of write.

When stochastic batching happens to put 3 sentences containing written into the same batch of 8, the rare token's embedding row receives a gradient signal proportional to 3 instances of "this row was wrong by ~\(\ln V\) nats". The single-row gradient has Frobenius norm \(\sim 50\), and the global gradient norm is dominated by this one row.
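How unlucky is such a batch? A rough estimate can be sketched with the hypergeometric distribution. The sketch assumes the ~5 occurrences of tten sit in 5 distinct sentences, that each batch is a uniform sample of 8 sentences from the 240, and that batches are drawn independently; the real data loader may shuffle per epoch instead, so treat the numbers as an order-of-magnitude guide only. The function name is hypothetical.

```python
from math import comb

def p_at_least(k_min, n_total=240, n_rare=5, batch_size=8):
    """P(a uniformly sampled batch holds >= k_min of the rare sentences)."""
    total = comb(n_total, batch_size)
    hits = 0
    for k in range(k_min, min(n_rare, batch_size) + 1):
        # Choose k rare sentences and fill the rest of the batch with common ones.
        hits += comb(n_rare, k) * comb(n_total - n_rare, batch_size - k)
    return hits / total

p_batch = p_at_least(3)
# Chance of at least one such batch among 312 independent draws (an assumption).
p_by_312 = 1.0 - (1.0 - p_batch) ** 312
print(f"per batch: {p_batch:.2e}, within 312 steps: {p_by_312:.2%}")
```

The per-batch probability is small, which is why the fixed seed matters: it pins the run to one particular shuffle in which the concentration does occur.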

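Without clipping: the optimizer takes a giant step on the tten embedding row, and a smaller-but-still-large step on every other parameter whose gradient was inflated by the same batch. AdamW keeps its moment estimates per parameter, not globally, but none of them has yet adapted to gradients this large. The model moves to a part of parameter space where: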

  • The tten embedding is overshot — gradients on the next batch over-correct, oscillating.
  • The moment estimates \(v_t\) now contain a spike whose contribution takes \(\sim 1/(1 - \beta_2) = 20\) steps (that is, \(\beta_2 = 0.95\)) to decay by a factor of \(e\).
  • Other parameters have been updated with lr · m̂ / √v̂ where v̂ is smaller than it should be for this batch (it still mostly reflects earlier batches' smaller g²), so their updates are too aggressive too.

Result: the spike is not a single-step event but a multi-step destabilization. The "recovery" the loss curve shows is actually the optimizer slowly re-calibrating its moments after a corruption.
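The slow fading of the second-moment spike can be illustrated with a scalar sketch. It tracks only the exponential moving average of g² for one parameter, without bias correction, and assumes β2 = 0.95 (the value implied by 1/(1 − β2) = 20) together with the gradient magnitudes 0.6 and 50 quoted in this document. The function name is hypothetical, and the trace is synthetic, not output from the actual run.

```python
import math

def second_moment_trace(grads, beta2=0.95):
    """Exponential moving average of g^2, as in Adam's v_t (no bias correction)."""
    v, trace = 0.0, []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        trace.append(v)
    return trace

# 200 baseline steps (|g| ~ 0.6), one spike (|g| ~ 50), then 200 baseline steps.
grads = [0.6] * 200 + [50.0] + [0.6] * 200
trace = second_moment_trace(grads)

baseline_v = trace[199]                      # settled value just before the spike
excess_at_spike = trace[200] - baseline_v    # what the spike added to v_t
# Count steps until the spike's excess contribution has decayed below 1/e.
steps = next(k for k in range(1, 200)
             if trace[200 + k] - baseline_v < excess_at_spike / math.e)
print(f"excess decays below 1/e after {steps} steps")
```

Under these assumptions the excess falls below 1/e of its peak after 20 steps, in line with the 1/(1 − β2) estimate; falling back to near the baseline takes considerably longer.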

This is the §A13-scale version of the failure mode Chowdhery et al. (2022) describe for PaLM. Same shape, smaller numbers.

Diagnostic ladder Borja should walk

  1. First check: the loss panel. The spike is at step 312, sharp and unmissable.
  2. Second check: the grad-norm panel. Pre-clip norm at step 312 is ~50, baseline ~0.6. Post-clip norm is... also ~50 (because there is no clip). This is the smoking gun.
  3. Third check: the batch composition log at step 312 (Phase 19's instrumentation includes this). It shows 3 sentences containing the verb write in past-participle form.
  4. Fourth check: the per-token loss histogram at step 312. There's a heavy right tail with mass concentrated on the tten token.
  5. Diagnosis: rare token + concentrated batch + no clip = a single-step trigger that turns into a multi-step destabilization.
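The first two rungs of the ladder can also be checked programmatically. The sketch below flags steps whose grad norm exceeds a multiple of the trailing median, then tests whether the post-clip norm ever exceeds the configured threshold. The function name, the `window` and `factor` defaults, and the synthetic log are all assumptions for illustration; this is not the decision tree from stability-check.md.

```python
from statistics import median

def find_spikes(norms, window=50, factor=10.0):
    """Return step indices whose value exceeds `factor` x the trailing median."""
    spikes = []
    for step, value in enumerate(norms):
        history = norms[max(0, step - window):step]
        if len(history) < 5:
            continue  # not enough history to estimate a baseline
        if value > factor * median(history):
            spikes.append(step)
    return spikes

# Synthetic pre-clip grad-norm log: baseline ~0.6 with one spike at step 312.
pre_clip = [0.6] * 400
pre_clip[312] = 50.0
post_clip = list(pre_clip)  # with clipping disabled, the two logs coincide

spike_steps = find_spikes(pre_clip)
# A post-clip norm above the intended threshold means the clip is not applied.
clip_inactive = any(post_clip[s] > 1.0 for s in spike_steps)
print(spike_steps, clip_inactive)
```

The median is used instead of the mean so that the spike itself does not inflate the baseline for the steps that follow it.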

Reproducer

# Control
seed=42 grad_clip=1.0 just phase-19-train

# Break
seed=42 grad_clip=inf just phase-19-train

# Compare
just phase-19-compare experiments/19-control experiments/19-break-no-clip

Hint cascade

  1. (Mild) "Look at the grad-norm panel near the loss spike. Does anything in that panel hint at the cause?"
  2. (Medium) "What is the post-clip grad norm at step 312? What does it tell you about the clip threshold?"
  3. (Direct) "The clip is disabled. With a rare token concentrated in one batch, what is the single-step impact on the optimizer's moment estimates?"

Fix

Restore grad_clip = 1.0. Or, to teach a complementary lesson, set grad_clip = 0.5 and observe that the tighter threshold sits just under the rolling-mean grad norm (0.6), so most steps are scaled down slightly (by a factor of about 0.5 / 0.6 ≈ 0.83), while the spike at step 312 is contained.
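The effect of the two thresholds on a typical step and on the spike step can be checked with a few lines of arithmetic. This is a sketch using the norms quoted in this document (0.6 baseline, ~50 at step 312); `post_clip_norm` is a hypothetical helper, not a function from the codebase.

```python
def post_clip_norm(g_norm, clip):
    # Global-norm clipping rescales the gradient so its norm never exceeds `clip`.
    return g_norm * min(1.0, clip / (g_norm + 1e-12))

results = {}
for clip in (1.0, 0.5):
    baseline = post_clip_norm(0.6, clip)   # typical step
    spike = post_clip_norm(50.0, clip)     # step 312
    results[clip] = (baseline, spike)
    print(f"clip={clip}: baseline 0.6 -> {baseline:.2f}, spike 50 -> {spike:.2f}")
```

With these numbers, a threshold of 1.0 leaves the baseline step at 0.6 and caps the spike at 1.0, while a threshold of 0.5 caps both at 0.5.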

Either fix demonstrates: gradient clipping is the cheap defense against this failure mode. The deeper fix — stratified batching to prevent rare-token concentration — is the correct defense, but requires a data-loader change.
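As a rough idea of what the data-loader change could look like, the sketch below deals the rare-token sentences round-robin across batches before filling the remaining slots with the other sentences. Everything here is hypothetical: the function name, the seed handling, and the example `rare_ids` are invented for illustration, and the real loader may need a different strategy (for example, stratifying by token frequency rather than by a fixed list of sentence indices).

```python
import random

def stratified_batches(n_items, rare_ids, batch_size, seed=0):
    """Shuffle indices into batches while spreading rare items across batches."""
    rng = random.Random(seed)
    rare = sorted(set(rare_ids))
    rare_set = set(rare)
    common = [i for i in range(n_items) if i not in rare_set]
    rng.shuffle(rare)
    rng.shuffle(common)

    n_batches = -(-n_items // batch_size)  # ceiling division
    batches = [[] for _ in range(n_batches)]
    # Deal rare items round-robin so no batch gets more than its share.
    for k, idx in enumerate(rare):
        batches[k % n_batches].append(idx)
    # Fill the remaining slots with common items.
    pool = iter(common)
    for batch in batches:
        while len(batch) < batch_size:
            idx = next(pool, None)
            if idx is None:
                break
            batch.append(idx)
    # Randomize order within and across batches.
    for batch in batches:
        rng.shuffle(batch)
    rng.shuffle(batches)
    return [b for b in batches if b]

# Hypothetical positions of the 5 sentences that contain "written".
rare_ids = [3, 57, 101, 180, 222]
batches = stratified_batches(240, rare_ids, batch_size=8)
worst = max(sum(i in rare_ids for i in b) for b in batches)
print(len(batches), worst)
```

With 30 batches and 5 rare sentences, no batch receives more than one of them, so the 3-in-one-batch concentration cannot occur.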

What this break is NOT

  • Not a numerical-overflow break (we're in fp32 throughout).
  • Not an init break (the model starts healthy, the spike happens at step 312, not step 0).
  • Not an LR schedule break (LR is smooth cosine).

It is a defenses-removed break, and it teaches that grad-clip is not optional — it's the cheap insurance that lets the optimizer survive a probabilistic concentration of long-tail tokens in a single batch.

Cross-refs

  • theory/04-loss-spike-postmortem-template.md — the worked example matches this break.
  • stability-check.md §2 — the spike-detection decision tree.
  • Phase 18 theory/02-optimizer-and-schedule.md — the gradient-clipping math.