04 — Loss-spike post-mortem template (running example: §A13 grammar tutor)¶
A loss spike is not a mystery: it is a story with a subject, a verb, and a predicate. This template separates symptoms (what you saw) from hypotheses (what you believe) from remediation (what you change). We apply it to the grammar-tutor training so that the next time you see a sawtooth on the dashboard, you write the post-mortem in 20 minutes, not 2 hours.
A reusable template for documenting and recovering from training instability. The format is opinionated — every section must be filled. Empty sections are themselves diagnostic.
We work the template through a real (engineered) §A13 grammar-tutor spike below.
The template¶
# Loss spike post-mortem — <date> — <run-id>
## 1. Symptoms (what you observed; numbers only, no theories)
- step where spike began: <int>
- pre-spike loss (50-step EMA): <float>
- peak loss: <float>
- recovery loss (50 steps after peak), or "did not recover": <float or "NaN">
- grad-norm pre-clip at spike: <float>
- LR at spike: <float>
- dtype regime: fp32 / fp16 / bf16
- batch index in the spike step (if logged): <int>
- corpus token(s) in that batch that look unusual: <list or "none flagged">
## 2. Hypotheses (rank by likelihood; cite the dashboard panel that supports each)
1. <hypothesis>: supported by <panel/observation>
2. <hypothesis>: supported by <panel/observation>
3. <hypothesis>: supported by <panel/observation>
## 3. Discriminating evidence (which panels rule each hypothesis in or out)
- Panel <n>: <observation> → favors H<k>, rules out H<j>
- Panel <n>: <observation> → ...
## 4. Root cause (a single sentence; if you can't write one, you don't know yet)
<one sentence>
## 5. Remediation (the smallest change that fixes the root cause)
- <change>
## 6. Verification (the test that confirms the fix)
- Re-run with seed <s>, expect loss curve within <ε> of baseline through step <T>.
## 7. Lesson (one line, added to learners/<name>/phase-NN/notes/lessons.md)
<one line>
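The loss-related fields of section 1 can be filled mechanically from a logged loss series. The sketch below is one way to do it, not part of the course scripts: the function names, the EMA convention (`alpha = 2 / (span + 1)`), and the choice to search for the peak within one window after the spike step are all assumptions made for illustration.

```python
import math


def ema_last(values, span=50):
    """Final value of an exponential moving average over `values`.

    Assumes alpha = 2 / (span + 1); other EMA conventions exist.
    """
    alpha = 2.0 / (span + 1)
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1.0 - alpha) * avg
    return avg


def spike_symptoms(losses, spike_step, window=50):
    """Fill the loss fields of section 1 from a per-step loss list.

    `losses[s]` is the loss logged at step s; `spike_step` must be >= 1.
    """
    pre_spike = ema_last(losses[:spike_step], span=window)
    # Look for the peak among finite losses in the window after the spike began.
    end = min(len(losses), spike_step + window)
    candidates = [s for s in range(spike_step, end) if math.isfinite(losses[s])]
    if not candidates:
        return {"pre_spike_ema": pre_spike, "peak_step": None,
                "peak_loss": None, "recovery_loss": "did not recover"}
    peak_step = max(candidates, key=lambda s: losses[s])
    # Recovery is read `window` steps after the peak (or at the last logged step).
    rec_step = min(peak_step + window, len(losses) - 1)
    rec = losses[rec_step]
    return {
        "pre_spike_ema": pre_spike,
        "peak_step": peak_step,
        "peak_loss": losses[peak_step],
        "recovery_loss": rec if math.isfinite(rec) else "did not recover",
    }
```

Calling `spike_symptoms(losses, spike_step=312)` on a run's loss log returns a dictionary whose values can be pasted into the template; the grad-norm, LR, and batch fields still have to come from their own logs.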
Worked example — §A13 grammar tutor, run 19-spike-001¶
Setup: training the mini-GPT from Phase 17 on the §A13 corpus. AdamW, \(\eta_\text{max} = 3 \times 10^{-4}\), warmup 100, cosine to 2000 steps, batch size 8, fp32. At step 312 the loss jumps from 2.31 to 5.84, then settles at 3.40 by step 360.
1. Symptoms¶
- step where spike began: 312
- pre-spike loss (50-step EMA): 2.31
- peak loss (step 313): 5.84
- recovery loss (step 360): 3.40
- grad-norm pre-clip at spike: 47.2 (baseline rolling: 0.6)
- LR at spike: \(2.94 \times 10^{-4}\) (cosine, near peak)
- dtype: fp32
- batch index 312: contained 3 sentences with the verb `write` in past-participle form (`written`), and the BPE tokenizer split `written` as `wri` + `tten`.
- unusual: yes, `tten` is a low-frequency BPE token (appears only in conjugations of `write` and `bitten`-like forms, ~5 occurrences in the entire 240-sentence train set).
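The "unusual tokens" field can be filled by comparing a batch against train-set token frequencies. The following is a minimal sketch, assuming each sentence is available as a list of BPE token strings; the function name and the default threshold of 10 occurrences (borrowed from the rare-token definition used in the remediation section) are illustrative choices.

```python
from collections import Counter


def rare_tokens_in_batch(train_sentences, batch_sentences, rare_below=10):
    """Return {token: occurrences in batch} for tokens that are rare in train.

    A token is treated as rare when it occurs fewer than `rare_below`
    times in the whole train set (an assumed definition).
    """
    train_freq = Counter(tok for sent in train_sentences for tok in sent)
    batch_freq = Counter(tok for sent in batch_sentences for tok in sent)
    return {tok: n for tok, n in batch_freq.items() if train_freq[tok] < rare_below}
```

An empty result corresponds to "none flagged" in the template; a rare token with a batch count above 1 is the concentration this post-mortem is about.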
2. Hypotheses (ranked)¶
1. Long-tail token outlier. A rare token (`tten`) appeared in 3 sentences of a single batch (a chance concentration). Its embedding has not been updated often enough to be well calibrated, so the cross-entropy loss on this token is very large and produces a very large gradient on its embedding row. Supported by: grad-norm panel spike + the batch composition log.
2. Numerical (fp32 round-off). Unlikely in fp32: round-off accumulates over many steps, not in a single step. Would be more plausible in fp16.
3. LR schedule bug. Possible but unlikely; the LR panel shows a smooth cosine with no jump. Ruled out.
4. Data-loader bug (same batch repeated). Check: the hash of batch 312 differs from the hashes of batches 311 and 313. Ruled out.
3. Discriminating evidence¶
- Panel 3 (grad norm pre-clip): 47.2 at step 312, baseline 0.6. Decisive for H1 (long-tail).
- Panel 4 (activations): layer-0 embedding activation at step 312 is ~3.2× baseline (because the rare `tten` token's embedding row is being weighted in three sequences of the batch). Confirms H1.
- Panel 7 (per-token loss histogram): the loss histogram at step 312 has a heavy right tail with a mode at ~12.0 (negative log-prob of `tten` given context). The mass at this mode is concentrated in 3 sentences. Confirms H1.
- Panel 2 (LR): smooth cosine, no jump. Rules out H3.
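To read the Panel 7 observation numerically: a per-token loss is a negative log-probability, so a mode at ~12.0 means the model assigned the correct token a probability of about e^-12. The sketch below converts a loss to a probability and counts high-loss tokens per sentence; the helper names, the nested-list input layout, and the tail threshold of 10.0 are assumptions for illustration.

```python
import math


def token_probability(nll):
    """Probability the model gave the correct token, from its loss in nats."""
    return math.exp(-nll)


def tail_tokens_by_sentence(per_token_nll, threshold=10.0):
    """Map sentence index -> number of tokens with loss >= threshold.

    `per_token_nll[i]` is the list of per-token losses for sentence i.
    Sentences with no tail tokens are omitted.
    """
    counts = {}
    for i, sent in enumerate(per_token_nll):
        n = sum(1 for x in sent if x >= threshold)
        if n:
            counts[i] = n
    return counts
```

`token_probability(12.0)` is roughly 6e-6, and a result from `tail_tokens_by_sentence` with exactly three keys is what "concentrated in 3 sentences" looks like in data form.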
4. Root cause¶
A single batch contained an overrepresented rare BPE token (tten) whose embedding gradient dominated the global gradient norm, exceeded the clip threshold by 47×, and pushed the optimizer's moment estimates into a region that took ~50 steps to recover from.
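The 47× figure follows from how global-norm clipping works: all gradients are rescaled by the same factor, threshold / norm, whenever the norm exceeds the threshold. The sketch below is a plain-Python illustration of that rule on lists of floats, not the training code; framework implementations differ in details such as a small epsilon in the denominator.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient values; `grads` is a list of flat lists."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads, max_norm):
    """Rescale every gradient by min(1, max_norm / norm).

    Returns (clipped_grads, pre_clip_norm).
    """
    norm = global_norm(grads)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads], norm
```

With a pre-clip norm of 47.2 and a threshold of 1.0 the scale is 1/47.2 ≈ 0.021, and it applies to every parameter, so the healthy part of the gradient is shrunk along with the outlier embedding row.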
5. Remediation¶
Two complementary changes:
- Reduce grad-clip threshold from 1.0 to 0.5. This caps the per-step damage of any future long-tail batch. The pre-clip rolling norm is 0.6, so a 0.5 threshold will also engage on ordinary steps, scaling a baseline gradient by about 0.5 / 0.6 ≈ 0.83. That mild, constant rescaling is an acceptable cost.
- Stratified batching: `scripts/build_loader.py` should ensure no batch contains more than 1 occurrence of any rare BPE token (set: tokens with \(<10\) occurrences in train). This is cheap: O(N log N) to sort and rebalance.
A third option, not taken: re-tokenize so written becomes a single token. This would hide the long-tail problem, not solve it — Phase 27 will revisit at scale.
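The stratified-batching constraint can be sketched as a greedy first-fit assignment. This is an illustration under stated assumptions, not the contents of `scripts/build_loader.py`: sentences are lists of token strings, the function name is hypothetical, and the simple first-fit loop is not optimized to the O(N log N) bound mentioned above.

```python
from collections import Counter


def stratified_batches(sentences, batch_size, rare_below=10):
    """Group sentences so no batch holds two sentences sharing a rare token.

    Rare means fewer than `rare_below` occurrences in `sentences`.
    Repeats of a rare token inside one sentence are not split up.
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    rare = {tok for tok, n in freq.items() if n < rare_below}
    rare_sets = [set(sent) & rare for sent in sentences]
    # Place the most constrained sentences (most rare tokens) first.
    order = sorted(range(len(sentences)), key=lambda i: -len(rare_sets[i]))
    batches = []  # each entry: (member indices, rare tokens already used)
    for i in order:
        for members, used in batches:
            if len(members) < batch_size and not (used & rare_sets[i]):
                members.append(i)
                used.update(rare_sets[i])
                break
        else:
            # No existing batch can take this sentence: open a new one.
            batches.append(([i], set(rare_sets[i])))
    return [[sentences[i] for i in members] for members, _ in batches]
```

The trade-off of this sketch is that it can leave some batches short of `batch_size`, and it fixes the batch composition up front; a real loader would also need to reshuffle between epochs.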
6. Verification¶
Re-run with seed 42, expect loss curve within \(\pm 0.05\) of pre-spike baseline through step 500. Specifically:
- step 312 loss: \(< 2.5\) (down from 5.84 in the broken run)
- max pre-clip grad-norm in window [300, 400]: \(< 3.0\)
- max post-clip grad-norm: \(\leq 0.5\) (the new threshold)
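The three specific criteria can be turned into an automated check on the re-run's logs. A minimal sketch, assuming per-step lists indexed by step for the loss, the pre-clip norm, and the post-clip norm; the function name, argument layout, and the small float tolerance on the clip bound are assumptions.

```python
def verify_fix(losses, pre_clip_norms, post_clip_norms,
               spike_step=312, window=(300, 400), clip_threshold=0.5):
    """Evaluate the verification criteria on a re-run; returns one bool per check."""
    lo, hi = window
    return {
        # Loss at the former spike step stays near baseline.
        "spike_step_loss_ok": losses[spike_step] < 2.5,
        # Pre-clip gradient norm stays small around the former spike.
        "pre_clip_ok": max(pre_clip_norms[lo:hi + 1]) < 3.0,
        # Post-clip norm never exceeds the threshold (tolerance for float error).
        "post_clip_ok": max(post_clip_norms[lo:hi + 1]) <= clip_threshold + 1e-6,
    }
```

A run passes when `all(verify_fix(...).values())` is true; the ±0.05 curve comparison against the baseline run is left out here because it needs the baseline log as a second input.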
7. Lesson¶
One overrepresented rare token in a single batch can dominate the global gradient norm and destabilize the optimizer for 50+ steps; stratified batching + lower clip threshold prevents recurrence.
Why the template format matters¶
The discipline of writing symptoms as numbers only before listing hypotheses prevents "I think the LR is too high" from contaminating the diagnosis before the data is even read. The discipline of ranking hypotheses (not just listing them) forces you to commit to a most-likely cause before testing. The discipline of one-sentence root cause forces you to know what you fixed.
Most loss spikes follow one of four patterns:
1. Long-tail token spike (this example): a rare BPE token concentrated in one batch.
2. fp16 overflow: fp16 with no loss scaling, where an activation or gradient magnitude exceeds 65504, the largest finite fp16 value. (bf16 shares fp32's exponent range and does not overflow at this magnitude.) See `stability-check.md` §3.
3. LR-schedule discontinuity: warmup ends suddenly, or a restart is misaligned.
4. Optimizer reset: a checkpoint was reloaded mid-run without restoring the `m`, `v` state, so the first ~10 steps after reload have un-calibrated moments.
Pattern 1 is the most likely one for §A13. Pattern 2 only appears in Phase 19's mixed-precision lab. Patterns 3 and 4 are easy to spot once you know to look at the LR and the moment-state panels respectively.
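The 65504 limit in the fp16 pattern can be checked without any ML framework, because Python's `struct` module supports the IEEE 754 half-precision format through the `e` format character. The helper names below are illustrative, and the check is only meant for finite inputs.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x):
    """Round-trip a float through half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x):
    """True if the finite value x can be stored in fp16 without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        # struct refuses to pack finite values that are too large for fp16.
        return False
```

`fits_fp16(65504.0)` holds while `fits_fp16(70000.0)` does not, which is the boundary an unscaled activation crosses in this failure pattern.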
When the template is hardest to fill¶
Spikes that don't recover (loss → NaN, stays NaN). For these:
- Symptom section (§1) ends with "did not recover" and the recovery loss is "NaN".
- Hypothesis section must include "destructive update wrote a NaN into a parameter that propagates forever". Check by scanning the parameter state after every step for non-finite values and identifying the first step where one appears. (A plain hash is not enough for this: it changes on every update, healthy or not.)
- Remediation must include reverting to the most recent NaN-free checkpoint, which assumes you were checkpointing every \(K\) steps. If you weren't, restart from scratch with a smaller LR.
This is the failure mode stability-check.md §4 addresses.
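Locating the first bad step and the checkpoint to revert to can be sketched as follows. The assumptions are labeled: parameter snapshots are available as (step, flat list of floats) pairs, checkpoints are written at multiples of K, and a checkpoint at step s holds the parameters after the update at step s. Both function names are hypothetical.

```python
import math


def first_nonfinite_step(param_history):
    """Return the first step whose parameters contain NaN or inf, else None.

    `param_history` is an iterable of (step, flat list of parameter values).
    """
    for step, params in param_history:
        if not all(math.isfinite(p) for p in params):
            return step
    return None


def last_clean_checkpoint(first_bad_step, every_k):
    """Latest checkpoint step strictly before `first_bad_step`.

    Assumes checkpoints are saved at steps 0, K, 2K, ... after the update.
    """
    return ((first_bad_step - 1) // every_k) * every_k
```

If the first non-finite value appears at step 350 and K is 100, the sketch points at the step-300 checkpoint; in practice it is worth confirming that checkpoint is itself finite before resuming.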
Citation¶
Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. Section 5.1 ("Training instability") documents loss spikes that occurred at PaLM scale despite gradient clipping, and the mitigation Google's team used: restarting from a checkpoint before the spike and skipping the data batches around it. Same shape, much larger numbers. Reading section 5.1 after this template gives the cross-scale view.
One-paragraph recap¶
A loss-spike post-mortem is a structured document: numerical symptoms, ranked hypotheses, discriminating panel evidence, single-sentence root cause, smallest-change remediation, verifiable test, one-line lesson. Walking the template forces the diagnosis from observation to action without skipping the "which evidence rules out which alternative" step that most ad-hoc debugging skips. For the §A13 grammar tutor, the most common cause is a rare BPE token over-represented in one batch; remediation is stratified batching + tighter clip threshold; the lesson goes into the learner's notebook for next time.
Cross-refs: theory/03-three-failure-modes.md (the three engineered breaks the dashboard signatures come from), stability-check.md (the runnable decision tree this template feeds into), Phase 18 theory/02-optimizer-and-schedule.md (the optimizer state the moments live in).