Phase 19 — Quizzes (mirror)
The canonical questions live in
data/quizzes/phase-19-training-dynamics.yaml. This file is the Markdown mirror for quick review.
q-19-01 — fp16 overflow signature
Prompt (EN): A run in fp16 with no loss scaling shows: forward pass produces finite activations, but grad-norm becomes inf at step 432. What is the most likely cause?
- A. Bad init.
- B. An intermediate gradient value exceeded 65504 in the backward pass.
- C. LR too high.
- D. Optimizer state corruption.
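Correct: B. The fp16 maximum is 65504. The forward activations stayed finite, so the overflow happened in the backward pass, where an intermediate gradient value exceeded that limit and became inf. Loss scaling that backs off when it detects an overflow keeps the gradients within the representable range.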
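
The overflow signature can be reproduced without a GPU. The sketch below is only an illustration: it uses Python's `struct` half-precision format to emulate fp16 rounding, and the helper `to_fp16` and all gradient values are made up for the example rather than taken from the run in the question.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite fp16 value


def to_fp16(x: float) -> float:
    """Round x to fp16; values too large for fp16 become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; treat that as overflow to inf
        return math.copysign(math.inf, x)


# Two finite fp16 values (illustrative numbers) multiplied by the chain rule.
upstream_grad = to_fp16(300.0)
local_derivative = to_fp16(250.0)

# 300 * 250 = 75000, which is above FP16_MAX, so the result overflows.
overflowed = to_fp16(upstream_grad * local_derivative)

# Hypothetical scale factor: shrinking the upstream gradient keeps the product finite.
scale = 0.5
rescued = to_fp16(to_fp16(upstream_grad * scale) * local_derivative)

print(overflowed, rescued)
```

Both inputs are finite, which matches the "finite forward pass, inf grad-norm" pattern: a single overflowed element is enough to make the global gradient norm inf.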
q-19-02 — Spike severity threshold
Prompt (EN): Using a rolling 100-step window of training loss, what is the standard threshold for flagging a "spike" in the dashboard?
Free response. Expected mentions: 3σ (or "3 sigma") above rolling mean.
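
A minimal sketch of that rule, assuming the window holds the previous 100 losses and excludes the current step. The function name `find_spikes` and the synthetic loss curve are hypothetical and not part of the canonical quiz data.

```python
from collections import deque
from statistics import mean, pstdev


def find_spikes(losses, window=100, n_sigma=3.0):
    """Return step indices whose loss exceeds rolling mean + n_sigma * rolling std.

    The rolling statistics use the previous `window` losses only, so no step
    is flagged until the window is full.
    """
    history = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        if len(history) == window:
            threshold = mean(history) + n_sigma * pstdev(history)
            if loss > threshold:
                spikes.append(step)
        history.append(loss)
    return spikes


# Synthetic curve: small periodic noise around 2.3 with one injected spike.
losses = [2.3 + 0.01 * ((i % 5) - 2) for i in range(200)]
losses[150] = 8.0
spike_steps = find_spikes(losses)
print(spike_steps)
```

One design consequence worth noting: the spike itself enters the window afterwards, which inflates the rolling standard deviation and makes the detector less sensitive for the next 100 steps.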
q-19-03 — Recoverable vs persistent spike
Prompt (EN): A loss spike at step 500 brings loss from 2.3 to 8.0, then settles at 5.0 by step 600 and does not return to the pre-spike trajectory. What is the right first action?
- A. Lower learning rate and continue.
- B. Reload the last pre-spike checkpoint, then apply a preventative fix.
- C. Increase weight decay.
- D. Restart from scratch with new seed.
Correct: B. The persistent elevation means the optimizer's moments have been corrupted; continuing wastes steps. Reload + preventative fix is the cheapest path back.
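
One way to make the recoverable/persistent distinction mechanical is to compare the loss some steps after the spike against a pre-spike baseline. This is a rough sketch under stated assumptions: the function `classify_spike`, the flat baseline, the 100-step settle horizon, and the 10% tolerance are all hypothetical choices, and a real pre-spike trajectory would usually still be decreasing.

```python
from statistics import mean


def classify_spike(losses, spike_step, baseline_window=100,
                   settle_after=100, tolerance=0.1):
    """Label a spike 'recoverable' or 'persistent'.

    baseline: mean loss over the `baseline_window` steps before the spike.
    settled:  loss `settle_after` steps after the spike.
    The spike is 'persistent' if settled loss is still more than
    `tolerance` (relative) above the baseline.
    """
    if spike_step < baseline_window:
        raise ValueError("not enough pre-spike history")
    if spike_step + settle_after >= len(losses):
        raise ValueError("not enough post-spike history")
    baseline = mean(losses[spike_step - baseline_window:spike_step])
    settled = losses[spike_step + settle_after]
    return "persistent" if settled > baseline * (1 + tolerance) else "recoverable"


# Scenario from the prompt: 2.3 before, 8.0 at step 500, about 5.0 by step 600.
persistent_run = [2.3] * 500 + [8.0 - 0.03 * i for i in range(101)]
# Contrast case: the loss returns to 2.3 right after the spike.
recoverable_run = [2.3] * 500 + [8.0] + [2.3] * 100

print(classify_spike(persistent_run, 500), classify_spike(recoverable_run, 500))
```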
q-19-04 — Which signals are reliable for spike root-cause analysis?
Prompt (EN): Select every dashboard panel/signal that is informative for distinguishing a long-tail-token spike from an LR-schedule-induced spike.
- A. Pre-clip gradient norm at the spike step.
- B. LR value over the window around the spike.
- C. Batch composition log at the spike step.
- D. Final eval perplexity.
Correct: A, B, C. Pre-clip grad norm shows whether one batch was anomalously costly; LR panel shows schedule continuity; batch composition reveals long-tail concentration. Final eval PPL is too far downstream to discriminate causes.
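
The reason the panel tracks the pre-clip norm can be seen in a small sketch: after global-norm clipping, every oversized batch reports the same capped value, so only the pre-clip number shows how anomalous the batch was. The helper names, the gradient values, and `max_norm=1.0` below are illustrative assumptions.

```python
import math


def global_norm(grads):
    """Global L2 norm over a list of gradient rows."""
    return math.sqrt(sum(g * g for row in grads for g in row))


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their global L2 norm is at most max_norm.

    Returns (clipped_grads, pre_clip_norm).
    """
    pre_clip_norm = global_norm(grads)
    scale = min(1.0, max_norm / pre_clip_norm) if pre_clip_norm > 0 else 1.0
    clipped = [[g * scale for g in row] for row in grads]
    return clipped, pre_clip_norm


typical = [[0.3, -0.4], [0.0, 0.0]]     # global norm 0.5, below the clip threshold
anomalous = [[9.0, -12.0], [0.0, 0.0]]  # global norm 15.0, far above it

clipped_typical, pre_typical = clip_by_global_norm(typical, max_norm=1.0)
clipped_anomalous, pre_anomalous = clip_by_global_norm(anomalous, max_norm=1.0)

# Post-clip the anomalous batch reports ~1.0; pre-clip it reports 15.0.
print(pre_typical, global_norm(clipped_typical))
print(pre_anomalous, global_norm(clipped_anomalous))
```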
q-19-05 — Long-tail concentration mechanism
Prompt (EN): In one or two sentences, explain why a single batch with three sentences containing a rare BPE token can produce a 30× spike in global gradient norm at §A13 scale.
Free response. Expected mentions: rare embedding row sees a disproportionate share of the batch's loss; gradient on that row is large; global L2 norm is dominated by that single row.
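
The arithmetic behind "dominated by that single row" can be sketched with per-row gradient norms. The numbers below are invented to land near a 30× jump and are not measurements from the §A13 setup; `norm_breakdown` is a hypothetical helper.

```python
import math


def norm_breakdown(row_norms):
    """Return (global L2 norm, share of squared norm held by the largest row)."""
    squared = [n * n for n in row_norms]
    total = math.fsum(squared)
    return math.sqrt(total), max(squared) / total


# Illustrative baseline: 10,000 embedding rows with small, equal gradient norms.
common_rows = [0.01] * 10_000
baseline_norm, _ = norm_breakdown(common_rows)

# Same batch plus one rare-token row that absorbed a large share of the loss.
rare_row_norm = 30.0
spiked_norm, rare_share = norm_breakdown(common_rows + [rare_row_norm])

ratio = spiked_norm / baseline_norm
print(f"baseline={baseline_norm:.3f} spiked={spiked_norm:.3f} "
      f"ratio={ratio:.1f}x rare_row_share={rare_share:.4f}")
```

Because row norms combine in quadrature, the 10,000 common rows contribute almost nothing once one row is much larger than the rest, so the global norm is approximately the norm of the rare row.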