

Phase 19 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-19-training-dynamics.yaml. This file is the Markdown mirror for quick review.


q-19-01 — fp16 overflow signature

Prompt (EN): A run in fp16 with no loss scaling shows: forward pass produces finite activations, but grad-norm becomes inf at step 432. What is the most likely cause?

  • A. Bad init.
  • B. An intermediate gradient value exceeded 65504 in the backward pass.
  • C. LR too high.
  • D. Optimizer state corruption.

Correct: B. The largest finite fp16 value is 65504; the backward pass produced intermediate gradient values that exceeded it and overflowed to inf. Loss scaling, with the scale adjusted dynamically, keeps the gradients within the representable range.
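The overflow can be reproduced without any ML framework. The sketch below emulates fp16 rounding with Python's `struct` half-precision format; the helper `to_fp16`, the gradient values, and the scale factor are illustrative assumptions, not values from the run described in the question.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite fp16 value


def to_fp16(x: float) -> float:
    """Round x to fp16; finite values beyond the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range finite values, so we
        # emulate the hardware behaviour of overflowing to infinity.
        return math.copysign(math.inf, x)


# Hypothetical backward-pass step: both factors are finite in fp16,
# but their product (90000) is larger than 65504.
upstream_grad = to_fp16(300.0)
local_deriv = to_fp16(300.0)
unscaled = to_fp16(upstream_grad * local_deriv)  # overflows to inf

# With an (assumed) scale factor applied upstream, the same product fits.
scale = 1.0 / 128.0
scaled = to_fp16(to_fp16(upstream_grad * scale) * local_deriv)

print(unscaled, scaled)
```

Any single inf entry makes the global gradient norm inf, which matches the signature in the prompt: finite forward activations, infinite grad-norm.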


q-19-02 — Spike severity threshold

Prompt (EN): Using a rolling 100-step window of training loss, what is the standard "spike" threshold you'd flag in the dashboard?

Free response. Expected mentions: 3σ (or "3 sigma") above the rolling mean.
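A minimal sketch of such a detector, assuming the window covers the 100 steps preceding the current one and that the population standard deviation is used; the function name `find_spikes` and its parameters are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def find_spikes(losses, window=100, n_sigma=3.0):
    """Return steps whose loss exceeds rolling mean + n_sigma * rolling std.

    The statistics are computed over the `window` steps before the
    current step, so a spike never inflates its own threshold.
    """
    history = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        if len(history) == window:  # wait until the window is full
            threshold = mean(history) + n_sigma * pstdev(history)
            if loss > threshold:
                spikes.append(step)
        history.append(loss)
    return spikes


# Synthetic loss curve with small periodic noise and one injected spike.
losses = [2.0 + 0.01 * (i % 5) for i in range(200)]
losses[150] = 8.0
print(find_spikes(losses))
```

Note that in this sketch a flagged value still enters the window afterwards and widens the standard deviation for the next 100 steps; a dashboard may prefer to exclude flagged steps from the rolling statistics.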


q-19-03 — Recoverable vs persistent spike

Prompt (EN): A loss spike at step 500 brings loss from 2.3 to 8.0, then settles at 5.0 by step 600 and does not return to the pre-spike trajectory. What is the right first action?

  • A. Lower learning rate and continue.
  • B. Reload the last pre-spike checkpoint, then apply a preventative fix.
  • C. Increase weight decay.
  • D. Restart from scratch with new seed.

Correct: B. The persistent elevation suggests the spike corrupted the optimizer's moment estimates, so continuing wastes steps. Reloading the checkpoint and applying a preventative fix is the cheapest path back.
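The recoverable/persistent distinction can be checked mechanically. The sketch below compares the loss some steps after the spike with the pre-spike baseline; the function name and the `settle_steps`, `baseline_steps`, and `tolerance` values are assumptions chosen for illustration, and a flat baseline is a simplification of "the pre-spike trajectory".

```python
from statistics import mean


def classify_spike(losses, spike_step, settle_steps=100,
                   baseline_steps=50, tolerance=0.1):
    """Label a spike 'recoverable' or 'persistent'.

    Compares the loss `settle_steps` after the spike against the mean
    loss over the `baseline_steps` steps just before it.
    """
    baseline = mean(losses[spike_step - baseline_steps:spike_step])
    settled = losses[spike_step + settle_steps]
    if settled <= baseline * (1.0 + tolerance):
        return "recoverable"
    return "persistent"


# The scenario from the prompt: 2.3 before, 8.0 at step 500, 5.0 at step 600.
persistent_run = [2.3] * 500 + [8.0 - 0.03 * i for i in range(101)]
# A contrasting run where the loss returns to 2.3 right after the spike.
recovering_run = [2.3] * 500 + [8.0] + [2.3] * 100

print(classify_spike(persistent_run, spike_step=500))
print(classify_spike(recovering_run, spike_step=500))
```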


q-19-04 — Which signals are reliable for spike root-cause analysis?

Prompt (EN): Select every dashboard panel/signal that is informative for distinguishing a long-tail-token spike from an LR-schedule-induced spike.

  • A. Pre-clip gradient norm at the spike step.
  • B. LR value over the window around the spike.
  • C. Batch composition log at the spike step.
  • D. Final eval perplexity.

Correct: A, B, C. Pre-clip grad norm shows whether one batch produced an anomalously large gradient; the LR panel shows whether the schedule was continuous around the spike; batch composition reveals long-tail concentration. Final eval PPL is too far downstream to discriminate between causes.
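The reason the pre-clip norm matters is that the post-clip norm is capped and hides the size of the spike. A minimal sketch of global-norm clipping that also returns the pre-clip value, using plain Python lists as stand-ins for parameter gradients (the function name and the numbers are hypothetical):

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their global L2 norm is at most max_norm.

    Returns (clipped_grads, pre_clip_norm) so the caller can log the
    norm as it was before clipping.
    """
    pre_clip_norm = math.sqrt(sum(g * g for row in grads for g in row))
    scale = min(1.0, max_norm / pre_clip_norm) if pre_clip_norm > 0 else 1.0
    clipped = [[g * scale for g in row] for row in grads]
    return clipped, pre_clip_norm


def l2(grads):
    return math.sqrt(sum(g * g for row in grads for g in row))


normal_batch = [[3.0, 4.0]]    # global norm 5
spike_batch = [[30.0, 40.0]]   # global norm 50

clipped_normal, pre_normal = clip_by_global_norm(normal_batch, max_norm=1.0)
clipped_spike, pre_spike = clip_by_global_norm(spike_batch, max_norm=1.0)

# After clipping both batches look identical; only the pre-clip norm
# shows that the second batch was 10x larger.
print(pre_normal, pre_spike, l2(clipped_normal), l2(clipped_spike))
```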


q-19-05 — Long-tail concentration mechanism

Prompt (EN): In one or two sentences, explain why a single batch with three sentences containing a rare BPE token can produce a 30× spike in global gradient norm at §A13 scale.

Free response. Expected mentions: rare embedding row sees a disproportionate share of the batch's loss; gradient on that row is large; global L2 norm is dominated by that single row.
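A small numeric illustration of the last point: because the global L2 norm sums squared per-row norms, one large row swamps many small ones. The row count and per-row norms below are made-up values for illustration, not measurements from any run.

```python
import math


def global_norm(row_norms):
    """Global L2 norm from a list of per-row gradient norms."""
    return math.sqrt(sum(n * n for n in row_norms))


# Hypothetical typical step: 100 rows, each with gradient norm 0.1.
typical = [0.1] * 100
# Same step, but one rare-token embedding row receives a norm of 30.
spiky = typical[:-1] + [30.0]

ratio = global_norm(spiky) / global_norm(typical)
rare_share = 30.0 ** 2 / sum(n * n for n in spiky)

print(f"spike ratio: {ratio:.1f}x, rare row share of squared norm: {rare_share:.4f}")
```

In this toy setup the single row accounts for almost all of the squared norm, so the global norm rises by roughly the ratio of that row's norm to the typical global norm.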