
Phase 19 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-19-training-dynamics.yaml. This file is the Markdown mirror for quick review.

q-19-01 — fp16 overflow signature

Prompt (EN): A run in fp16 with no loss scaling shows: forward pass produces finite activations, but grad-norm becomes inf at step 432. What is the most likely cause?

A. Bad init.
B. A gradient or intermediate value exceeded 65504 in the backward pass.
C. LR too high.
D. Optimizer state corruption.

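Correct: B. The largest finite fp16 value is 65504. The forward activations are finite, so the overflow must come from the backward pass, where a gradient or one of its intermediate values exceeded this limit. A suitably chosen loss scale keeps the gradients inside the representable range.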
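
The overflow signature can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not a model of any particular framework: `to_fp16` is a hypothetical helper that rounds through the `struct` half-precision format, and the activation and gradient values are made up for illustration.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values too large for fp16; real fp16
        # arithmetic would produce inf, so we return inf here.
        return math.copysign(math.inf, x)


# Hypothetical values: a finite activation and a finite upstream gradient.
activation = to_fp16(300.0)
upstream_grad = to_fp16(300.0)

# The backward pass multiplies them; 90000 no longer fits in fp16.
weight_grad = to_fp16(activation * upstream_grad)

# One inf entry is enough to make the global L2 norm inf.
other_grads = [0.01, -0.02, 0.005]
grad_norm = math.sqrt(sum(g * g for g in other_grads + [weight_grad]))

print(activation, weight_grad, grad_norm)
```

Both inputs are comfortably below `FP16_MAX`, which is why the forward pass looks healthy while the gradient norm is already inf.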

q-19-02 — Spike severity threshold

Prompt (EN): Using a rolling 100-step window of training loss, what is the standard threshold for flagging a "spike" in the dashboard?

Free response. Expected mentions: 3σ (or "3 sigma") above the rolling mean.
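
A rolling 3σ check can be sketched as follows. The function name `find_spikes` and the synthetic loss trace are illustrative assumptions, and excluding the current step from its own window is one possible design choice rather than something the quiz prescribes.

```python
from collections import deque
from statistics import mean, pstdev


def find_spikes(losses, window=100, n_sigma=3.0):
    """Return steps whose loss exceeds rolling mean + n_sigma * rolling std.

    The window holds the `window` losses *before* the current step, so a
    spike does not inflate its own baseline.
    """
    history = deque(maxlen=window)
    spikes = []
    for step, loss in enumerate(losses):
        if len(history) == window:
            threshold = mean(history) + n_sigma * pstdev(history)
            if loss > threshold:
                spikes.append(step)
        history.append(loss)
    return spikes


# Synthetic trace: loss hovering around 2.3 with one injected spike.
losses = [2.3 + 0.01 * ((i % 5) - 2) for i in range(150)]
losses[120] = 8.0
print(find_spikes(losses))
```

No step is flagged until the window is full, so the first 100 steps of a run are never reported under this sketch.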

q-19-03 — Recoverable vs persistent spike

Prompt (EN): A loss spike at step 500 brings loss from 2.3 to 8.0, then settles at 5.0 by step 600 and does not return to the pre-spike trajectory. What is the right first action?

A. Lower learning rate and continue.
B. Reload the last pre-spike checkpoint, then apply a preventative fix.
C. Increase weight decay.
D. Restart from scratch with new seed.

Correct: B. The persistent elevation suggests the optimizer's moment estimates have been corrupted, so continuing from the current state wastes steps. Reloading the last pre-spike checkpoint and applying a preventative fix is the cheapest path back.
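
One way to tell the two cases apart programmatically is to compare the loss some steps after the spike with the pre-spike baseline. The sketch below is an assumption-laden illustration: `spike_is_persistent`, the 50-step baseline, the 100-step settle period, and the 10% tolerance are all hypothetical choices, and the traces are synthetic.

```python
from statistics import mean


def spike_is_persistent(losses, spike_step, baseline_len=50,
                        settle_steps=100, tolerance=0.1):
    """True if the loss `settle_steps` after the spike is still clearly
    above the mean loss over the `baseline_len` steps before it."""
    baseline = mean(losses[spike_step - baseline_len:spike_step])
    settled = losses[spike_step + settle_steps]
    return settled > baseline * (1.0 + tolerance)


# Synthetic trace shaped like the prompt: 2.3 -> 8.0 at step 500,
# settling at 5.0 by step 600.
persistent_trace = [2.3] * 500 + [8.0 - 0.03 * i for i in range(100)] + [5.0] * 50

# Synthetic recoverable trace: a one-step spike that returns to 2.3.
recoverable_trace = [2.3] * 500 + [8.0] + [2.3] * 149

print(spike_is_persistent(persistent_trace, 500))
print(spike_is_persistent(recoverable_trace, 500))
```

A persistent result would point toward reloading the pre-spike checkpoint, while a recoverable one may only need monitoring.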

q-19-04 — Which signals are reliable for spike root-cause analysis?

Prompt (EN): Select every dashboard panel/signal that is informative for distinguishing a long-tail-token spike from an LR-schedule-induced spike.

A. Pre-clip gradient norm at the spike step.
B. LR value over the window around the spike.
C. Batch composition log at the spike step.
D. Final eval perplexity.

Correct: A, B, C. Pre-clip grad norm shows whether one batch was anomalously costly; LR panel shows schedule continuity; batch composition reveals long-tail concentration. Final eval PPL is too far downstream to discriminate causes.

q-19-05 — Long-tail concentration mechanism

Prompt (EN): In one or two sentences, explain why a single batch with three sentences containing a rare BPE token can produce a 30× spike in global gradient norm at §A13 scale.

Free response. Expected mentions: rare embedding row sees a disproportionate share of the batch's loss; gradient on that row is large; global L2 norm is dominated by that single row.
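
The arithmetic behind "dominated by that single row" can be checked with a small worked example. All numbers below are hypothetical and chosen only to land near the 30× figure in the prompt; they are not measurements from any run.

```python
import math


def global_l2_norm(row_norms):
    """Global L2 norm computed from per-row L2 norms."""
    return math.sqrt(sum(n * n for n in row_norms))


# Hypothetical: 1000 embedding rows with small, similar gradient norms.
typical_rows = [0.03] * 1000
baseline = global_l2_norm(typical_rows)

# Same batch, plus one rare-token row carrying a large gradient.
rare_row = 30.0
spiked = global_l2_norm(typical_rows + [rare_row])

spike_ratio = spiked / baseline          # how much the global norm grew
rare_share = rare_row ** 2 / spiked ** 2  # rare row's share of squared norm

print(round(spike_ratio, 1), round(rare_share, 4))
```

Because the global norm sums squares, a single large row contributes almost all of the total, and the thousand typical rows barely register.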