Phase 18 — Quizzes (mirror)
The canonical questions live in data/quizzes/phase-18-training-loop.yaml, which is the source of truth; the portal seeds quizzes from there. This page mirrors them in Markdown for quick reading without spinning up the portal.
q-18-01 — AdamW decoupling: where does λθ enter?
Prompt (EN): In AdamW, where does the weight-decay term λ·θ_{t-1} enter the per-step update?
- A. Added to the gradient g_t before the moment update.
- B. Added to the parameter update, alongside the bias-corrected m̂ / (√v̂ + ε) term.
- C. Multiplied into the learning-rate schedule.
- D. Subtracted from v_t to control variance.
Correct: B. The "W" in AdamW is decoupled weight decay — the term enters the update, not the gradient. Adding it to g_t is Adam-L2, a different algorithm.
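The difference between the two placements can be sketched on a single scalar parameter. This is a minimal illustration, not the course's reference optimizer: the function names, the default hyperparameters (β₁ = 0.9, β₂ = 0.999, lr = 1e-3, λ = 0.1), and the toy inputs are all assumptions chosen for readability.

```python
import math

def adam_moments(m, v, g, t, beta1=0.9, beta2=0.999):
    """Update the moment estimates and return them with bias-corrected copies."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m, v, m_hat, v_hat

def adamw_step(theta, g, m, v, t, lr=1e-3, lam=0.1, eps=1e-8):
    """Decoupled decay: lam * theta joins the update, not the gradient."""
    m, v, m_hat, v_hat = adam_moments(m, v, g, t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v

def adam_l2_step(theta, g, m, v, t, lr=1e-3, lam=0.1, eps=1e-8):
    """L2 regularization: lam * theta is folded into g before the moments."""
    g = g + lam * theta
    m, v, m_hat, v_hat = adam_moments(m, v, g, t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Same starting point, one step each (toy values, chosen for illustration).
theta_w, _, _ = adamw_step(theta=1.0, g=0.5, m=0.0, v=0.0, t=1)
theta_l2, _, _ = adam_l2_step(theta=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(theta_w, theta_l2)
```

In the L2 variant the decay term is normalized by √v̂ along with the gradient, so its effective strength varies per parameter; in the decoupled variant it shrinks θ at the fixed rate lr · λ regardless of gradient scale.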
q-18-02 — Warmup duration
Prompt (EN): With lr_max = 3e-4, warmup_steps = 100, what is the learning rate at step 25 (linear warmup)?
- A. 3e-4
- B. 1.5e-4
- C. 7.5e-5
- D. 0
Correct: C. Linear warmup at step t gives lr_max · t / W = 3e-4 · 25 / 100 = 7.5e-5.
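The arithmetic can be checked with a short sketch of the schedule. The function name warmup_lr is hypothetical, and holding the rate at lr_max after warmup is an assumption made only to keep the example small; a real schedule would usually decay afterwards.

```python
def warmup_lr(step, lr_max=3e-4, warmup_steps=100):
    """Linear warmup: ramp from 0 to lr_max over warmup_steps, then hold.

    Holding at lr_max after warmup is an assumption for this sketch.
    """
    if step >= warmup_steps:
        return lr_max
    return lr_max * step / warmup_steps

for step in (0, 25, 50, 100):
    print(step, warmup_lr(step))
```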
q-18-03 — Gradient clipping policy
Prompt (EN): Why use global L2-norm clipping instead of per-tensor norm clipping?
Free response. Expected mentions: preserves direction of the update; per-tensor changes the direction across parameters; global clip rescales uniformly.
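A small numeric sketch may help make the "preserves direction" point concrete. Gradients are modeled as plain lists of floats, and the helper names, the toy values, and the 1e-6 guard against division by zero are all assumptions for illustration.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(x * x for x in values))

def clip_global(grads, max_norm):
    """Scale every tensor by one shared factor based on the total norm."""
    total = l2_norm([x for g in grads for x in g])
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[x * scale for x in g] for g in grads]

def clip_per_tensor(grads, max_norm):
    """Scale each tensor by its own factor based on its own norm."""
    out = []
    for g in grads:
        scale = min(1.0, max_norm / (l2_norm(g) + 1e-6))
        out.append([x * scale for x in g])
    return out

grads = [[3.0, 4.0], [0.3, 0.4]]  # two toy "tensors" with norms 5.0 and 0.5
g_global = clip_global(grads, max_norm=1.0)
g_tensor = clip_per_tensor(grads, max_norm=1.0)

# Ratio between the first entries of the two tensors (10.0 in the raw gradient).
ratio_global = g_global[0][0] / g_global[1][0]
ratio_tensor = g_tensor[0][0] / g_tensor[1][0]
print(ratio_global, ratio_tensor)
```

Global clipping keeps the relative size of the two tensors unchanged, so the overall update points the same way. Per-tensor clipping shrinks only the large tensor, which changes the ratio between them and therefore the direction of the full update.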
q-18-04 — Which decay configurations are valid for an MLP block at §A13 scale?
Prompt (EN): Select every configuration that is reasonable for the AdamW param_groups of an MLP block at §A13 scale.
- A. Decay applied to Linear weights with λ = 0.1.
- B. Decay applied to LayerNorm scale parameters with λ = 0.1.
- C. Decay applied to bias vectors with λ = 0.1.
- D. No decay on biases or LN scale; λ = 0.1 on Linear / Embedding weights.
Correct: A, D. Decaying biases or LayerNorm scales (B, C) pulls them toward zero, which hurts expressiveness for no regularization benefit: these parameters are not where overfitting lives.
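Option D can be sketched as a helper that splits parameters into two groups. To stay dependency-free, this example works on parameter names and shapes rather than real tensors. The function name, the hypothetical MLP block, and the "1-D parameters skip decay" heuristic are assumptions, not something the quiz source prescribes.

```python
def split_decay_groups(named_shapes, weight_decay=0.1):
    """Return AdamW-style param groups: decay matrices, skip 1-D parameters."""
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        # Assumed heuristic: biases and LayerNorm scales are 1-D, weights are 2-D.
        (decay if len(shape) >= 2 else no_decay).append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical MLP block: names and shapes are made up for illustration.
mlp_block = {
    "ln.weight": (64,),
    "ln.bias": (64,),
    "fc_in.weight": (256, 64),
    "fc_in.bias": (256,),
    "fc_out.weight": (64, 256),
    "fc_out.bias": (64,),
}
groups = split_decay_groups(mlp_block)
for group in groups:
    print(group["weight_decay"], group["params"])
```

With a real model, the "params" lists would hold the parameter tensors themselves; torch.optim.AdamW accepts a list of such dictionaries with per-group weight_decay values.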
q-18-05 — Bias correction omission
Prompt (EN): A learner reports that "warmup is too aggressive — even at step 50 the updates look tiny". They are using lr_max = 3e-4, warmup = 100. What is the most likely bug?
- A. The warmup schedule is wrong.
- B. The optimizer is using m_t and v_t directly without the bias correction m̂_t = m_t / (1 − β₁^t).
- C. The clip threshold is too low.
- D. The dataloader is shuffled per-step and never reaches the same example twice.
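Correct: B. With β₁ = 0.9, the first step gives m_1 = (1 − β₁) · g_1 = 0.1 g_1, so without bias correction the earliest updates use a first moment at roughly 10% of its intended magnitude. The shortfall is a factor of 1 − β₁^t, so it is largest in the first ten to twenty steps and fades as t grows. It looks like aggressive warmup but is actually missing bias correction.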
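To see how quickly the bias in the first moment fades, here is a minimal sketch that tabulates the factor 1 − β₁^t relating the raw moving average to its bias-corrected value, since m_t = (1 − β₁^t) · m̂_t. The function name is hypothetical and β₁ = 0.9 is an assumption implied by the 0.1 g₁ figure; the sketch covers only m_t, not the second moment v_t.

```python
def uncorrected_fraction(t, beta=0.9):
    """Fraction of the bias-corrected moment that the raw EMA holds at step t."""
    return 1 - beta ** t

for t in (1, 5, 10, 50):
    print(t, round(uncorrected_fraction(t), 3))
```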