01 — What to instrument: the seven panels¶

Seven panels, no more. Each one answers a concrete question about training health. Fewer panels leave you blind to failures; more panels add noise and overhead.

The dashboard has seven panels. Each answers one specific question. Adding an eighth panel means you've justified another question worth answering at every training step. Removing one means you've decided you don't need its question answered. Both decisions are revisable, but at the time of writing, seven is correct.

Panel 1: Loss curves (train + val)¶

Question: Is the model learning?

X-axis: training step (linear). Y-axis: loss (log scale, because the early steps span an order of magnitude). Two lines: train (logged every log_every steps) and val (evaluated every val_every steps and held constant between evaluations, i.e. drawn as a step function). Annotation: Phase-14 n-gram baseline as a horizontal dashed line on the val axis.

Signatures:

- Healthy: both curves drop monotonically, train slightly below val, both below baseline at the end.
- Bug #2 (no warmup): train curve has a spike to 8-10 within the first 100 steps, then drops.
- Bug #3 (broken mask): train curve drops below baseline within 200 steps; val curve flat or barely moves.
- Overfitting: train and val curves diverge after some step \(t^*\), train continues down, val plateaus or rises.
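
One rough way to estimate the divergence step \(t^*\) from the logged val losses is to take the step of the best val loss once several later evaluations have failed to improve on it. The sketch below assumes that reading; the function name `divergence_step` and the `patience` parameter are hypothetical and not part of the course code.

```python
def divergence_step(val_steps, val_losses, patience=3):
    """Return the step of the best val loss if at least `patience` later
    evaluations failed to improve on it, otherwise None.

    Assumption: "val plateaus or rises" is read as "no new minimum for
    `patience` consecutive evaluations". This is a heuristic, not a test.
    """
    if not val_losses:
        return None
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    evals_since_best = len(val_losses) - 1 - best_idx
    return val_steps[best_idx] if evals_since_best >= patience else None


# Illustrative numbers only, not measurements from a real run.
steps = [0, 500, 1000, 1500, 2000, 2500]
overfit = [3.0, 2.0, 1.5, 1.6, 1.7, 1.8]
healthy = [3.0, 2.0, 1.5, 1.2, 1.0, 0.9]
t_star = divergence_step(steps, overfit)
no_divergence = divergence_step(steps, healthy)
```

A small `patience` reacts quickly but will also fire on ordinary val-loss noise, so the value is worth tuning against val_every.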

Panel 2: LR schedule¶

Question: Did warmup and cosine decay actually run?

X-axis: step. Y-axis: learning rate (linear scale).

Why this exists: the most common silent training bug is the scheduler being mis-wired and the model training at one fixed LR for the entire run. The LR schedule panel makes this impossible to miss — a wrong scheduler shows a flat line where the cosine should be.

Signatures:

- Healthy: linear ramp from 0 to lr_max over warmup_steps, then smooth cosine decay to lr_min by total_steps.
- Bug #2 (no warmup): line is flat at lr_max from step 0.
- Wrong total_steps: cosine reaches its minimum before training ends, and the last N steps run at lr_min (wasted).
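
The healthy shape can be written down as a closed-form function of the step, which is useful as a reference line to overlay on the logged LR. This is a minimal sketch of a linear-warmup plus cosine-decay schedule using the parameter names from the text (lr_max, lr_min, warmup_steps, total_steps); the functions `lr_at` and `is_flat` are hypothetical helpers, and the course's actual scheduler may differ in details such as whether step 0 is exactly 0.

```python
import math


def lr_at(step, lr_max, lr_min, warmup_steps, total_steps):
    """Reference LR: linear ramp 0 -> lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    # Fraction of the decay phase completed, clamped so that steps past
    # total_steps stay at lr_min.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def is_flat(lrs, rel_tol=1e-9):
    """True if the logged LR never moved: the mis-wired-scheduler signature."""
    return max(lrs) - min(lrs) <= rel_tol * max(abs(x) for x in lrs)


# Illustrative hyperparameters, not the course's locked values.
reference = [lr_at(s, 3e-4, 3e-5, 100, 1000) for s in range(0, 1001, 100)]
flat_run = [3e-4] * len(reference)
```

Comparing the logged series against `reference` point by point catches the "wrong total_steps" case as well, because the logged curve bottoms out earlier than the reference does.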

Panel 3: Gradient norm history¶

Question: Are the gradients well-behaved?

X-axis: step. Y-axis: global \(\|g\|_2\) (log scale). Annotation: clipping threshold c as a horizontal line.

Signatures:

- Healthy: \(\|g\|_2\) starts at ~5-10, drops to ~0.1-1.0 over training, stays below the clip threshold most of the time, with occasional spikes near c (clipping engages).
- Bug #1 (bad init): \(\|g\|_2\) explodes to \(10^3\)+ within a few steps, then NaN.
- No clipping engagement at all during a run: either the clip threshold is too high to be useful, or some pathology has already suppressed the gradients.
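
The quantity plotted here is the L2 norm of all gradients concatenated into one vector, and clipping rescales every gradient by the same factor when that norm exceeds c. The sketch below shows both in plain Python over nested lists so the arithmetic is visible; the function names are hypothetical. In a PyTorch training loop, `torch.nn.utils.clip_grad_norm_` does the equivalent work and returns the pre-clip norm, which is the value worth logging.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient entries; `grads` holds one flat list per tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads, c):
    """Scale all gradients by min(1, c / norm). Returns (clipped, pre_clip_norm)."""
    norm = global_grad_norm(grads)
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, norm


# Toy gradients for two "tensors"; the pre-clip norm is sqrt(3^2 + 4^2) = 5.
clipped, pre_clip_norm = clip_by_global_norm([[3.0], [4.0]], c=1.0)
clip_engaged = pre_clip_norm > 1.0
```

Logging `clip_engaged` alongside the norm makes the third signature easy to check: a run where it is never true has a threshold that did no work.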

Panel 4: Per-layer activation magnitudes¶

Question: Are the activations balanced across depth?

X-axis: step (down-sampled). Y-axis: mean magnitude per layer. Lines: one per layer (embed_out, block_0_out, block_1_out, final_ln_out) — Phase 17's locked n_layers = 2.

Signatures:

- Healthy: all layers' magnitudes are within ~3× of each other (residual stream is roughly constant in scale; that's what LayerNorm enforces).
- Bug #1 (bad init): deep layers' magnitudes grow exponentially with depth. Visible at step 0 (before any training).
- A "dead layer": its magnitude is near zero from the start (init bug specific to that layer).
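
The "within ~3×" rule reduces to one number per step: the ratio of the largest to the smallest per-layer magnitude. The sketch below assumes "mean magnitude" means mean absolute activation, which the text does not pin down, and uses hypothetical helper names; the layer keys are the four lines listed for this panel.

```python
def mean_abs(values):
    """Mean absolute value of a flat list of activations (assumed metric)."""
    return sum(abs(v) for v in values) / len(values)


def balance_ratio(layer_acts):
    """Return (max/min ratio of per-layer magnitudes, per-layer magnitudes).

    A dead layer (magnitude 0) yields an infinite ratio.
    """
    mags = {name: mean_abs(vals) for name, vals in layer_acts.items()}
    lo, hi = min(mags.values()), max(mags.values())
    ratio = hi / lo if lo > 0 else float("inf")
    return ratio, mags


# Toy activations, flattened per layer; values are illustrative only.
acts = {
    "embed_out": [1.0, -1.0],
    "block_0_out": [2.0, -2.0],
    "block_1_out": [1.5, -1.5],
    "final_ln_out": [1.0, -1.0],
}
ratio, mags = balance_ratio(acts)
balanced = ratio <= 3.0
```

Running this on a batch at step 0, before any optimizer update, is enough to expose the bad-init signature.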

Panel 5: Per-layer weight spectral norms¶

Question: Are the weight matrices growing pathologically?

X-axis: step. Y-axis: spectral norm \(\sigma_1(W)\) per layer (log scale). Lines: one per major matrix (attention QKV, FFN up, FFN down, output projection).

Signatures:

- Healthy: spectral norms creep up over training, never more than ~2× the initial value.
- Bug #1 (bad init): spectral norms start ~100× higher than expected.
- Weight decay missing: spectral norms grow without bound; model overfits via parameter magnitude.
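
The spectral norm \(\sigma_1(W)\) is the largest singular value of W, and power iteration on \(W^\top W\) gives a cheap estimate of it. The sketch below is a plain-Python illustration with hypothetical function names, not necessarily how the lab computes it; with NumPy available, `numpy.linalg.norm(W, 2)` returns the same quantity for a 2-D array.

```python
import math
import random


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def spectral_norm(W, n_iters=100, seed=0):
    """Estimate sigma_1(W) by power iteration on W^T W."""
    rng = random.Random(seed)
    # Random start so v is not orthogonal to the top singular vector
    # (except with probability zero).
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    Wt = [list(col) for col in zip(*W)]
    for _ in range(n_iters):
        z = matvec(Wt, matvec(W, v))  # z = (W^T W) v
        norm_z = math.sqrt(sum(x * x for x in z))
        if norm_z == 0.0:
            return 0.0  # W maps v to zero; treat as the zero matrix
        v = [x / norm_z for x in z]
    # For unit-norm v aligned with the top right singular vector, |W v| = sigma_1.
    return math.sqrt(sum(x * x for x in matvec(W, v)))


# Toy matrices; the growth ratio is what the "~2x the initial value" rule checks.
sigma_init = spectral_norm([[3.0, 0.0], [0.0, 4.0]])
sigma_now = spectral_norm([[3.0, 0.0], [0.0, 6.0]])
growth = sigma_now / sigma_init
```

Convergence slows when the two largest singular values are close, so the iteration count may need raising for real weight matrices.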

Panel 6: Dead neurons / dead heads over time¶

Question: Is the model using its full capacity?

X-axis: step. Y-axis: count of dead neurons (or dead attention heads).

Definitions:

- Dead neuron: an FFN hidden unit where |activation| < ε across >99% of the batch.
- Dead head: an attention head where the attention entropy \(H(\mathrm{softmax}(QK^\top/\sqrt{d}))\) is below \(\log 2\) (i.e., the attention is concentrated on one key) across >99% of queries.

Signatures:

- Healthy: a few dead neurons (5-10% of FFN hidden units) are normal; dead-head count is usually 0 in a well-trained tiny model.
- Bug #1 (bad init): nearly all neurons saturated or dead from step 0.
- A "stuck" head: stays dead from step 0 through end; the head is wasted.
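
Both definitions translate directly into counting rules. The sketch below applies them to plain nested lists; the function names are hypothetical, and the default `eps=1e-6` is an assumed placeholder because the text does not give a value for ε.

```python
import math


def dead_neuron_count(acts, eps=1e-6, frac=0.99):
    """Count hidden units with |activation| < eps on more than `frac` of examples.

    `acts` is a list of rows, one per batch example, each a list over hidden units.
    """
    n_examples = len(acts)
    dead = 0
    for j in range(len(acts[0])):
        quiet = sum(1 for row in acts if abs(row[j]) < eps)
        if quiet / n_examples > frac:
            dead += 1
    return dead


def attention_entropy(probs):
    """Shannon entropy (in nats) of one softmax row; zero entries contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def is_dead_head(attn_rows, frac=0.99):
    """True if more than `frac` of queries have entropy below log 2."""
    low = sum(1 for row in attn_rows if attention_entropy(row) < math.log(2))
    return low / len(attn_rows) > frac


# Toy data: unit 0 is silent on every example, unit 1 is active.
n_dead = dead_neuron_count([[0.0, 1.0], [0.0, 0.5], [0.0, -2.0]])
one_hot_head = is_dead_head([[1.0, 0.0, 0.0, 0.0]] * 4)
uniform_head = is_dead_head([[0.25, 0.25, 0.25, 0.25]] * 4)
```

Note that a row split evenly over exactly two keys has entropy equal to log 2, so it sits on the boundary and is not counted as low-entropy under the strict inequality.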

Panel 7: Per-verb-class loss decomposition (the §A13 panel)¶

Question: Is the model learning the regular pattern, or memorizing individual forms?

X-axis: step. Y-axis: loss (log scale). Two lines:

- loss_regular: mean loss over examples whose verb is one of the 12 regulars (work, play, walk, talk, listen, watch, study, finish, start, look, want, like).
- loss_irregular: mean loss over examples whose verb is one of the 8 irregulars (be, have, do, go, come, see, eat, write).

Signatures:

- Healthy: loss_regular drops faster (the regular -ed rule is one shared abstraction); loss_irregular lags by ~10-30% in loss space; the gap stabilizes around step ~1500.
- Memorization: the two curves are tangled, with neither leading and neither lagging. The model is treating every form independently.
- Bug #3 (broken mask): both curves drop to near-zero unreasonably fast, because the model is seeing the future token. The gap between them is near zero, not because the model learned the pattern but because it doesn't need to; it just copies.
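
The decomposition itself is a grouped mean over per-example losses. The sketch below uses the two verb lists from the text and assumes each example arrives as a (verb lemma, loss) pair, which is a hypothetical input format; the relative gap at the end is one possible reading of "lags by ~10-30% in loss space".

```python
REGULAR = {"work", "play", "walk", "talk", "listen", "watch",
           "study", "finish", "start", "look", "want", "like"}
IRREGULAR = {"be", "have", "do", "go", "come", "see", "eat", "write"}


def class_losses(examples):
    """Mean loss per verb class from (verb_lemma, loss) pairs.

    Verbs in neither list are ignored; an empty class yields NaN.
    """
    totals = {"regular": [0.0, 0], "irregular": [0.0, 0]}
    for verb, loss in examples:
        if verb in REGULAR:
            key = "regular"
        elif verb in IRREGULAR:
            key = "irregular"
        else:
            continue
        totals[key][0] += loss
        totals[key][1] += 1
    return {f"loss_{k}": (s / n if n else float("nan")) for k, (s, n) in totals.items()}


# Illustrative per-example losses, not real measurements.
out = class_losses([("walk", 1.0), ("talk", 3.0), ("go", 4.0), ("cat", 9.0)])
relative_gap = (out["loss_irregular"] - out["loss_regular"]) / out["loss_regular"]
```

Logging `relative_gap` as a third series makes the "tangled" and "no gap" signatures easier to read than eyeballing two log-scale curves.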

This panel is the one most specific to the §A13 topic. Without it, the dashboard works for any LM; with it, the dashboard is uniquely informative for this corpus.

What we deliberately did NOT instrument¶

Per-tensor gradient stats beyond the global norm. Storing per-tensor norms every step is ~25 KB/step for the 2-layer Mini-GPT; over 8000 steps, that's 200 MB. The per-layer view in Panel 4 + the global view in Panel 3 cover 95% of diagnostic needs.
Weight histograms. Useful for some bugs, but: (a) noisy to plot at every step, (b) screen real-estate cost is high, (c) the spectral-norm panel catches most weight-pathology bugs.
Loss-landscape probing. Would require evaluating loss at many parameter perturbations per step — 10-100× the cost. Out of Phase-19 scope.
Hyperparameter sensitivity charts. Sweep territory — Phase 28.

The reading order on the dashboard¶

When you open a dashboard for a failed run, scan in this order:

Panel 1: Did the loss go down? If no → look at panel 3 (gradients), then panel 2 (LR).
Panel 3: Did the gradients explode/vanish? If yes → bug #1 (bad init) or numerical pathology.
Panel 2: Did the LR schedule run? If no → bug #2 (no warmup).
Panel 1, train vs val: Did val track train? If train drops but val stays flat → bug #3 (broken mask) or data leak.
Panels 4-6: Are layers balanced and alive? If layer 1 is dead while layer 0 is fine → layer-specific init or init-scale problem.
Panel 7: Did the model learn the regular pattern? If irregular loss collapses faster than regular → data leak.

This order surfaces the three engineered breaks in three different panels: Panel 3 (bug #1), Panel 2 (bug #2), Panel 1+7 (bug #3). Memorize this order; the lab's three diagnoses depend on it.
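
The scan order above can be encoded as a first-match decision chain, which is a convenient way to rehearse it. This is a sketch only: the `triage` function and its observation keys are hypothetical names, and each boolean stands for a judgment you make by reading the corresponding panel.

```python
def triage(obs):
    """Walk the panels in reading order and return the first matching diagnosis.

    `obs` is a dict of booleans (hypothetical keys); missing keys mean
    "nothing suspicious seen on that panel".
    """
    if obs.get("grad_exploded_or_vanished", False):      # Panel 3
        return "bug #1 (bad init) or numerical pathology"
    if not obs.get("lr_schedule_ran", True):             # Panel 2
        return "bug #2 (no warmup)"
    if obs.get("train_drops_val_flat", False):           # Panel 1, train vs val
        return "bug #3 (broken mask) or data leak"
    if obs.get("layer_dead_or_unbalanced", False):       # Panels 4-6
        return "init or init-scale problem"
    if obs.get("irregular_collapses_faster", False):     # Panel 7
        return "data leak"
    return "no known signature matched"


diagnosis = triage({"lr_schedule_ran": False})
```

Because the chain returns on the first match, an exploding gradient norm takes priority over a flat LR line, mirroring the order in which the panels are read.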

Drill problems¶

The dashboard shows a flat LR line at lr_max from step 0. Which bug? Which panel told you?
The train and val loss curves drop in lockstep through step 3000, then val flattens while train continues. Bug or feature?
The regular-vs-irregular panel shows both curves collapsing to near-zero by step 200, with no gap. Which bug?
Panel 4 shows layer 1's activations 30× larger than layer 0's by step 100. What's happening, and what panel confirms it?

One-paragraph recap¶

The dashboard has seven panels: loss curves, LR schedule, grad-norm history, per-layer activations, per-layer weight spectral norms, dead-neuron count, and the §A13-specific regular-vs-irregular loss decomposition. Each panel surfaces one bug class; the bugs of Phase 19 (bad init, no warmup, broken mask) each show up in a different panel and are diagnosable from the dashboard alone. The seventh panel is the one that turns this dashboard from a generic LM debugger into a verb-grammar curriculum tool. Keep the panel count at seven — adding more = noise and overhead.

What this section does NOT cover¶

The math of each metric (panel 6's dead-neuron criterion, panel 5's spectral norm, panel 7's class decomposition). That's theory/02-dashboard-metrics.md.
The mechanism of each engineered bug. That's theory/03-three-failure-modes.md.
The implementation of the hooks. Lab 00.

Next: theory/02-dashboard-metrics.md.