
00 - Motivation: "Training looks fine" is the worst sentence in ML

The loop trains, the loss goes down, the model "works". And yet, 90% of the time a bug gets diagnosed, its cause was there from the first step and nobody looked. Phase 19 is the discipline of always looking.

The premise

Phase 18 produced a training loop. It runs. Loss goes down. Val PPL beats baseline. You could call that done.

You shouldn't.

Here are five questions Phase 18 doesn't answer:

Are any layers dead? A transformer can have an entire attention head whose softmax outputs are concentrated on one key for every query; that head is contributing nothing. You can't tell from the loss curve.
Are the gradients balanced across layers? If layer 0 sees gradients 1000× smaller than layer 1, the early layers are barely training. Loss might still drop because the later layers absorb everything.
Is the learning rate appropriate? A loss curve that appears to drop smoothly under an LR 10× higher than it should be usually hides spikes that show up only at fine time resolution; the optimizer recovers from each spike, which is fine until it isn't.
Is the model memorizing or generalizing? Train loss and val loss tell you the answer, but only if you plot them together at the right resolution.
Is the model learning the right thing? For the verb-grammar corpus: is it picking up the regular -ed past-tense pattern (12 verbs share it), or is it memorizing per-form? A model that hasn't learned the pattern will appear trained — it nailed the train set — but fail spectacularly on the four held-out verbs.

A lone training-loss curve answers none of these. Phase 19's dashboard answers all of them.
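As an illustration of the first question, a collapsed attention head can be flagged from the entropy of its softmax rows. This is only a sketch: the function names, the 5% threshold, and the assumption that every row covers the same number of keys are hypothetical choices, not the lab's actual criterion.

```python
import math

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention row, i.e. one query's softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def head_is_collapsed(attn, frac=0.05):
    """attn: list of rows, each a probability distribution over the same keys.

    Flags the head when the mean row entropy falls below `frac` of the
    maximum possible entropy log(n_keys). `frac` is a hypothetical threshold.
    """
    n_keys = len(attn[0])
    max_entropy = math.log(n_keys)
    mean_entropy = sum(row_entropy(row) for row in attn) / len(attn)
    return mean_entropy < frac * max_entropy

# A head that spreads attention evenly vs. one that puts ~all mass on one key.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
collapsed = [[0.997, 0.001, 0.001, 0.001] for _ in range(4)]
print(head_is_collapsed(uniform))    # False
print(head_is_collapsed(collapsed))  # True
```

Nothing in the loss value carries this information; it has to be read off the attention weights themselves.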

The thesis

You earn the right to optimize a model only after you can describe, with measurements, what it is currently doing.

This is "no debugging without instrumentation" applied to training. In Phase 18 we wired up the loop. In Phase 19 we wire up the forensics.

The deliverable: a single HTML file per training run, self-contained (no JS, no CDN, no external assets), with seven panels:

Loss curves (train + val).
LR schedule (so you can verify the warmup actually warmed up).
Gradient norm history (so you can see clipping in action).
Per-layer activation magnitudes (so dead/exploding layers are obvious).
Per-layer weight spectral norms (so growth is visible).
Dead-neuron / dead-head count over time.
Per-verb-class loss decomposition (regular vs irregular).

Open the file months later and it still tells you everything. That's the point. TensorBoard requires a running server. MLflow requires a database. An HTML file in your repo requires nothing.
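One way to get a file with no JS, no CDN, and no external assets is to inline each rendered panel as a base64 data URI. The sketch below assumes the panels already exist as PNG bytes; `render_dashboard`, the run name, and the placeholder bytes are hypothetical and not the lab's actual renderer.

```python
import base64
import html

def render_dashboard(title, panels):
    """Build one self-contained HTML page.

    panels: list of (caption, png_bytes) pairs. Each image is inlined as a
    base64 data URI, so the page needs no external files, scripts, or server.
    """
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    for caption, png_bytes in panels:
        encoded = base64.b64encode(png_bytes).decode("ascii")
        safe_caption = html.escape(caption)
        parts.append(f"<h2>{safe_caption}</h2>")
        parts.append(f"<img alt='{safe_caption}' src='data:image/png;base64,{encoded}'>")
    parts.append("</body></html>")
    return "\n".join(parts)

# Placeholder bytes stand in for a real PNG saved from a plotting library.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
page = render_dashboard("run-0001", [("Loss curves (train + val)", fake_png)])
```

Writing `page` to disk yields a single file that a browser can open directly from the repo.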

Why three engineered breaks

Reading is not understanding. The lab introduces three specific training failures:

Bad init — Xavier-init scaled by 100. Activations explode. NaN by step 50.
No warmup — W = 0, full LR from step 0. Loss spikes wildly early, then slowly recovers.
Broken causal mask — the mask is True everywhere; the model sees the future. Train loss drops faster than the n-gram baseline can; val loss stays flat.
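The third break is easy to state precisely. The sketch below assumes the convention implied above, where `True` at `mask[i][j]` means query position `i` may attend to key position `j`; the helper names are hypothetical.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def leaks_future(mask):
    """True if any query position is allowed to see a later key position."""
    n = len(mask)
    return any(mask[i][j] for i in range(n) for j in range(i + 1, n))

good = causal_mask(4)                    # lower-triangular, diagonal included
broken = [[True] * 4 for _ in range(4)]  # the engineered break: True everywhere
print(leaks_future(good))    # False
print(leaks_future(broken))  # True
```

A check like this catches the bug directly, but the point of the lab is to recognize it from the dashboard alone.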

Each break produces a signature on the dashboard. The lab does not tell you in advance which break is which; you run all three, look at the three dashboards, and write down which signature matches which cause. Then you check.
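As one example of what a signature can look like numerically, the no-warmup break shows up as an LR trace that never moves. The sketch below assumes a plain linear warmup followed by a constant rate; the actual Phase 18 schedule may differ, and `lr_at`, `warmup_is_flat`, the base LR of 3e-4, and the 100-step warmup are all hypothetical.

```python
def lr_at(step, base_lr, warmup_steps):
    """Linear warmup to base_lr, then constant (any decay is omitted here)."""
    if warmup_steps <= 0:
        return base_lr  # W = 0: full LR from step 0
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def warmup_is_flat(lrs, tol=1e-12):
    """True when the logged LR never changes over the window."""
    return max(lrs) - min(lrs) <= tol

healthy = [lr_at(s, 3e-4, 100) for s in range(100)]
no_warmup = [lr_at(s, 3e-4, 0) for s in range(100)]
print(warmup_is_flat(healthy))    # False
print(warmup_is_flat(no_warmup))  # True
```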

If your diagnoses are 3/3, you've earned the instrument. If 2/3, re-read the dashboard panels and try again. If 1/3, re-read this entire phase.

Why the regular-vs-irregular panel matters

Per §A13, the corpus has 12 regular verbs and 8 irregular verbs. The regular conjugation rule is one shared pattern:

work → works (3sg) → worked (past) → worked (pp) → will work (future) → going to work
play → plays → played → played → will play → going to play
...

12 verbs × 5 tenses × 3 persons = 180 regular forms — all generable from the one rule.

The irregular verbs each have their own pattern:

be → am / is / are → was / were → been → will be → going to be
have → has → had → had → will have → going to have
go → goes → went → gone → will go → going to go
...

8 verbs × 5 tenses × 3 persons = 120 irregular forms — each verb largely independent.

A model that has truly learned the corpus should:

Crush the regulars by step ~500. The rule is simple; learning it once unlocks 180 forms.
Crawl through the irregulars over ~1500 steps. Each verb is largely its own memorization task.

If the regular loss and irregular loss drop in lockstep, the model is memorizing — there's no generalization happening. If the irregular loss drops first (which would be weird), there's a data leak somewhere.

The healthy pattern: regular loss drops fast, irregular loss lags behind, the gap widens for ~500 steps, then narrows as the model memorizes irregulars. The narrowing slope and the asymptotic gap together tell you whether the model is learning rules plus exceptions or purely memorizing. Phase 20's eval harness will quantify this for held-out cases; Phase 19's dashboard lets you see it during training.
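The quantity behind this panel is a per-class mean of per-example losses, plus the gap between the two classes. The sketch below assumes each example carries a `"regular"` or `"irregular"` label; the function names and the toy numbers are hypothetical, and Theory 02 gives the actual decomposition formula.

```python
from collections import defaultdict

def per_class_mean_loss(losses, classes):
    """Average per-example losses within each class label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for loss, cls in zip(losses, classes):
        totals[cls] += loss
        counts[cls] += 1
    return {cls: totals[cls] / counts[cls] for cls in totals}

def regular_irregular_gap(losses, classes):
    """Irregular minus regular mean loss; positive means the irregulars lag."""
    means = per_class_mean_loss(losses, classes)
    return means["irregular"] - means["regular"]

# Toy batch: two regular forms already learned, two irregular forms still costly.
losses = [0.2, 0.4, 1.5, 2.1]
classes = ["regular", "regular", "irregular", "irregular"]
means = per_class_mean_loss(losses, classes)
gap = regular_irregular_gap(losses, classes)
```

Logging `gap` at every evaluation step gives the curve whose widening and narrowing the paragraph above describes.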

The five state machines, instrumented

Phase 18 introduced the five state machines. Phase 19 attaches a probe to each:

Model — per-layer activation/weight stats via forward hooks.
Optimizer — gradient norm, per-parameter gradient stats via backward hooks.
Scheduler — LR plotted alongside loss; a flat LR through "warmup" is bug #2's signature.
Data iterator — per-class loss decomposition is the iterator-side probe.
Training-control RNG — recorded in the manifest; not actively plotted.

Each probe is cheap (Welford running stats, not full tensor dumps). The total overhead must stay ≤ 30%; the lab measures and enforces this.
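Welford's method keeps a running mean and variance in constant memory, which is why a probe can summarize a layer without storing its tensors. A minimal scalar sketch follows; the class name `RunningStats` is hypothetical, and a real probe would presumably feed it per-step summaries of activations or gradients.

```python
class RunningStats:
    """Welford's online algorithm for mean and variance."""

    def __init__(self):
        self.n = 0       # number of values seen
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    @property
    def variance(self):
        """Population variance; 0.0 before any value has been seen."""
        return self.m2 / self.n if self.n > 0 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.mean, stats.variance)  # 5.0 4.0
```

Each update costs a handful of floating-point operations, and the state is three numbers per tracked quantity.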

What this phase is not about

Tuning. No hyperparameter sweeps. Three engineered breaks ≠ exploration. The dashboard tells you what's broken; Phase 20 measures quality; Phase 28 tunes (with LoRA).
Live monitoring. No streaming dashboards. Each training run produces one HTML file at the end. Live monitoring requires a server; we don't have one in Phase 19.
Beauty contest. The HTML file is functional, not artistic. Seven matplotlib panels in a vertical scroll. Don't spend hours on CSS.

The shape of the phase

Theory 01 is the panel catalog: what each panel shows and what bugs each panel surfaces.
Theory 02 is the math: spectral norm via power iteration, dead-neuron criterion, grad-to-weight ratio, loss-spike detector, per-class decomposition formula.
Theory 03 is the three failure modes: each one's dashboard signature, the exact mechanism that produces it, and how to fix it once diagnosed.
Lab 00 writes the hooks and measures their overhead.
Lab 01 renders the dashboard.
Lab 02 runs the three breaks; you diagnose them.
Lab 03 overfits on purpose so you can watch the train/val gap open and the regular/irregular gap stabilize.
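As a preview of the math in Theory 02, the spectral norm of a weight matrix (its largest singular value) can be estimated by power iteration on W^T W. The sketch below is a pure-Python illustration under stated assumptions: the function name, the iteration count, and the seeded random start vector are hypothetical choices, and the lab's version may differ.

```python
import math
import random

def spectral_norm(W, iters=100, seed=0):
    """Estimate the largest singular value of W (a list of rows)."""
    n_rows, n_cols = len(W), len(W[0])
    rng = random.Random(seed)
    # Random start so the vector is not accidentally orthogonal to the top direction.
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = W v
        w = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = W^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # W maps v to zero; for a random v this means W is (numerically) zero
        v = [x / norm for x in w]
    u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))  # ||W v|| for the converged unit vector v

print(spectral_norm([[3.0, 0.0], [0.0, 1.0]]))  # approximately 3.0
```

Convergence speed depends on the ratio between the two largest singular values, so a fixed iteration count is a trade-off between cost and accuracy.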

Plan ~8–12 study hours over 4–5 sessions.

Stop here if

You're tempted to skip the diagnosis lab and read the solution first. The solution is deliberately gated. The phase's only durable value is in the moment you look at three dashboards, write three guesses, and find out you were right (or wrong, with a specific lesson). Peeking discards that value.

What this phase does NOT cover

Eval / model quality. Phase 20. Phase 19 is training-time only.
Generation behavior. Phase 21. Decoding policy is its own diagnostic problem.
Loss-landscape visualization. Out of scope (would require evaluating loss at many parameter perturbations).
GPU profiling. Phase 23 / 24 / 27.

Next: theory/01-what-to-instrument.md.