
Phase 19: Training Dynamics & Debugging

Requires: Phase 18 (Training Loop, Mixed Precision Preview, Checkpointing)
Teaches: instrumentation · hooks · gradient-norms · loss-curves · debugging
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

Phase 18 taught you to train. This one teaches you to look. Three deliberately introduced bugs + a static HTML dashboard + the regular-vs-irregular breakdown = the difference between "the model trains" and "I can prove it".

Goal

Build instrumentation that turns the Phase-18 training loop from "runs to completion" into "produces a forensic record of what happened inside." Then validate the instrumentation by (a) introducing three engineered failures and diagnosing each from the dashboard alone, (b) extending training past convergence to see the train/val gap open, and (c) decomposing loss by verb class to observe how the model learns regular verbs (12 of them, with -ed past tense and -s third-person singular) before it learns irregular verbs (8 of them: be, have, do, go, come, see, eat, write).

By the end of Phase 19, Borja has the looking apparatus that the rest of the curriculum (Phases 20–22, then again 26–28) leans on every time a training run misbehaves.
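The per-class decomposition in goal (c) amounts to grouping per-token losses by verb class and averaging within each group. The sketch below is a minimal NumPy illustration, not the lab's implementation. The class constants `REGULAR` and `IRREGULAR`, the function `loss_by_class`, and the idea that each token arrives with a class id are all assumptions made for the example. Reading the "tax" as the gap between the two means is likewise an assumption.

```python
import numpy as np

# Hypothetical class ids; the real tagging scheme is not specified on this page.
REGULAR, IRREGULAR = 0, 1

def loss_by_class(token_losses, class_ids, num_classes=2):
    """Mean per-token loss for each class; NaN for a class with no tokens."""
    token_losses = np.asarray(token_losses, dtype=np.float64)
    class_ids = np.asarray(class_ids, dtype=np.int64)
    # Sum of losses and token counts per class, in one pass each.
    sums = np.bincount(class_ids, weights=token_losses, minlength=num_classes)
    counts = np.bincount(class_ids, minlength=num_classes)
    means = np.full(num_classes, np.nan)
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen]
    return means

# Toy batch: three regular-verb tokens, two irregular-verb tokens.
losses = [0.2, 0.4, 1.5, 2.5, 0.3]
classes = [REGULAR, REGULAR, IRREGULAR, IRREGULAR, REGULAR]

per_class = loss_by_class(losses, classes)
# Assumed reading of the "regular/irregular tax": irregular mean minus regular mean.
gap = per_class[IRREGULAR] - per_class[REGULAR]
print(per_class, gap)
```

Logging `per_class` at every evaluation step would give the two curves a regular-vs-irregular panel needs. Averaging per class rather than per batch keeps the 12-vs-8 verb imbalance from hiding the slower class.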

Read order

theory/00-motivation.md — why "training looks fine" is the worst sentence in ML.
theory/01-what-to-instrument.md — the seven panels of the dashboard, and what each one reveals.
theory/02-dashboard-metrics.md — math for each metric: spectral norm, dead neurons, grad/activation ratios, loss-spike detection, per-class decomposition.
theory/03-three-failure-modes.md — anatomy of bad init, missing warmup, and broken causal mask: what each one should look like on the dashboard.
lab/00-instrument-hooks.md — write the forward/backward hooks; verify overhead ≤ 30%.
lab/01-build-dashboard.md — render the streaming stats to a self-contained HTML file with the regular-vs-irregular panel.
lab/02-break-it.md — run the three engineered breaks; diagnose each from the dashboard before peeking.
lab/03-overfit-on-purpose.md — extend training past convergence; see the train/val gap open and the regular/irregular tax stabilize.
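Several of the metrics named in theory/02-dashboard-metrics.md have short standard formulations, sketched below in NumPy. This is a hedged illustration only: the function names, the trailing-window spike rule, and the defaults (`iters=50`, `window=20`, `k=3.0`) are assumptions chosen for the example, and the chapter's own definitions may differ.

```python
import numpy as np

def spectral_norm(W, iters=50, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = W.T @ (W @ v)
        norm = np.linalg.norm(v)
        if norm == 0.0:          # W maps everything to zero
            return 0.0
        v /= norm
    return float(np.linalg.norm(W @ v))

def dead_fraction(activations, eps=0.0):
    """Fraction of units with activation <= eps on every example.

    `activations` has shape (batch, units), e.g. post-ReLU outputs.
    """
    return float(np.all(activations <= eps, axis=0).mean())

def global_grad_norm(grads):
    """L2 norm over all gradient arrays taken together."""
    return float(np.sqrt(sum(float(np.sum(g * g)) for g in grads)))

def loss_spikes(losses, window=20, k=3.0):
    """Steps whose loss exceeds the trailing-window mean by k trailing std devs."""
    losses = np.asarray(losses, dtype=np.float64)
    spikes = []
    for i in range(window, len(losses)):
        past = losses[i - window:i]
        # Small constant avoids flagging ties when the window is perfectly flat.
        if losses[i] > past.mean() + k * past.std() + 1e-12:
            spikes.append(i)
    return spikes

print(spectral_norm(np.diag([3.0, 1.0])))           # close to 3.0
print(loss_spikes([1.0] * 20 + [5.0] + [1.0] * 5))  # [20]
```

Power iteration avoids a full SVD, which matters if the metric is computed every few steps. The spike rule compares each step only to its own recent history, so a slowly decreasing loss does not trigger it.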

solutions/ is empty during pre-write — populated at phase open, after Borja has committed his diagnoses.

Definition of Done

See PHASE_19_PLAN.md §6. Briefly:

experiments/19-healthy/dashboard.html shows the reference pattern.
Three diagnoses committed in experiments/19-break-it/borja-diagnoses.md before consulting solutions.
Diagnosis accuracy on the three breaks recorded honestly in PHASE_19_REPORT.md.
The instrumented training loop's overhead is ≤ 30%.
The regular-vs-irregular loss panel is visible in at least the healthy dashboard.
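The overhead criterion can be checked by timing the same training step with and without instrumentation. The sketch below assumes "overhead" means the relative increase in median step time over the uninstrumented loop, which this page does not state explicitly. `median_step_time`, `relative_overhead`, and `within_budget` are hypothetical helper names, and `step_fn` stands in for whatever callable runs one training step.

```python
import time

def median_step_time(step_fn, n_steps=30, warmup=5):
    """Median wall-clock seconds per call of step_fn, after a few warmup calls."""
    for _ in range(warmup):
        step_fn()
    times = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        step_fn()
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]

def relative_overhead(baseline_s, instrumented_s):
    """Fractional slowdown: 0.25 means the instrumented step is 25% slower."""
    return instrumented_s / baseline_s - 1.0

def within_budget(baseline_s, instrumented_s, budget=0.30):
    """True if the slowdown stays at or under the budget (30% by default)."""
    return relative_overhead(baseline_s, instrumented_s) <= budget

# Worked numbers: 1.00 s baseline vs 1.25 s instrumented is 25% overhead.
print(relative_overhead(1.0, 1.25), within_budget(1.0, 1.25))
```

The median is used rather than the mean so that a single slow step (a checkpoint write, for example) does not dominate the comparison.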

What this phase intentionally does NOT cover

Eval beyond perplexity / loss. Phase 20. Phase 19 is about dynamics during training, not quality of the final model.
Hyperparameter search. A real sweep belongs in Phase 28's LoRA work; the three breaks here are engineered failures, not random tuning.
PyTorch. Phase 24.
Distributed monitoring. Phase 35 + Phase 34's observability story.
Fixing the breaks beyond noting the fix. Diagnosing is the lesson; re-running with the fix is a one-liner.
Sampling-time analysis. Phase 21 looks at decoding-policy behavior; Phase 19 is training-time only.

Phase 19's scope is: the dashboard + three diagnoses + the regular/irregular tax visualization. That's all.