Phase 18: Training Loop, Checkpointing, Mixed-Precision Preview
Requires: 12 (The Corpus: Designing the Microscopic Dataset) · 17 (Tiny Transformer Block & Mini-GPT)
Teaches: training-loop · batching · adamw · warmup · cosine-decay · checkpointing
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
The phase where the model finally learns. The training loop is unglamorous prose: correct batching, a properly reduced loss, an honest schedule, checkpoints that reload identically. If this is not solid, Phases 19-22 become endless debugging.
Goal
Train MiniGPT (from Phase 17) on the Phase-12 corpus of English verb conjugations + Spanish translations, beat the Phase-14 n-gram baseline on the held-out (verb, tense, person) split, save the result as a reloadable safetensors checkpoint, and log the run to mlflow (installed for the first time in this phase per LYNX_CORTEX_ADDENDUM.md §A8).
By the end of Phase 18, Borja can explain — with code he wrote — why every line in src/minitrain/loop.py exists, what would break if it were removed, and how the optimizer state, scheduler state, RNG state, and model weights together form a reproducible training run.
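To make the "together form a reproducible training run" claim concrete, here is a minimal NumPy sketch. The `snapshot` and `restore_rng` helpers and the dictionary layout are hypothetical illustrations, not the actual contents of src/minitrain/loop.py. The sketch assumes the learning rate is a pure function of the step counter, so the step stands in for the scheduler state. It shows that resuming with the saved RNG state draws the same next shuffle as the uninterrupted run.

```python
import copy
import numpy as np

def snapshot(step, weights, moments, rng):
    """Capture what a resume needs (hypothetical layout, for illustration only)."""
    return {
        "step": step,  # assumption: the schedule is a pure function of step
        "weights": {k: v.copy() for k, v in weights.items()},
        "moments": {k: v.copy() for k, v in moments.items()},  # optimizer state
        "rng_state": copy.deepcopy(rng.bit_generator.state),
    }

def restore_rng(snap):
    """Rebuild a Generator that continues exactly where the snapshot left off."""
    rng = np.random.default_rng()
    rng.bit_generator.state = copy.deepcopy(snap["rng_state"])
    return rng

rng = np.random.default_rng(18)
rng.permutation(10)  # pretend one epoch's shuffle was already drawn

snap = snapshot(
    step=1,
    weights={"w": np.zeros(3)},
    moments={"m": np.zeros(3), "v": np.zeros(3)},
    rng=rng,
)

next_uninterrupted = rng.permutation(10)          # what the original run draws next
next_resumed = restore_rng(snap).permutation(10)  # what a resumed run draws next
same_order = bool(np.array_equal(next_uninterrupted, next_resumed))
```

Re-seeding with the original seed on resume would instead replay the first shuffle, which is why the RNG state belongs in the checkpoint alongside the weights and optimizer moments.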
Read order
- theory/00-motivation.md: why the training loop is a correctness exercise, not a tuning one.
- theory/01-batching-loss-mask.md: batching variable-length sentences, padding, attention masks, loss reduction conventions.
- theory/02-optimizer-and-schedule.md: AdamW + warmup + cosine decay, derived from Phase 4's groundwork.
- theory/03-mixed-precision-preview.md: fp16/bf16 mathematics, why we don't train in mp on this hardware, what Phase 26 will deepen.
- theory/04-checkpoints-and-mlflow.md: safetensors over pickle, manifest discipline, how mlflow wraps (not replaces) it.
- lab/00-batch-and-mask.md: assemble batches from the verb-conjugation JSONL shards with correct masks.
- lab/01-train-mini.md: first real training run; beat the n-gram baseline.
- lab/02-checkpoint-roundtrip.md: save → reload → forward; assert fp64 byte-equivalence.
- lab/03-mp-drift.md: measure fp16 numerical drift on the forward pass.
- lab/04-mlflow-wiring.md: wrap the existing manifest discipline with mlflow, do not replace it.
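The batching and loss-reduction topics in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `PAD_ID = 0`, the helper names `pad_batch` and `masked_mean_nll`, and the token-mean reduction are choices made for this example, not the conventions fixed by the theory chapters. In a real language-model batch the targets would be the inputs shifted by one position; the demo skips that to stay short.

```python
import numpy as np

PAD_ID = 0  # assumption: id 0 is reserved for padding

def pad_batch(seqs, pad_id=PAD_ID):
    """Right-pad token-id lists to a common length; return ids and a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    ids = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=np.float64)
    for i, s in enumerate(seqs):
        ids[i, :len(s)] = s
        mask[i, :len(s)] = 1.0
    return ids, mask

def masked_mean_nll(logits, targets, mask):
    """Cross-entropy averaged over positions where mask == 1 (token-mean)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return (nll * mask).sum() / mask.sum()

ids, mask = pad_batch([[5, 3, 2], [4]])
vocab_size = 8

# Toy logits that strongly favour the padding id at every position.
logits = np.zeros((2, 3, vocab_size))
logits[..., PAD_ID] = 5.0

masked = masked_mean_nll(logits, ids, mask)                # real tokens only
naive = masked_mean_nll(logits, ids, np.ones_like(mask))   # padding counted too
perplexity = float(np.exp(masked))
```

With these toy logits the padded positions are trivially "predicted", so the naive mean comes out lower than the masked mean. An unmasked reduction would therefore report a better perplexity than the model earned on real tokens.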
solutions/ is empty during pre-write — populated at phase open after Borja's Phase 17 MiniGPT API is fixed.
Definition of Done
See PHASE_18_PLAN.md §6. Briefly:
- models/minigpt-phase18-<hash>.safetensors exists and reloads byte-equivalently.
- Val perplexity from experiments/18-train-mini/ beats the Phase-14 n-gram baseline on the held-out (verb, tense, person) split.
- One mlflow run exists locally; URI recorded in PHASE_18_REPORT.md.
- Per-layer fp16-drift plot exists from experiments/18-mp-drift/.
- No pickle import in src/minitrain/.
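One possible reading of "reloads byte-equivalently" is sketched below: same tensor names, same dtypes, same shapes, and identical raw bytes. The helper name `byte_equivalent` is hypothetical, and the reload is simulated with an in-memory copy so the snippet runs without a checkpoint file. In the lab the `reloaded` dictionary would come from the safetensors file, for example via `safetensors.numpy.load_file`.

```python
import numpy as np

def byte_equivalent(before, after):
    """True when both dicts hold the same names, dtypes, shapes, and raw bytes."""
    if before.keys() != after.keys():
        return False
    for name, a in before.items():
        b = after[name]
        if a.dtype != b.dtype or a.shape != b.shape:
            return False
        if a.tobytes() != b.tobytes():  # compares bit patterns, not values
            return False
    return True

weights = {"wte": np.arange(6, dtype=np.float64).reshape(2, 3) / 7.0}

# Stand-in for a faithful save -> reload. With a real file this would be:
#   from safetensors.numpy import save_file, load_file
#   save_file(weights, path); reloaded = load_file(path)
reloaded = {k: v.copy() for k, v in weights.items()}

# A reload that silently passed through float32 keeps dtype and shape
# after casting back, but not the bytes.
lossy = {k: v.astype(np.float32).astype(np.float64) for k, v in weights.items()}

ok_faithful = byte_equivalent(weights, reloaded)
ok_lossy = byte_equivalent(weights, lossy)
```

Comparing bytes rather than values is a deliberate choice in this sketch: a value comparison treats NaN as unequal to itself and cannot tell 0.0 from -0.0, while a byte comparison has neither problem.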
What this phase intentionally does NOT cover
- PyTorch. Phase 24. Training stays pure NumPy + hand-built minitorch + safetensors + mlflow.
- Real mixed-precision training. Phase 26. We compute drift, we do not optimize through it.
- Distributed training. Phase 35. Single-process only.
- Hyperparameter search. Phase 18 trains one configuration that beats baseline. Tuning is a Phase 19 dynamics exercise.
- LoRA / PEFT. Phase 28. Full-parameter training only.
- Eval beyond perplexity. Phase 20. Here, perplexity vs baseline is the only metric that decides DoD.
- KV cache. Phase 22. Training never uses one; inference does.
- Sampling / generation quality. Phase 21. Training optimizes the log-likelihood; the decoding policy is the next phase.
Phase 18's scope is the first reproducible, observable, checkpointable training loop for MiniGPT on the verb-grammar corpus. Nothing more.
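The line drawn above between computing fp16 drift and optimizing through it can be illustrated with a forward-only comparison. The sketch below is an assumption-laden stand-in: a stack of small dense layers with tanh replaces MiniGPT, and "drift" is taken to mean the maximum absolute difference between the fp64 and fp16 activations of each layer. The actual lab may define the metric differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for MiniGPT: four dense layers with tanh activations.
weights = [rng.normal(0.0, 0.5, size=(16, 16)) for _ in range(4)]
x = rng.normal(size=(8, 16))

def forward(x, weights, dtype):
    """Run the stack entirely in `dtype`; return every layer's activations."""
    h = x.astype(dtype)
    activations = []
    for w in weights:
        h = np.tanh(h @ w.astype(dtype))
        activations.append(h)
    return activations

reference = forward(x, weights, np.float64)
half = forward(x, weights, np.float16)

# Per-layer drift: worst-case absolute gap between fp64 and fp16 activations.
drift = [
    float(np.max(np.abs(ref - low.astype(np.float64))))
    for ref, low in zip(reference, half)
]
```

No gradient is taken anywhere in this snippet, which is the point: the measurement needs only two forward passes, so it fits this phase's scope without touching loss scaling or fp16 optimizer updates.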
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Decoupled Weight Decay Regularization (AdamW), Loshchilov & Hutter · 2017. The optimizer and weight decay the training loop actually uses.
- 📄 SGDR: Stochastic Gradient Descent with Warm Restarts, Loshchilov & Hutter · 2016. Where the cosine LR schedule comes from.
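For orientation before reading the two papers, here is a minimal sketch of both ideas in NumPy. The function names, default hyperparameters, and the linear shape of the warmup are assumptions made for this example, not the values used in src/minitrain/. The schedule is a single cosine cycle without the warm restarts that SGDR describes, and the AdamW step applies weight decay directly to the parameters instead of adding it to the gradient.

```python
import math
import numpy as np

def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then one cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def adamw_step(p, g, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled decay: the weight_decay * p term bypasses the adaptive scaling.
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

# Schedule at a few landmark steps (10 warmup steps, 100 total).
lr_first = lr_at(0, 1.0, 10, 100)
lr_peak = lr_at(10, 1.0, 10, 100)
lr_mid = lr_at(55, 1.0, 10, 100)
lr_end = lr_at(100, 1.0, 10, 100)

# First update with decay switched off: the step size is about lr * sign(g).
p0 = np.array([1.0])
g0 = np.array([0.5])
p1, m1, v1 = adamw_step(p0, g0, np.zeros(1), np.zeros(1), t=1, lr=0.1,
                        weight_decay=0.0)
```

In a training loop the two compose naturally: compute `lr_at(step, ...)` once per step and pass it as `lr` to every `adamw_step` call for that step.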