

Phase 18 — Training Loop, Checkpointing, Mixed-Precision Preview

Requires: 12 — The Corpus: Designing the Microscopic Dataset · 17 — Tiny Transformer Block & Mini-GPT

Teaches: training-loop · batching · adamw · warmup · cosine-decay · checkpointing

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

The phase where the model finally learns. The training loop is low-glamour prose: correct batching, a properly reduced loss, an honest schedule, checkpoints that reload identically. If this is not solid, Phases 19-22 become endless debugging.


Goal

Train MiniGPT (from Phase 17) on the Phase-12 corpus of English verb conjugations + Spanish translations, beat the Phase-14 n-gram baseline on the held-out (verb, tense, person) split, save the result as a reloadable safetensors checkpoint, and log the run to mlflow (installed for the first time in this phase per LYNX_CORTEX_ADDENDUM.md §A8).

By the end of Phase 18, Borja can explain — with code he wrote — why every line in src/minitrain/loop.py exists, what would break if it were removed, and how the optimizer state, scheduler state, RNG state, and model weights together form a reproducible training run.
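As a minimal sketch of why RNG state belongs in that list, the snippet below snapshots a NumPy generator mid-run and shows that a resumed generator draws the same next shuffle as the uninterrupted one. The `state` dictionary and its keys are hypothetical and only illustrate the four pieces named above; they are not the layout used by src/minitrain/loop.py.

```python
import copy

import numpy as np

rng = np.random.default_rng(18)
rng.permutation(8)  # pretend earlier epochs already consumed random numbers

# Hypothetical bundle of everything a resume needs (illustrative names only).
state = {
    "weights": {"w": np.ones((2, 2))},
    "adamw": {"step": 100, "m": {"w": np.zeros((2, 2))}, "v": {"w": np.zeros((2, 2))}},
    "scheduler": {"step": 100},
    "rng": copy.deepcopy(rng.bit_generator.state),  # position in the random stream
}

# What the uninterrupted run would draw for its next batch order.
next_order = rng.permutation(8)

# A fresh generator with an arbitrary seed, rewound to the saved position.
resumed = np.random.default_rng()
resumed.bit_generator.state = state["rng"]
replayed_order = resumed.permutation(8)

print(np.array_equal(next_order, replayed_order))
```

If the `"rng"` entry were dropped, the resumed run would still train, but its batch order would diverge from the original run at the first shuffle after the checkpoint.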

Read order

  1. theory/00-motivation.md — why the training loop is a correctness exercise, not a tuning one.
  2. theory/01-batching-loss-mask.md — batching variable-length sentences, padding, attention masks, loss reduction conventions.
  3. theory/02-optimizer-and-schedule.md — AdamW + warmup + cosine decay, derived from Phase 4's groundwork.
  4. theory/03-mixed-precision-preview.md — fp16/bf16 mathematics, why we don't train in mixed precision on this hardware, what Phase 26 will deepen.
  5. theory/04-checkpoints-and-mlflow.md — safetensors over pickle, manifest discipline, how mlflow wraps (not replaces) it.
  6. lab/00-batch-and-mask.md — assemble batches from the verb-conjugation JSONL shards with correct masks.
  7. lab/01-train-mini.md — first real training run; beat the n-gram baseline.
  8. lab/02-checkpoint-roundtrip.md — save → reload → forward; assert fp64 byte-equivalence.
  9. lab/03-mp-drift.md — measure fp16 numerical drift on the forward pass.
  10. lab/04-mlflow-wiring.md — wrap the existing manifest discipline with mlflow, do not replace it.
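The warmup + cosine-decay schedule covered in theory/02 can be sketched in a few lines. This is one common formulation (linear warmup to a peak, then a half-cosine down to a floor), not necessarily the exact variant the theory chapter derives; the function name `lr_at` and all hyperparameter values are assumptions made for illustration.

```python
import math


def lr_at(step, *, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate for a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half cosine: 1.0 at progress 0, 0.0 at progress 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Illustrative values only; Phase 18 fixes its own configuration.
schedule = [
    lr_at(s, peak_lr=3e-4, warmup_steps=10, total_steps=100) for s in range(100)
]
```

Keeping the schedule a pure function of the step count means the scheduler state in a checkpoint reduces to a single integer.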

solutions/ is empty during pre-write — populated at phase open after Borja's Phase 17 MiniGPT API is fixed.

Definition of Done

See PHASE_18_PLAN.md §6. Briefly:

  • models/minigpt-phase18-<hash>.safetensors exists and reloads byte-equivalently.
  • Val perplexity from experiments/18-train-mini/ beats the Phase-14 n-gram baseline on the held-out (verb, tense, person) split.
  • One mlflow run exists locally; URI recorded in PHASE_18_REPORT.md.
  • Per-layer fp16-drift plot exists from experiments/18-mp-drift/.
  • No pickle import in src/minitrain/.
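For the perplexity criterion, a small sketch of the quantity being compared may help: perplexity is the exponential of the mean negative log-likelihood per real token, so padding positions must be masked out before the mean is taken. The function names, shapes, and toy batch below are assumptions for illustration, not the src/minitrain/ API.

```python
import numpy as np


def masked_nll(logits, targets, mask):
    """Mean negative log-likelihood over real (non-padding) tokens.

    logits: (B, T, V) floats; targets: (B, T) ints; mask: (B, T), 1 = real token.
    """
    # Log-softmax with the row maximum subtracted for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target token.
    token_ll = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    mask = mask.astype(log_probs.dtype)
    # Divide by the number of real tokens, not by B * T.
    return float(-(token_ll * mask).sum() / mask.sum())


def perplexity(nll):
    return float(np.exp(nll))


# Toy batch: two sequences padded to length 4, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5))
targets = np.array([[1, 2, 3, 0], [4, 0, 0, 0]])
mask = np.array([[1, 1, 1, 0], [1, 0, 0, 0]])
loss = masked_nll(logits, targets, mask)
print(loss, perplexity(loss))
```

The comparison against the n-gram baseline is only meaningful if both numbers use the same tokenization and the same per-token reduction.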

What this phase intentionally does NOT cover

  • PyTorch. Phase 24. Training stays pure NumPy + hand-built minitorch + safetensors + mlflow.
  • Real mixed-precision training. Phase 26. We compute drift, we do not optimize through it.
  • Distributed training. Phase 35. Single-process only.
  • Hyperparameter search. Phase 18 trains one configuration that beats baseline. Tuning is a Phase 19 dynamics exercise.
  • LoRA / PEFT. Phase 28. Full-parameter training only.
  • Eval beyond perplexity. Phase 20. Here, perplexity vs baseline is the only metric that decides DoD.
  • KV cache. Phase 22. Training never uses one; inference does.
  • Sampling / generation quality. Phase 21. Training optimizes the log-likelihood; the decoding policy is the next phase.
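To make "we compute drift, we do not optimize through it" concrete, here is a hedged sketch of a forward-only drift measurement: the same activations are pushed through a float64 path and a float16 path, and the gap is recorded after every layer. The toy tanh stack stands in for MiniGPT, and `relative_drift` is a hypothetical helper; the metric the lab actually specifies may differ.

```python
import numpy as np


def relative_drift(reference, approx):
    """Largest absolute error, normalized by the largest reference magnitude."""
    ref = reference.astype(np.float64)
    err = np.abs(approx.astype(np.float64) - ref).max()
    return float(err / (np.abs(ref).max() + 1e-30))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
# Toy stand-in for the model: four dense layers with tanh activations.
weights = [rng.normal(size=(64, 64)) / np.sqrt(64) for _ in range(4)]

h64 = x                      # float64 reference path
h16 = x.astype(np.float16)   # float16 path; inputs are rounded once here
drift_per_layer = []
for w in weights:
    h64 = np.tanh(h64 @ w)
    h16 = np.tanh(h16 @ w.astype(np.float16))  # result stays float16
    drift_per_layer.append(relative_drift(h64, h16))

for i, d in enumerate(drift_per_layer):
    print(f"layer {i}: relative drift {d:.2e}")
```

No gradient is computed anywhere in this measurement, which is the line between this phase and the mixed-precision training deferred to Phase 26.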

Phase 18's scope is the first reproducible, observable, checkpointable training loop for MiniGPT on the verb-grammar corpus. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.