
00 — Why the training loop is a correctness exercise¶

The temptation in this phase is to tune hyperparameters. That is a mistake. 90% of the work here is making the loop correct and reproducible: batching, masks, loss reduction, schedule, checkpoints. Once the loop is correct, hyperparameters are an afternoon of sweeps. Before it is correct, hyperparameters are noise on top of noise.

The trap¶

When somebody opens "training loop" in a tutorial, they expect to learn how to make the model better. Bigger batches, better LR, more warmup, fancy schedules, exponential moving averages. This intuition is wrong in the most important sense: a training loop's job is to be correct, not to be optimal.

A correct loop trains the model that the math says you should be training. An incorrect loop trains a different model: one that's silently 5% worse, or that diverges on Tuesdays, or whose checkpoint, when reloaded, predicts slightly different logits. Every minute spent tuning an incorrect loop is wasted. Every later phase that builds on top (debugging dynamics in 19, evaluating quality in 20, sampling in 21, optimizing inference in 22) is poisoned by that broken foundation.

Phase 18 is a correctness phase. The deliverable is:

A training loop that you have personally verified is doing exactly what the math says, with every state-bearing component (weights, optimizer moments, scheduler step, RNG, data iterator) serialized and restorable.

Hyperparameters come later. So does winning.

The five state machines¶

A training loop is the interleaving of five state machines:

The model. Parameters \(\theta\). Updated once per step.
The optimizer. AdamW maintains \(m_t, v_t, t\) per parameter. Updated once per step.
The scheduler. Holds the current step \(t\) and the schedule shape. Updated once per step.
The data iterator. Holds the current epoch + position + RNG seed for the shuffle. Updated once per batch.
The training-control RNG. Used for dropout, augmentation, sampling. Distinct from the data-shuffle RNG.

A checkpoint that saves only \(\theta\) is a broken checkpoint: reloading it gives the same weights but a different optimizer trajectory, a different LR, a different next batch, and a different dropout pattern. The run after the reload is not the same run as the one before it. This is the deep meaning of "byte-equivalent reload" in the DoD: every one of the five state machines is restored, so a run that resumes across a checkpoint boundary matches, byte for byte, a run that never checkpointed.
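The difference between a full checkpoint and a weights-only checkpoint can be seen on a toy problem. The following is a minimal sketch, not the curriculum's loop.py: the one-parameter "model", the `DATA` list, the warmup schedule, and the names `new_state`, `train_step`, `snapshot`, and `restore` are all assumptions made for illustration. It only shows that restoring all five pieces of state reproduces the uninterrupted run exactly, while restoring \(\theta\) alone does not.

```python
import math
import random

DATA = [0.5, -1.0, 2.0, 1.5]  # hypothetical toy targets; per-example loss is (theta - x)^2


def new_state(seed=0):
    """All five state machines for a one-parameter toy model."""
    return {
        "theta": 1.0,                         # 1. model parameters
        "opt": {"m": 0.0, "v": 0.0, "t": 0},  # 2. AdamW-style moments and step count
        "sched_step": 0,                      # 3. scheduler position
        "order": [], "pos": 0,                # 4. data iterator: epoch order and cursor...
        "data_rng": random.Random(seed),      #    ...plus the shuffle RNG
        "ctrl_rng": random.Random(seed + 1),  # 5. training-control RNG (dropout-like)
    }


def train_step(s, base_lr=0.1, warmup=4, wd=0.01):
    # Data iterator: reshuffle at each epoch boundary, then advance the cursor.
    if s["pos"] == 0:
        s["order"] = list(range(len(DATA)))
        s["data_rng"].shuffle(s["order"])
    x = DATA[s["order"][s["pos"]]]
    s["pos"] = (s["pos"] + 1) % len(DATA)
    # Control RNG: a dropout-like coin flip that zeroes the gradient 20% of the time.
    keep = 1.0 if s["ctrl_rng"].random() < 0.8 else 0.0
    grad = keep * 2.0 * (s["theta"] - x)
    # Scheduler: linear warmup, then constant.
    s["sched_step"] += 1
    lr = base_lr * min(1.0, s["sched_step"] / warmup)
    # Optimizer: Adam moments with bias correction and decoupled weight decay.
    o = s["opt"]
    o["t"] += 1
    o["m"] = 0.9 * o["m"] + 0.1 * grad
    o["v"] = 0.999 * o["v"] + 0.001 * grad * grad
    m_hat = o["m"] / (1 - 0.9 ** o["t"])
    v_hat = o["v"] / (1 - 0.999 ** o["t"])
    s["theta"] -= lr * (m_hat / (math.sqrt(v_hat) + 1e-8) + wd * s["theta"])


def snapshot(s):
    """Capture every state machine as plain data (RNGs via getstate)."""
    return {
        "theta": s["theta"], "opt": dict(s["opt"]), "sched_step": s["sched_step"],
        "order": list(s["order"]), "pos": s["pos"],
        "data_rng": s["data_rng"].getstate(), "ctrl_rng": s["ctrl_rng"].getstate(),
    }


def restore(snap):
    s = new_state()
    s["theta"], s["opt"] = snap["theta"], dict(snap["opt"])
    s["sched_step"], s["order"], s["pos"] = snap["sched_step"], list(snap["order"]), snap["pos"]
    s["data_rng"].setstate(snap["data_rng"])
    s["ctrl_rng"].setstate(snap["ctrl_rng"])
    return s


def run(s, steps):
    for _ in range(steps):
        train_step(s)
    return s["theta"]


full = run(new_state(), 10)              # never checkpointed

half = new_state()
run(half, 5)
resumed = run(restore(snapshot(half)), 5)  # full checkpoint at step 5, then resume

broken = new_state()
broken["theta"] = half["theta"]          # weights-only reload: everything else is fresh
weights_only = run(broken, 5)

print(resumed == full, weights_only == full)
```

In this sketch the weights-only run restarts warmup, resets the bias correction, and draws a different batch order and dropout pattern, which is why it lands on a different \(\theta\).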

If you can articulate the five state machines and where each lives in your loop, you've passed the conceptual gate of Phase 18.

Why the verb-grammar corpus is the right test bed¶

The Phase-12 corpus is 600 forms total: 20 verbs × 5 tenses × 3 persons gives 300 grid cells, each an English ↔ Spanish pair. Three properties make it the right test bed for Phase 18:

Small enough that bugs are visible. If your loss curve has a 0.3-PPL bump because your loss reduction is a sum rather than a mean, you'll see it in 100 steps, not 10,000. On a 1B-token corpus that bump hides in noise.
Structured enough that the model can succeed. The grid (verb × tense × person) has internal regularity — regular verbs follow -ed, irregulars don't. A correct loop on a tiny model will visibly learn the pattern. An incorrect loop will get stuck.
Held-out generalization is meaningful. If you hold out four verbs entirely, the model has to generalize from the others. This is exactly what Phase 20's evaluation harness tests, in pre-form, during training: does PPL on the held-out verbs drop, or only on the train verbs?
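Holding out whole verbs, rather than random rows, is what makes the generalization check meaningful. Here is a minimal sketch of such a split. The verb, tense, and person labels are placeholders, not the real Phase-12 corpus, and `split_by_verb` is a hypothetical helper name.

```python
import itertools

# Placeholder labels standing in for the real corpus grid (assumption).
VERBS = [f"verb{i:02d}" for i in range(20)]
TENSES = [f"tense{i}" for i in range(5)]
PERSONS = ["1st", "2nd", "3rd"]

# One cell per (verb, tense, person) combination.
grid = list(itertools.product(VERBS, TENSES, PERSONS))


def split_by_verb(cells, held_out_verbs):
    """Put every cell of a held-out verb in the held-out set.

    A random row-level split would leak each verb into both sets, so
    held-out PPL would measure memorization rather than generalization.
    """
    held = set(held_out_verbs)
    train = [c for c in cells if c[0] not in held]
    heldout = [c for c in cells if c[0] in held]
    return train, heldout


held_out_verbs = VERBS[-4:]  # arbitrary choice, for illustration only
train_cells, heldout_cells = split_by_verb(grid, held_out_verbs)

print(len(grid), len(train_cells), len(heldout_cells))  # 300 240 60
```

Tracking PPL separately on `train_cells` and `heldout_cells` during training then answers the question above: does the held-out curve drop too, or only the train curve?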

These three properties are why the curriculum runs Phase 18 on grammar, not on Shakespeare or Wikipedia. You need to see the loop's bugs. A bigger corpus would hide them.

What "beats baseline" means here¶

The Phase-14 n-gram baseline is a token-level model with no notion of "verb" or "tense". It scores PPL on the val set by pure count statistics. On a corpus this regular, the n-gram baseline is actually fairly strong: bigrams catch to + walk → walked, trigrams catch person agreement (he + walk + s). A naive transformer with a bad loss reduction or a missing mask will lose to the n-gram baseline.

That's the point. The DoD says "beat baseline." If your loop is correct, you beat it by roughly 30% in perplexity (target val PPL roughly 0.7× the n-gram baseline). If your loop is incorrect, you don't beat it, and the gate refuses to advance until you find the bug. The baseline is a correctness oracle, not a vanity target.
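As a rough sketch of how such a gate could be checked: perplexity is the exponential of the mean per-token negative log-likelihood, and the gate compares the model's val PPL with the baseline's. The function names `perplexity` and `baseline_gate` and the example numbers below are assumptions for illustration, not the curriculum's actual harness or results.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood per token), using natural logs."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def baseline_gate(model_ppl, baseline_ppl, target_ratio=0.7):
    """Compare model PPL with baseline PPL; lower is better."""
    ratio = model_ppl / baseline_ppl
    return {
        "ratio": ratio,
        "beats_baseline": ratio < 1.0,          # the DoD condition
        "meets_target": ratio <= target_ratio,  # the rough 0.7x target
    }


# Sanity check: a model that is uniform over 4 tokens has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Hypothetical numbers, only to show the shape of the report.
report = baseline_gate(model_ppl=6.5, baseline_ppl=10.0)
print(round(uniform_ppl, 6), report)
```

Both perplexities have to be computed over the same val tokens with the same tokenization, otherwise the ratio compares two different quantities.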

What this phase asks you to internalize¶

Five non-negotiable habits going forward:

Every training run has a manifest. seed, versions, config, git_sha, data_manifest_hash. Committed before the run starts, not retrofitted.
Loss reduction is a choice and you write it down. "I reduce by mean over (B, L) after masking padding." If you can't say which, your loss isn't doing what you think.
Checkpoints are atomic + manifested. No half-written .safetensors files. No "I'll add the seed to the JSON next time."
Mixed precision is mathematics first, optimization second. You learn the fp16 dynamic range before you learn the loss-scaling trick.
Pickle never touches model weights. Safetensors only. The hygiene case is in theory/04-checkpoints-and-mlflow.md and security/supply-chain.md.
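The first and third habits can be sketched together: build the manifest up front, then write it so that a crash never leaves a half-written file. This is a minimal illustration using only the standard library. The field names follow the list above, but `build_manifest`, `atomic_write_json`, the config values, and the placeholder `git_sha` are assumptions, and the same write-to-temp-then-rename pattern would apply to weight files.

```python
import hashlib
import json
import os
import platform
import tempfile


def build_manifest(seed, config, git_sha, data_bytes):
    """Collect the run's identity before training starts."""
    return {
        "seed": seed,
        "versions": {"python": platform.python_version()},
        "config": config,
        "git_sha": git_sha,
        "data_manifest_hash": hashlib.sha256(data_bytes).hexdigest(),
    }


def atomic_write_json(path, obj):
    """Write to a temp file in the target directory, fsync, then rename.

    os.replace is atomic on POSIX when source and destination are on the
    same filesystem, so readers see the old file or the new one, never a
    partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f, sort_keys=True, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as run_dir:
    manifest = build_manifest(
        seed=1234,
        config={"lr": 1e-3, "batch_size": 8},  # hypothetical values
        git_sha="<output of git rev-parse HEAD>",  # placeholder, not a real SHA
        data_bytes=b"toy corpus bytes",
    )
    manifest_path = os.path.join(run_dir, "manifest.json")
    atomic_write_json(manifest_path, manifest)
    with open(manifest_path) as f:
        reloaded = json.load(f)
    leftovers = [name for name in os.listdir(run_dir) if name.endswith(".tmp")]

print(reloaded == manifest, leftovers)
```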

Internalize these once in Phase 18; every later phase that touches training assumes them.

The shape of the phase¶

Theory 01 is the longest: batching, padding, attention masks, loss-reduction conventions. Most training-loop bugs live in one of those four. Read it twice.
Theory 02 consolidates Phase 4's optimizer math into the implementation reference — what you'll re-derive from memory when you write loop.py.
Theory 03 is the mixed-precision preview. Short, dense, no actual mixed-precision training. Phase 26 lands the real thing.
Theory 04 is checkpoints + mlflow. It has the shortest practical content but is the most security-relevant: pickle is forbidden, and the reason is mechanical (loading a pickle can execute arbitrary code, which makes an untrusted checkpoint an RCE vector).
Lab 00 assembles batches from the verb corpus.
Lab 01 runs the first real training and beats baseline.
Lab 02 proves the checkpoint round-trip is bit-identical.
Lab 03 measures fp16 drift.
Lab 04 wires mlflow.
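The fp16 dynamic range that Theory 03 and Lab 03 revolve around can be previewed with nothing but the standard library, since `struct` supports the IEEE 754 half-precision format via the `e` format character. This is a small sketch, not the lab's method. The helper `to_fp16` and the scale factor 1024 are assumptions chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 binary16 and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


FP16_MAX = 65504.0               # largest finite fp16 value
FP16_MIN_SUBNORMAL = 2.0 ** -24  # smallest positive fp16 value, about 5.96e-8

# Overflow: struct refuses to pack a finite value that is too large for fp16.
try:
    struct.pack("<e", 70000.0)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True

# Underflow: a tiny gradient rounds to exactly zero.
tiny_grad = to_fp16(1e-8)

# Scaling the value up first keeps it representable (the idea behind loss scaling).
scaled_grad = to_fp16(1e-8 * 1024)

# Precision: near 1.0 the spacing is 2**-10, so a 1e-4 update is lost.
updated_weight = to_fp16(1.0 + 1e-4)

print(overflowed, tiny_grad, scaled_grad > 0.0, updated_weight)
```

The three failure modes shown here (overflow above 65504, underflow below about 6e-8, and lost small updates) are the mathematics to have in hand before the loss-scaling trick makes sense.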

Plan ~12–18 study hours over 6–8 sessions.

Stop here if¶

You're tempted to start coding before reading theory/01. The single most common Phase-18 bug — confusing per-token, per-sequence, and per-batch loss reduction — is fixed in 30 seconds if the math is fresh, and takes 6 hours of debugging if it isn't. Read first.

What this phase does NOT cover¶

Tuning. No grid searches, no Bayesian opt, no hyperparameter sweeps.
Real mixed-precision training. Phase 18 offers only a forward-only preview. The fp16 backward pass and loss scaling live in Phase 26.
Multi-GPU / distributed training. Phase 35.
Eval metrics beyond perplexity. Phase 20. Phase 18 uses PPL as the correctness oracle (vs n-gram baseline), not as a quality measure.
Sampling strategies (top-k, top-p, etc.). Phase 21.
PyTorch. Phase 24. Phase 18 uses pure NumPy + hand-built minitorch tensors.

Next: theory/01-batching-loss-mask.md.