English · Español
Lab 01 — First real training run; beat the n-gram baseline¶
Goal: train MiniGPT (Phase 17) on the Phase-12 verb-conjugation corpus until val perplexity on the held-out-4-verbs split beats the Phase-14 n-gram baseline.
Estimated time: 3–5 hours including a 60–90-minute training run on the i5-8250U.
Prereq: lab 00 done; Phase 14 baseline PPL committed in
experiments/14-ngram-baseline/results.json.
What you produce¶
src/minitrain/loop.py— the training step + epoch loop.scripts/train_mini.py— CLI entrypoint: takes a config YAML, runs training, writes checkpoints + manifest.experiments/18-train-mini/:manifest.json—{seed, versions, config, git_sha, data_manifest_hash, hardware}config.yaml— resolved hyperparameterstrain_log.jsonl— per-step metrics (loss, lr, grad_norm)loss_curve.png— train + val loss, with n-gram baseline as horizontal lineresults.json— final train PPL, val PPL, best-val step, comparison to baselinemodels/minigpt-phase18-<hash>.safetensors— final checkpoint (lab 02 verifies round-trip)
Background you must have read¶
theory/00-motivation.md— the five state machines.theory/02-optimizer-and-schedule.md— AdamW + warmup + cosine + clipping.
TODOs¶
Block A — write src/minitrain/loop.py¶
The canonical structure:
def train(
model, optimizer, scheduler, train_iter, val_iter,
total_steps, val_every, log_every,
rng_seed, manifest_path,
) -> dict:
"""Returns final metrics dict; writes log_file along the way."""
set_seed(rng_seed) # training-control RNG, NOT the data RNG
write_manifest_pre_train(manifest_path, ...)
for step in range(total_steps):
batch = next(train_iter)
logits, loss = model.forward(batch.input, batch.target, batch.attn_pad_mask, batch.loss_mask)
grads = model.backward()
g_norm = global_grad_norm(grads)
optimizer.step(model.params, grads)
scheduler.step()
if step % log_every == 0:
log({"step": step, "loss": loss, "lr": scheduler.lr, "g_norm": g_norm})
if step % val_every == 0:
val_loss, val_ppl = evaluate(model, val_iter)
log({"step": step, "val_loss": val_loss, "val_ppl": val_ppl})
return final_metrics
- Implement
evaluate(model, val_iter): run the whole val set in eval mode (no gradients), return per-token mean loss andexp(mean_loss)as PPL. - Implement
global_grad_norm(grads): returnssqrt(sum((g*g).sum() for g in grads)). - Insert clipping before
optimizer.step. Usetheory/02§"Gradient clipping" implementation. - Log every step to
train_log.jsonlas one JSON object per line.
Block B — write scripts/train_mini.py¶
A thin CLI:
python scripts/train_mini.py \
--config experiments/18-train-mini/config.yaml \
--out-dir experiments/18-train-mini/ \
--seed 42
- Load YAML config.
- Resolve hardware info (CPU model, RAM) for the manifest.
- Set seeds (Python, NumPy) and record the resolved RNG seed.
- Build tokenizer, corpus loader, batcher (lab 00), MiniGPT (Phase 17), AdamW (Phase 9), scheduler.
- Call
train(...). - Save final checkpoint via
src/minitrain/checkpoint.py(lab 02 implements it). - Save plot, results.json.
Block C — config¶
experiments/18-train-mini/config.yaml — starting defaults:
model:
n_layer: 2 # Phase 17 locked config
n_head: 4
d_model: 64
d_ff: 256
vocab_size: 64 # Phase 17 locked config
max_seq_len: 32 # verb sentences are short
optimizer:
type: adamw
lr_max: 3e-4
lr_min: 3e-5
beta1: 0.9
beta2: 0.95
eps: 1e-8
weight_decay: 0.1
clip: 1.0
scheduler:
warmup_steps: 100
total_steps: 2000
training:
batch_size: 8
val_every: 50
log_every: 10
loss_reduction: token_mean
rng_seed: 42
data_seed: 7
data:
train_path: data/processed/train.jsonl
val_path: data/processed/val.jsonl
heldout_policy: verbs_4
The numbers are starting points. Adjust if training is unstable, but document the change in experiments/18-train-mini/README.md.
Block D — run training¶
- Run
python scripts/train_mini.py .... Expected wall-clock on the i5-8250U: 60–90 minutes for 2000 steps. - Monitor
train_log.jsonlduring the run. The loss should drop from ~6 at step 0 to ~1.5–2.5 at step 2000. - If loss plateaus above 4.0, stop the run — something is wrong. Debug before scaling up.
- If loss goes NaN: stop, save the manifest, document in
experiments/18-train-mini/README.md, debug. (Phase 19 will systematize this with the dashboard.)
Block E — plot and report¶
- Generate
loss_curve.pngwith two lines (train, val), a horizontal dashed line at the Phase-14 n-gram baseline PPL. - Compute final metrics:
train_ppl,val_ppl,best_val_step,best_val_ppl,n_gram_baseline_val_ppl. - Write
results.json:
{
"train_ppl": 4.5,
"val_ppl": 9.0,
"best_val_step": 1850,
"best_val_ppl": 8.8,
"ngram_baseline_val_ppl": 13.0,
"beats_baseline": true,
"ppl_improvement_pct": 30.8
}
The DoD gate is val_ppl < ngram_baseline_val_ppl.
Block F — narrative¶
In experiments/18-train-mini/README.md, write 200-400 words covering:
- The headline PPL numbers vs baseline.
- The shape of the loss curve (smooth descent, plateaus, spikes).
- Any deviations from the default config and why.
- Train-vs-val gap at the end (which would suggest overfitting — explored more in Phase 19).
- Wall-clock time and peak RAM.
Constraints¶
- Pure NumPy + hand-built minitorch. No PyTorch.
- Train RNG seed and data RNG seed are distinct. A reload must restore both.
- No early stopping based on val PPL during this lab. Run the full 2000 steps so the loss-curve shape is fully visible. Early stopping comes in Phase 19/20.
Stop conditions¶
Done when:
experiments/18-train-mini/manifest.jsonis committed and complete.loss_curve.pngis committed; visually shows val PPL dropping below the baseline line.results.jsonhasbeats_baseline: true.- You can answer: "what would have made my val PPL worse — and why?" (At least two specific answers.)
Pitfalls¶
- Loss curve perfectly flat. Check: is the optimizer actually stepping? Add a
np.linalg.norm(params['layer0.W'])log every step; it should change. - Loss explodes after step ~100. Probably the warmup ended and the full LR is too large. Halve
lr_max. - Val PPL flat, train PPL drops. Mask leakage. Check that the loss mask is masking the target, not the input.
- Train PPL flat, val PPL drops. Probably your train and val splits are swapped. Verify lab 00's split test.
- RAM blows up. A pure-NumPy autograd holds the forward activations for the backward pass. With
n_layer=2, d_model=64, batch_size=8, seq_len=32you should peak under 4 GB. If above, reduce batch size.
When to consult solutions/¶
After results.json shows beats_baseline: true. Solution at solutions/01-train-mini-ref.md (written at phase open). The solution discusses what "good" loss curves look like for this specific corpus and the kinds of plateau patterns to expect.
Next lab: lab/02-checkpoint-roundtrip.md.