Skip to content

English · Español

Lab 01 — First real training run; beat the n-gram baseline

Goal: train MiniGPT (Phase 17) on the Phase-12 verb-conjugation corpus until val perplexity on the held-out-4-verbs split beats the Phase-14 n-gram baseline.

Estimated time: 3–5 hours including a 60–90-minute training run on the i5-8250U.

Prereq: lab 00 done; Phase 14 baseline PPL committed in experiments/14-ngram-baseline/results.json.


What you produce

  • src/minitrain/loop.py — the training step + epoch loop.
  • scripts/train_mini.py — CLI entrypoint: takes a config YAML, runs training, writes checkpoints + manifest.
  • experiments/18-train-mini/:
  • manifest.json{seed, versions, config, git_sha, data_manifest_hash, hardware}
  • config.yaml — resolved hyperparameters
  • train_log.jsonl — per-step metrics (loss, lr, grad_norm)
  • loss_curve.png — train + val loss, with n-gram baseline as horizontal line
  • results.json — final train PPL, val PPL, best-val step, comparison to baseline
  • models/minigpt-phase18-<hash>.safetensors — final checkpoint (lab 02 verifies round-trip)

Background you must have read

  • theory/00-motivation.md — the five state machines.
  • theory/02-optimizer-and-schedule.md — AdamW + warmup + cosine + clipping.

TODOs

Block A — write src/minitrain/loop.py

The canonical structure:

def train(
    model, optimizer, scheduler, train_iter, val_iter,
    total_steps, val_every, log_every,
    rng_seed, manifest_path,
) -> dict:
    """Returns final metrics dict; writes log_file along the way."""
    set_seed(rng_seed)  # training-control RNG, NOT the data RNG
    write_manifest_pre_train(manifest_path, ...)

    for step in range(total_steps):
        batch = next(train_iter)
        logits, loss = model.forward(batch.input, batch.target, batch.attn_pad_mask, batch.loss_mask)
        grads = model.backward()
        g_norm = global_grad_norm(grads)
        optimizer.step(model.params, grads)
        scheduler.step()

        if step % log_every == 0:
            log({"step": step, "loss": loss, "lr": scheduler.lr, "g_norm": g_norm})

        if step % val_every == 0:
            val_loss, val_ppl = evaluate(model, val_iter)
            log({"step": step, "val_loss": val_loss, "val_ppl": val_ppl})

    return final_metrics
  • Implement evaluate(model, val_iter): run the whole val set in eval mode (no gradients), return per-token mean loss and exp(mean_loss) as PPL.
  • Implement global_grad_norm(grads): returns sqrt(sum((g*g).sum() for g in grads)).
  • Insert clipping before optimizer.step. Use theory/02 §"Gradient clipping" implementation.
  • Log every step to train_log.jsonl as one JSON object per line.

Block B — write scripts/train_mini.py

A thin CLI:

python scripts/train_mini.py \
    --config experiments/18-train-mini/config.yaml \
    --out-dir experiments/18-train-mini/ \
    --seed 42
  • Load YAML config.
  • Resolve hardware info (CPU model, RAM) for the manifest.
  • Set seeds (Python, NumPy) and record the resolved RNG seed.
  • Build tokenizer, corpus loader, batcher (lab 00), MiniGPT (Phase 17), AdamW (Phase 9), scheduler.
  • Call train(...).
  • Save final checkpoint via src/minitrain/checkpoint.py (lab 02 implements it).
  • Save plot, results.json.

Block C — config

experiments/18-train-mini/config.yaml — starting defaults:

model:
  n_layer: 2       # Phase 17 locked config
  n_head: 4
  d_model: 64
  d_ff: 256
  vocab_size: 64   # Phase 17 locked config
  max_seq_len: 32  # verb sentences are short

optimizer:
  type: adamw
  lr_max: 3e-4
  lr_min: 3e-5
  beta1: 0.9
  beta2: 0.95
  eps: 1e-8
  weight_decay: 0.1
  clip: 1.0

scheduler:
  warmup_steps: 100
  total_steps: 2000

training:
  batch_size: 8
  val_every: 50
  log_every: 10
  loss_reduction: token_mean
  rng_seed: 42
  data_seed: 7

data:
  train_path: data/processed/train.jsonl
  val_path:   data/processed/val.jsonl
  heldout_policy: verbs_4

The numbers are starting points. Adjust if training is unstable, but document the change in experiments/18-train-mini/README.md.

Block D — run training

  • Run python scripts/train_mini.py .... Expected wall-clock on the i5-8250U: 60–90 minutes for 2000 steps.
  • Monitor train_log.jsonl during the run. The loss should drop from ~6 at step 0 to ~1.5–2.5 at step 2000.
  • If loss plateaus above 4.0, stop the run — something is wrong. Debug before scaling up.
  • If loss goes NaN: stop, save the manifest, document in experiments/18-train-mini/README.md, debug. (Phase 19 will systematize this with the dashboard.)

Block E — plot and report

  • Generate loss_curve.png with two lines (train, val), a horizontal dashed line at the Phase-14 n-gram baseline PPL.
  • Compute final metrics: train_ppl, val_ppl, best_val_step, best_val_ppl, n_gram_baseline_val_ppl.
  • Write results.json:
{
  "train_ppl": 4.5,
  "val_ppl": 9.0,
  "best_val_step": 1850,
  "best_val_ppl": 8.8,
  "ngram_baseline_val_ppl": 13.0,
  "beats_baseline": true,
  "ppl_improvement_pct": 30.8
}

The DoD gate is val_ppl < ngram_baseline_val_ppl.

Block F — narrative

In experiments/18-train-mini/README.md, write 200-400 words covering:

  • The headline PPL numbers vs baseline.
  • The shape of the loss curve (smooth descent, plateaus, spikes).
  • Any deviations from the default config and why.
  • Train-vs-val gap at the end (which would suggest overfitting — explored more in Phase 19).
  • Wall-clock time and peak RAM.

Constraints

  • Pure NumPy + hand-built minitorch. No PyTorch.
  • Train RNG seed and data RNG seed are distinct. A reload must restore both.
  • No early stopping based on val PPL during this lab. Run the full 2000 steps so the loss-curve shape is fully visible. Early stopping comes in Phase 19/20.

Stop conditions

Done when:

  1. experiments/18-train-mini/manifest.json is committed and complete.
  2. loss_curve.png is committed; visually shows val PPL dropping below the baseline line.
  3. results.json has beats_baseline: true.
  4. You can answer: "what would have made my val PPL worse — and why?" (At least two specific answers.)

Pitfalls

  • Loss curve perfectly flat. Check: is the optimizer actually stepping? Add a np.linalg.norm(params['layer0.W']) log every step; it should change.
  • Loss explodes after step ~100. Probably the warmup ended and the full LR is too large. Halve lr_max.
  • Val PPL flat, train PPL drops. Mask leakage. Check that the loss mask is masking the target, not the input.
  • Train PPL flat, val PPL drops. Probably your train and val splits are swapped. Verify lab 00's split test.
  • RAM blows up. A pure-NumPy autograd holds the forward activations for the backward pass. With n_layer=2, d_model=64, batch_size=8, seq_len=32 you should peak under 4 GB. If above, reduce batch size.

When to consult solutions/

After results.json shows beats_baseline: true. Solution at solutions/01-train-mini-ref.md (written at phase open). The solution discusses what "good" loss curves look like for this specific corpus and the kinds of plateau patterns to expect.


Next lab: lab/02-checkpoint-roundtrip.md.