English · Español

Lab 01 — First real training run; beat the n-gram baseline¶

Goal: train MiniGPT (Phase 17) on the Phase-12 verb-conjugation corpus until val perplexity on the held-out-4-verbs split beats the Phase-14 n-gram baseline.

Estimated time: 3–5 hours including a 60–90-minute training run on the i5-8250U.

Prereq: lab 00 done; Phase 14 baseline PPL committed in experiments/14-ngram-baseline/results.json.

What you produce¶

src/minitrain/loop.py — the training step + epoch loop.
scripts/train_mini.py — CLI entrypoint: takes a config YAML, runs training, writes checkpoints + manifest.
experiments/18-train-mini/:
manifest.json — {seed, versions, config, git_sha, data_manifest_hash, hardware}
config.yaml — resolved hyperparameters
train_log.jsonl — per-step metrics (loss, lr, grad_norm)
loss_curve.png — train + val loss, with n-gram baseline as horizontal line
results.json — final train PPL, val PPL, best-val step, comparison to baseline
models/minigpt-phase18-<hash>.safetensors — final checkpoint (lab 02 verifies round-trip)

Background you must have read¶

theory/00-motivation.md — the five state machines.
theory/02-optimizer-and-schedule.md — AdamW + warmup + cosine + clipping.

TODOs¶

Block A — write `src/minitrain/loop.py`¶

The canonical structure:

def train(
    model, optimizer, scheduler, train_iter, val_iter,
    total_steps, val_every, log_every,
    rng_seed, manifest_path,
) -> dict:
    """Returns final metrics dict; writes log_file along the way."""
    set_seed(rng_seed)  # training-control RNG, NOT the data RNG
    write_manifest_pre_train(manifest_path, ...)

    for step in range(total_steps):
        batch = next(train_iter)
        logits, loss = model.forward(batch.input, batch.target, batch.attn_pad_mask, batch.loss_mask)
        grads = model.backward()
        g_norm = global_grad_norm(grads)
        optimizer.step(model.params, grads)
        scheduler.step()

        if step % log_every == 0:
            log({"step": step, "loss": loss, "lr": scheduler.lr, "g_norm": g_norm})

        if step % val_every == 0:
            val_loss, val_ppl = evaluate(model, val_iter)
            log({"step": step, "val_loss": val_loss, "val_ppl": val_ppl})

    return final_metrics

Implement evaluate(model, val_iter): run the whole val set in eval mode (no gradients), return per-token mean loss and exp(mean_loss) as PPL.
Implement global_grad_norm(grads): returns sqrt(sum((g*g).sum() for g in grads)).
Insert clipping before optimizer.step. Use theory/02 §"Gradient clipping" implementation.
Log every step to train_log.jsonl as one JSON object per line.

Block B — write `scripts/train_mini.py`¶

A thin CLI:

python scripts/train_mini.py \
    --config experiments/18-train-mini/config.yaml \
    --out-dir experiments/18-train-mini/ \
    --seed 42

Load YAML config.
Resolve hardware info (CPU model, RAM) for the manifest.
Set seeds (Python, NumPy) and record the resolved RNG seed.
Build tokenizer, corpus loader, batcher (lab 00), MiniGPT (Phase 17), AdamW (Phase 9), scheduler.
Call train(...).
Save final checkpoint via src/minitrain/checkpoint.py (lab 02 implements it).
Save plot, results.json.

Block C — config¶

experiments/18-train-mini/config.yaml — starting defaults:

model:
  n_layer: 2       # Phase 17 locked config
  n_head: 4
  d_model: 64
  d_ff: 256
  vocab_size: 64   # Phase 17 locked config
  max_seq_len: 32  # verb sentences are short

optimizer:
  type: adamw
  lr_max: 3e-4
  lr_min: 3e-5
  beta1: 0.9
  beta2: 0.95
  eps: 1e-8
  weight_decay: 0.1
  clip: 1.0

scheduler:
  warmup_steps: 100
  total_steps: 2000

training:
  batch_size: 8
  val_every: 50
  log_every: 10
  loss_reduction: token_mean
  rng_seed: 42
  data_seed: 7

data:
  train_path: data/processed/train.jsonl
  val_path:   data/processed/val.jsonl
  heldout_policy: verbs_4

The numbers are starting points. Adjust if training is unstable, but document the change in experiments/18-train-mini/README.md.

Block D — run training¶

Run python scripts/train_mini.py .... Expected wall-clock on the i5-8250U: 60–90 minutes for 2000 steps.
Monitor train_log.jsonl during the run. The loss should drop from ~6 at step 0 to ~1.5–2.5 at step 2000.
If loss plateaus above 4.0, stop the run — something is wrong. Debug before scaling up.
If loss goes NaN: stop, save the manifest, document in experiments/18-train-mini/README.md, debug. (Phase 19 will systematize this with the dashboard.)

Block E — plot and report¶

Generate loss_curve.png with two lines (train, val), a horizontal dashed line at the Phase-14 n-gram baseline PPL.
Compute final metrics: train_ppl, val_ppl, best_val_step, best_val_ppl, n_gram_baseline_val_ppl.
Write results.json:

{
  "train_ppl": 4.5,
  "val_ppl": 9.0,
  "best_val_step": 1850,
  "best_val_ppl": 8.8,
  "ngram_baseline_val_ppl": 13.0,
  "beats_baseline": true,
  "ppl_improvement_pct": 30.8
}

The DoD gate is val_ppl < ngram_baseline_val_ppl.

Block F — narrative¶

In experiments/18-train-mini/README.md, write 200-400 words covering:

The headline PPL numbers vs baseline.
The shape of the loss curve (smooth descent, plateaus, spikes).
Any deviations from the default config and why.
Train-vs-val gap at the end (which would suggest overfitting — explored more in Phase 19).
Wall-clock time and peak RAM.

Constraints¶

Pure NumPy + hand-built minitorch. No PyTorch.
Train RNG seed and data RNG seed are distinct. A reload must restore both.
No early stopping based on val PPL during this lab. Run the full 2000 steps so the loss-curve shape is fully visible. Early stopping comes in Phase 19/20.

Stop conditions¶

Done when:

experiments/18-train-mini/manifest.json is committed and complete.
loss_curve.png is committed; visually shows val PPL dropping below the baseline line.
results.json has beats_baseline: true.
You can answer: "what would have made my val PPL worse — and why?" (At least two specific answers.)

Pitfalls¶

Loss curve perfectly flat. Check: is the optimizer actually stepping? Add a np.linalg.norm(params['layer0.W']) log every step; it should change.
Loss explodes after step ~100. Probably the warmup ended and the full LR is too large. Halve lr_max.
Val PPL flat, train PPL drops. Mask leakage. Check that the loss mask is masking the target, not the input.
Train PPL flat, val PPL drops. Probably your train and val splits are swapped. Verify lab 00's split test.
RAM blows up. A pure-NumPy autograd holds the forward activations for the backward pass. With n_layer=2, d_model=64, batch_size=8, seq_len=32 you should peak under 4 GB. If above, reduce batch size.

When to consult `solutions/`¶

After results.json shows beats_baseline: true. Solution at solutions/01-train-mini-ref.md (written at phase open). The solution discusses what "good" loss curves look like for this specific corpus and the kinds of plateau patterns to expect.

Next lab: lab/02-checkpoint-roundtrip.md.