Skip to content

English · Español

Lab 03 — Train past convergence; see the train/val gap open

Goal: extend training past convergence on the tiny corpus and observe the train/val gap in the dashboard. Overfitting is a feature here, not a bug.

Estimated time: 1-2 hours (1 long training run + analysis).

Prereq: Lab 02 committed (dashboard validated against engineered failures).


What you produce

A directory experiments/19-overfit/ containing:

  • train.py — modified Phase-18 driver with T = 4 × Phase_18_T, no early-stopping.
  • config.yaml — same as healthy, except T is much larger.
  • inspector.log.jsonl.
  • dashboard.html — long-running dashboard showing overfit progression.
  • gap-analysis.md — your written analysis of where and how overfitting manifests.
  • manifest.json.

TODOs

Block A — set up the long training run

In experiments/19-overfit/config.yaml:

training:
  batch_size: 32
  T: 8000              # 4× Phase 18's 2000
  warmup: 100
  lr_max: 3e-3
  lr_min: 3e-4
  weight_decay: 0.1
  grad_clip: 1.0
  betas: [0.9, 0.95]
  eps: 1.0e-8
  eval_every: 50
  ckpt_every: 500

The schedule still cosine-decays over the full T=8000; LR at step 2000 is no longer the minimum (it was for Phase 18), it's the cosine-midpoint instead. That's part of the experiment.

Block B — run training and produce the dashboard

just run-overfit

Wall-clock expectation: ~2-3 hours on Borja's i5-8250U. Use a session manager (tmux, screen) so it survives terminal close.

After training, render dashboard.html with the standard dashboard.render(...).

Block C — find the overfit onset

Open dashboard.html. In Panel 1 (loss curves):

  1. Where does val loss bottom out? Record t_val_min and val_loss(t_val_min).
  2. Where does train loss bottom out? Record t_train_min and train_loss(t_train_min).
  3. The overfit onset \(t^*\) is approximately t_val_min. After \(t^*\), val loss climbs (or plateaus while train keeps falling).

Verify with the gap_t computed by the dashboard's overfit-onset detector (from theory/02's formula). It should match your visual reading within ±100 steps.

Block D — write gap-analysis.md

In experiments/19-overfit/gap-analysis.md, 3-5 paragraphs:

  1. The numbers. State t_val_min, val_loss(t_val_min), train_loss(t_train_min), the final gap at T=8000.

  2. What the dashboard tells you. Beyond Panel 1, what changes? Do dead-neuron counts increase (overfitting often "forgets" rare patterns, killing the heads that handle them)? Do per-layer activation magnitudes drift?

  3. Implications for Phase 20. When the eval harness arrives (next phase), the "best checkpoint" is going to be at t_val_min, not at T. Phase 18 saved checkpoints every 500 steps — confirm experiments/19-overfit/ has a checkpoint at or near t_val_min available for Phase 20's eval comparisons.

  4. Why this is healthy. Overfitting on a tiny corpus is a probe, not a failure mode. State, in one sentence, what the train/val gap reveals about model capacity.

  5. What would prevent overfitting? Three options: more data (Phase 12 expansion), more regularization (weight decay, dropout), or earlier stopping (which we'd configure with the Phase-20 eval). Don't implement any of them; just state which you'd reach for first and why.

Constraints

  • No early stopping. The whole point is to train through and past the overfit onset.
  • No regularization tuning. Same weight_decay as Phase 18.
  • Single config. Don't sweep over T. One number, one experiment.

Stop conditions

Done when:

  1. experiments/19-overfit/dashboard.html shows train loss continuing to drop while val loss has bottomed out or risen.
  2. gap-analysis.md answers all five paragraphs above.
  3. The best-val checkpoint (at or near t_val_min) is preserved (not overwritten by later checkpoints).

Pitfalls

  • The dashboard sometimes shows val loss going down very slowly past \(t^*\). That's not a contradiction — slower than train means a widening gap. Use the gap detector, not eyeball.
  • Forgetting to extend the schedule. If you keep T = 2000 in the schedule but run for 8000 steps, the schedule clamps at lr_min for the last 6000 steps. The experiment is still valid but the dashboard's LR panel will look weird. Either way, document what you did.
  • Memory pressure on a long run. The Inspector's log file grows linearly with steps. For T=8000, expect ~50-100 MB of JSONL. Acceptable on Borja's 62 GiB box.

When to consult solutions/

After gap-analysis.md is committed. The solution at solutions/03-overfit-on-purpose-ref.md (written at phase open) compares your gap numbers to the reference and discusses what Phase 20 will do with this data.


End of Phase 19 labs. Next phase: docs/phase-20-evaluation-harness/.