English · Español
Lab 02 — LoRA Fine-Tune on Irregular Verbs¶
🇪🇸 Esta es la práctica central de la fase. Especializar MiniGPT en conjugar 8 verbos irregulares (
be, have, do, go, come, see, eat, write) sin romper los 12 regulares. Tres entregables: (i) gana ≥ 10pp en accuracy de conjugación irregular; (ii) PPL de control sobre verbos regulares no crece > 5%; (iii) barrido de rangor ∈ {2,4,8,16,32}mostrando codo.
Anchors: theory/01-sft-and-forgetting.md, src/minituner/BLUEPRINT.md, lab 00 (LoRALinear working).
What you produce¶
Three experiments + analysis:
experiments/28-lora-finetune/— single training run atr=8, full DoD evaluation.experiments/28-lora-ablate-rank/— rank sweep overr ∈ {2, 4, 8, 16, 32}.- A REPORT.md summarizing both, with plots.
Prereq¶
- Lab 00 complete:
src/minituner/lora.pyworking,tests/test_minituner.pygreen. src/minituner/train.py(the training loop) implemented per BLUEPRINT.src/minituner/data.pybuilds the irregular-verb training split and the regular-verb control split.- A trained MiniGPT checkpoint from Phase 17 available at
experiments/17-minigpt-train/best.pt.
TODOs (sketch)¶
Block A — Data preparation¶
- From
data/processed/verb-corpus/(Phase 12), filter for sentences containing one of the 8 irregular verbs in past simple or past participle. Target ~500 examples. - Split into train (80%) / val (10%) / test (10%).
- Construct training pairs in the format defined by Phase 13's tokenizer:
- Prompt: a fill-in-blank sentence, e.g.,
He __ to school yesterday. - Target: the correct conjugation, e.g.,
went(plus Spanish pairfueper §A2 default). - Construct a parallel regular-verb control split filtered on the 12 regular verbs. 50-100 examples is enough.
- Verify no overlap between train irregular and control regular (they share grammatical structure but disjoint vocabulary).
Block B — Training run at r=8¶
- Load base MiniGPT checkpoint.
wrap_minigpt_with_lora(model, r=8, alpha=16.0). Confirm trainable param ratio matches lab 01's calculation.fine_tune_lora(model, train_loader, val_loader, LoRATrainConfig(...), out_dir=experiments/28-lora-finetune/).- During training, log per-step: train loss, val loss every 50 steps, plus regular-verb control PPL every 100 steps.
- Train for 3 epochs (lab 01's numbers say this completes in <30min on CPU).
Block C — Evaluation¶
The metrics:
- Task accuracy on the held-out irregular-verb test split. For each prompt-target pair, check that the model assigns higher probability to the correct conjugation than to the wrong-regular form (
wentovergoed). Report top-1 accuracy on the 50-ish held-out test items. - Control PPL drift:
drift = (PPL_after - PPL_before) / PPL_beforeon the regular-verb control split. DoD: ≤ 5%. - Base weights unchanged: load the pre-training base weights from
experiments/17-minigpt-train/best.pt. Compare to base weights in the fine-tuned model (extract viamodel.base.weightof eachLoRALinear). Assertmax_abs_diff == 0.0exactly. - Adapter checkpoint size:
experiments/28-lora-finetune/adapter_final.ptshould be < 1% of the base checkpoint size.
Block D — Rank sweep¶
- For each
r ∈ {2, 4, 8, 16, 32}: rerun Block B with that rank, same seed, same training config. - Plot task accuracy vs
r(x-axis log-scale). Plot control PPL drift vsr. Save toexperiments/28-lora-ablate-rank/accuracy_vs_rank.pngand..._drift_vs_rank.png. - Identify the elbow: the smallest
rwhere accuracy plateaus. Report it in REPORT.md.
Block E — Catastrophic-adoption check (optional)¶
Sample 20 sentences with regular verbs from outside both splits. Check that the fine-tuned model doesn't invent irregular forms (walken, playen). Report any failures.
Constraints¶
- Seed every run via
seed_everything. experiments/<exp>/manifest.jsonmandatory — includes versions, seed, config, base-model checkpoint hash.- Single LR (1e-4) across all runs. Don't sweep LR — Phase 28 isn't about LR.
- Batch size 16, 3 epochs. Adjust only if you can't fit in CPU memory.
- CPU is fine. Phase 28 doesn't require GPU.
- No use of Phase 17's training loop directly. Use
src/minituner/train.py— it's a different setup (frozen base, smaller param count).
Stop conditions¶
You're done when:
- Task accuracy improvement ≥ 10pp (DoD §6).
- Control PPL drift ≤ 5% (DoD §6).
- Base weight diff == 0.0 (asserts in eval script).
- Adapter checkpoint < 1% base size (DoD §6).
- Rank sweep plot committed; elbow location reported.
- REPORT.md filled with: tables of (a) per-rank accuracy + drift, (b) before/after task accuracy at r=8, © before/after sample predictions on 5 example prompts; one paragraph interpreting the rank sweep.
Pitfalls (specific to this lab)¶
- The corpus is small (~500 irregular examples). Overfitting is real. Use dropout in the LoRA path (default 0.05 per BLUEPRINT).
- Both correct and incorrect target forms in training data. The corpus may contain pairs
(He goed to school, He went to school)for correction-prompting. Be explicit about which side is target — if you accidentally train on the wrong target, the model regresses. - Spanish-pair leak into the accuracy metric. If the target string is
went / fueand the eval just checks "iswentthe next token", you're fine. But if it checks the whole string, factor in the slash/space tokenization. - Regular-verb control PPL measured on different sentences than training. Verify your control split's grammatical complexity matches the irregular train split — otherwise PPL drift may reflect difficulty differences, not forgetting.
- Forgetting the alpha-scale ablation. At very low rank (
r=2),α/r = 16/2 = 8is large; the update magnitude is bigger per param than atr=32. Consider whether you want to varyαwithror holdα=16fixed (the BLUEPRINT default and standard convention). Don't change without thinking. requires_gradnot actually frozen. If after training, base weights have drifted (Block C test 3 fails), the freezing logic inwrap_minigpt_with_lorais buggy. Go back to lab 00.
When to consult solutions¶
After completing Blocks A-D and writing REPORT.md. Compare your interpretation of the rank sweep elbow to the reference solution's expected location (around r=8 for our MiniGPT, per theory 02).
Estimated time¶
5-8 hours of working time (excluding training wall-clock, which is ~30min × 5 rank settings ≈ 2.5h CPU).
Next: lab/03-qlora-preview.md.