Lab 03 — QLoRA Preview¶
Combine what you have learned: quantize MiniGPT to NF4 with src/miniquant/, put LoRA on top with src/minituner/, and train for one epoch. The goal is NOT to gain accuracy; it is to verify that the pattern composes correctly and to measure the memory savings (modest on MiniGPT, dramatic at 7B+ scale).
Anchors: src/miniquant/BLUEPRINT.md (Phase 26), src/minituner/BLUEPRINT.md, theory/03-memory-footprint.md.
What you produce¶
A single experiment directory, experiments/28-qlora-preview/, containing:
- A training script that composes NF4 quantization + LoRA on MiniGPT.
- A memory measurement table comparing fp16-base LoRA vs NF4-base QLoRA.
- A correctness check: outputs of QLoRA agree with fp16-base LoRA (same trained adapter) within NF4 quantization error.
Prereq¶
- Lab 02 complete: LoRA r=8 training run with experiments/28-lora-finetune/adapter_final.pt saved.
- src/miniquant/quantize.py working: quantize_symmetric_per_group and QuantizedLinear (from Phase 26).
- src/minituner/qlora.py implemented per BLUEPRINT.
TODOs (sketch)¶
Block A — Composing NF4 base + LoRA¶
- Load base MiniGPT (fp16).
- Apply NF4 quantization in-place via src/miniquant/: each nn.Linear becomes a QuantizedLinear with frozen NF4 weights + per-group scales.
- Apply LoRA on top via src/minituner/qlora.py::wrap_minigpt_qlora(model, r=8, alpha=16.0, quant_scheme="nf4_per_group_64").
- Confirm: model.layer.attn.q_proj is now a QuantizedLinear wrapped in a LoRALinear. The forward path is lora_branch(x) + qlinear_dequant(x).
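The composition order in Block A can be sketched as follows. This is a minimal illustration assuming PyTorch is installed, not the code in src/minituner/qlora.py: FrozenBase is a hypothetical stand-in for QuantizedLinear (it keeps an ordinary float weight in a buffer instead of packed NF4 codes), and this LoRALinear with its attribute names lora_A and lora_B is an assumption about how the wrapper might look.

```python
import torch
import torch.nn as nn


class FrozenBase(nn.Module):
    """Hypothetical stand-in for QuantizedLinear: frozen weight, no grads."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # A buffer is saved with the module but is not a trainable parameter.
        self.register_buffer("weight", torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A real QuantizedLinear would dequantize its NF4 codes here.
        return x @ self.weight.t()


class LoRALinear(nn.Module):
    """Assumed wrapper: y = base(x) + (alpha / r) * (x A^T) B^T."""

    def __init__(self, base: nn.Module, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.scale = alpha / r
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        # B starts at zero, so the wrapped layer initially matches the base.
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lora_branch = (x @ self.lora_A.t()) @ self.lora_B.t()
        return self.base(x) + self.scale * lora_branch


torch.manual_seed(0)
base = FrozenBase(16, 32)         # step 1: quantize (stand-in only)
layer = LoRALinear(base, 16, 32)  # step 2: wrap the frozen base with LoRA
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
```

In this sketch the trainable list contains only the two LoRA matrices, which is the property the "Confirm" step is meant to establish for the real modules.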
Block B — Output correctness check¶
- Load the trained adapter from lab 02 (experiments/28-lora-finetune/adapter_final.pt).
- Reconstruct two models:
  - Model A: fp16 base + LoRA adapter.
  - Model B: NF4 base + LoRA adapter (same adapter weights).
- Run a fixed batch of test prompts through both. Measure mean_abs_diff of output logits.
- Expected: diff is small but non-zero (NF4 quantization adds ~1-3% error on logit magnitudes). Document the actual value.
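One way to compute the metric from Block B is sketched below on plain Python lists, so it runs without any dependencies. The helper relative_diff is an addition for relating the diff to the "~1-3%" expectation; it is hypothetical and not part of the lab code, and the logits are made-up numbers rather than measurements.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of |a_i - b_i| over two equal-length flat sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def relative_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """mean_abs_diff normalised by the mean magnitude of the reference a."""
    ref = sum(abs(x) for x in a) / len(a)
    if ref == 0.0:
        raise ValueError("reference logits are all zero")
    return mean_abs_diff(a, b) / ref


# Made-up logits for illustration only; not measured values.
logits_fp16 = [2.0, -1.0, 0.5, 4.0]
logits_nf4 = [2.1, -1.0, 0.4, 3.9]
diff = mean_abs_diff(logits_fp16, logits_nf4)
rel = relative_diff(logits_fp16, logits_nf4)
```

With torch tensors the same quantity is `(logits_a - logits_b).abs().mean()`; compute it under `torch.no_grad()` with both models in eval mode so dropout does not add noise to the comparison.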
Block C — One epoch of QLoRA training¶
- From the lab 02 starting checkpoint, run one additional epoch of QLoRA training.
- The point isn't accuracy — it's that the training loop runs without NaN/Inf and the loss decreases.
- Log: per-step train loss + max NaN/Inf check.
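The logging requirement in Block C could be met with something like the sketch below. The function names log_step and loss_decreased are hypothetical, and comparing the first and last few steps is only one assumed way to decide that "the loss decreases"; the loss values are made up.

```python
import math
from typing import List


def log_step(history: List[float], step: int, loss: float) -> None:
    """Record one step's loss; fail fast if it is NaN or +/-Inf."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
    history.append(loss)


def loss_decreased(history: List[float], window: int = 2) -> bool:
    """Compare the mean of the first and last `window` logged losses."""
    if len(history) < 2 * window:
        raise ValueError("not enough steps logged")
    first = sum(history[:window]) / window
    last = sum(history[-window:]) / window
    return last < first


# Made-up loss values for illustration only.
history: List[float] = []
for step, loss in enumerate([2.9, 2.7, 2.6, 2.4]):
    log_step(history, step, loss)
```

Raising on the first non-finite loss keeps the failing step number in the traceback, which is easier to debug than a loss curve that turns into NaN partway through the epoch.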
Block D — Memory measurement¶
For both regimes (fp16-base LoRA, NF4-base QLoRA):
- Use psutil.Process().memory_info().rss for CPU. (Or torch.cuda.max_memory_allocated() if you're on GPU; Phase 28 doesn't require a GPU.)
- Measure right after model construction (weights only).
- Measure mid-training (weights + grads + Adam state, all allocated).
- Report the delta as the "training footprint".
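The bookkeeping for Block D is simple arithmetic on two readings per regime. The sketch below takes the readings as plain byte counts so it has no dependencies; in the lab they would come from psutil.Process().memory_info().rss at the two measurement points. The function names, row layout, and byte counts are placeholders, not measurements.

```python
from typing import Dict

MIB = 1024 ** 2


def training_footprint_mib(rss_after_build: int, rss_mid_training: int) -> float:
    """Delta between the two RSS readings (in bytes), expressed in MiB."""
    return (rss_mid_training - rss_after_build) / MIB


def footprint_row(name: str, rss_after_build: int,
                  rss_mid_training: int) -> Dict[str, object]:
    """One row of the comparison table for a single regime."""
    return {
        "regime": name,
        "weights_only_mib": rss_after_build / MIB,
        "mid_training_mib": rss_mid_training / MIB,
        "training_footprint_mib": training_footprint_mib(
            rss_after_build, rss_mid_training
        ),
    }


# Placeholder byte counts, NOT measurements.
row = footprint_row("fp16-base LoRA", 300 * MIB, 420 * MIB)
```

Run each regime in a fresh process so that memory retained from the first run does not inflate the second regime's baseline reading.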
Block E — Adapter swap demo (optional)¶
- Save the QLoRA-trained adapter.
- Load the lab 02 fp16-trained adapter into the same NF4-base model.
- Confirm both forward passes work; outputs differ. (Demonstrates the adapter-swap pattern that QLoRA enables at scale.)
Constraints¶
- NF4 quantization happens before LoRA wrapping. Reverse order breaks the LoRA branch's gradient flow (LoRA expects to be applied to a Linear-like module, not raw quantized weights).
- LoRA params stay fp16. Don't quantize them. Their tiny size means quantization buys nothing and complicates gradients.
- No new training data. Use lab 02's splits.
- Manifest mandatory. As always: seed_everything, manifest.json with versions + seed + base hash + adapter hash.
- Single seed. Don't sweep; this is a preview only.
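A manifest with the fields listed above might be assembled as in the following sketch. The field names and the helpers sha256_file, build_manifest, and write_manifest are assumptions for illustration; the course's own manifest schema and its seed_everything helper are not reproduced here.

```python
import hashlib
import json
import platform
from pathlib import Path
from typing import Dict


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large checkpoints fit."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(seed: int, base_ckpt: Path, adapter_ckpt: Path,
                   versions: Dict[str, str]) -> Dict[str, object]:
    """Assumed manifest layout; the field names are illustrative."""
    return {
        "seed": seed,
        "versions": {"python": platform.python_version(), **versions},
        "base_sha256": sha256_file(base_ckpt),
        "adapter_sha256": sha256_file(adapter_ckpt),
    }


def write_manifest(manifest: Dict[str, object], out_dir: Path) -> Path:
    """Write manifest.json with sorted keys so diffs stay stable."""
    out_path = out_dir / "manifest.json"
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out_path
```

Hashing the base checkpoint before quantization and the adapter file as saved by lab 02 makes it possible to confirm later that Model A and Model B really shared the same inputs.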
Stop conditions¶
You're done when:
- Model B (NF4 base + adapter) produces outputs within tolerated NF4 quantization error of Model A (fp16 base + adapter). Document the tolerance and the measured diff.
- One epoch of QLoRA training completes without NaN/Inf.
- Memory measurements committed: a 2-row table (fp16-base LoRA vs NF4-base QLoRA) with weights / grads / Adam / total. Ratio computed.
- REPORT.md notes: "absolute savings at MiniGPT scale are X MiB; at 7B scale the same recipe would save Y GiB" with the Y derived from theory 03 numbers.
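The 7B extrapolation for REPORT.md is back-of-envelope arithmetic on the base weights. The sketch below is one way to set it up under stated assumptions: exactly 7e9 parameters, all of them quantized, a group size of 64 (matching nf4_per_group_64), and one fp16 scale per group. These inputs are assumptions, not the theory 03 numbers, so substitute the values from theory/03-memory-footprint.md before reporting Y.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_param: float) -> float:
    """Storage for the base weights at a given average bit width."""
    return n_params * bits_per_param / 8


def nf4_bits_per_param(group_size: int = 64, scale_bits: int = 16) -> float:
    """4-bit code plus one scale of `scale_bits` shared by each group."""
    return 4 + scale_bits / group_size


n_params = 7e9  # assumed "7B" parameter count
fp16_gib = weight_bytes(n_params, 16) / GIB
nf4_gib = weight_bytes(n_params, nf4_bits_per_param()) / GIB
saved_gib = fp16_gib - nf4_gib  # base-weight savings only
```

The saving covers base weights only: LoRA parameters, their gradients, and their Adam state are the same size in both regimes, so they cancel out of the difference.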
Pitfalls (specific to this lab)¶
- NaN in NF4 dequant. If QuantizedLinear.forward dequantizes to a tensor with NaN, the LoRA gradient explodes. Add a torch.isfinite assertion in the dequant path and another in the LoRA path. Phase 26 should have already debugged this; re-verify.
- Loading the adapter into a QuantizedLinear-wrapped model. Names will differ (q_proj.base.weight vs q_proj.qweight). The load_lora_state_dict from src/minituner/lora.py should only target LoRA params, but verify.
- Memory measurement under PyTorch lazy allocation. RSS measurements before the first backward pass under-report, because the allocator only commits memory when needed. Measure after at least 2 training steps.
- requires_grad=True on the dequantized intermediate. Some QuantizedLinear implementations create a fresh fp16 dequant tensor in forward. Make sure it's not added to the parameter set; it's a transient. The trainable set should still be LoRA only.
- Mixed precision interactions. If MiniGPT's forward path uses some mix of fp32/fp16, the NF4 dequant intermediate must match the precision of the LoRA forward output. Mismatch → silent type promotion → memory bug.
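The adapter-loading pitfall can be checked on key names alone, before any weights are copied. The sketch below assumes LoRA parameters carry "lora_" somewhere in their state-dict key; that marker, the example key names, and both helper functions are hypothetical and should be adjusted to whatever src/minituner/lora.py actually uses.

```python
from typing import Iterable, List, Tuple


def split_lora_keys(keys: Iterable[str],
                    marker: str = "lora_") -> Tuple[List[str], List[str]]:
    """Partition state-dict keys into (LoRA keys, everything else)."""
    lora: List[str] = []
    other: List[str] = []
    for key in keys:
        (lora if marker in key else other).append(key)
    return lora, other


def check_adapter_matches(adapter_keys: Iterable[str],
                          model_keys: Iterable[str]) -> None:
    """Raise if the adapter has non-LoRA keys or keys the model lacks."""
    lora, other = split_lora_keys(adapter_keys)
    if other:
        raise KeyError(f"adapter contains non-LoRA keys: {other}")
    missing = sorted(set(lora) - set(model_keys))
    if missing:
        raise KeyError(f"model has no parameter for: {missing}")


# Hypothetical key names for illustration.
adapter_keys = ["attn.q_proj.lora_A", "attn.q_proj.lora_B"]
nf4_model_keys = ["attn.q_proj.qweight", "attn.q_proj.scales",
                  "attn.q_proj.lora_A", "attn.q_proj.lora_B"]
check_adapter_matches(adapter_keys, nf4_model_keys)  # passes silently
```

The same partition applied to the names of parameters with requires_grad=True gives a quick test that the trainable set is LoRA only: the "everything else" list should come back empty.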
When to consult solutions¶
After QLoRA training succeeds for one epoch and the memory table is filled. Compare your Y-at-7B-scale calculation to the reference solution's.
Estimated time¶
3-5 hours.
Phase 28 ends here. Write PHASE_28_REPORT.md and learners/borja/phase-28/reflections.md. Open Phase 29 only after explicit approval.