Lab 03: QLoRA Preview

Combine what you have learned: quantize MiniGPT to NF4 with src/miniquant/, put LoRA on top with src/minituner/, and train for one epoch. The goal is NOT to gain accuracy. It is to verify that the pattern composes correctly and to measure the memory savings (modest on MiniGPT, dramatic at 7B+ scale).

Anchors: src/miniquant/BLUEPRINT.md (Phase 26), src/minituner/BLUEPRINT.md, theory/03-memory-footprint.md.

What you produce

A single experiment directory, experiments/28-qlora-preview/, containing:

A training script that composes NF4 quantization + LoRA on MiniGPT.
A memory measurement table comparing fp16-base LoRA vs NF4-base QLoRA.
A correctness check: outputs of QLoRA agree with fp16-base LoRA (same trained adapter) within NF4 quantization error.

Prereq

Lab 02 complete: LoRA r=8 training run with experiments/28-lora-finetune/adapter_final.pt saved.
src/miniquant/quantize.py working: quantize_symmetric_per_group and QuantizedLinear (from Phase 26).
src/minituner/qlora.py implemented per BLUEPRINT.

TODOs (sketch)

Block A: Composing NF4 base + LoRA

Load base MiniGPT (fp16).
Apply NF4 quantization in-place via src/miniquant/: each nn.Linear becomes a QuantizedLinear with frozen NF4 weights + per-group scales.
Apply LoRA on top via src/minituner/qlora.py::wrap_minigpt_qlora(model, r=8, alpha=16.0, quant_scheme="nf4_per_group_64").
Confirm: model.layer.attn.q_proj is now a QuantizedLinear wrapped in a LoRALinear. The forward path is lora_branch(x) + qlinear_dequant(x).
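The composition order in this block can be illustrated with a toy stand-in. The sketch below does not use the lab's real classes: ToyQuantLinear and ToyLoRALinear are hypothetical names, the quantizer is plain absmax rounding to 15 integer levels rather than the NF4 codebook, and everything runs in fp32 for simplicity. It only shows the shape of the forward path (frozen dequantized base plus trainable low-rank branch) and that the trainable set is LoRA only.

```python
import torch
import torch.nn as nn


class ToyQuantLinear(nn.Module):
    """Frozen linear layer stored as integer codes plus per-group scales.

    Hypothetical stand-in for QuantizedLinear. Uses uniform absmax
    rounding, NOT the NF4 codebook, and ignores any bias.
    """

    def __init__(self, linear: nn.Linear, group_size: int = 64):
        super().__init__()
        w = linear.weight.detach().float()
        self.shape = tuple(w.shape)
        groups = w.reshape(-1, group_size)  # numel must divide by group_size
        scales = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
        codes = torch.round(groups / scales).clamp(-7, 7).to(torch.int8)
        # Buffers, not Parameters: they never appear in parameters().
        self.register_buffer("codes", codes)
        self.register_buffer("scales", scales)

    def dequant(self) -> torch.Tensor:
        # Transient tensor rebuilt on every call; it carries no gradient.
        return (self.codes.float() * self.scales).reshape(self.shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.dequant().t()


class ToyLoRALinear(nn.Module):
    """Wraps a frozen Linear-like module and adds a low-rank branch."""

    def __init__(self, base: nn.Module, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lora = (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling
        return self.base(x) + lora


torch.manual_seed(0)
fp_linear = nn.Linear(64, 32, bias=False)
# Order matters: quantize first, then wrap the quantized module with LoRA.
layer = ToyLoRALinear(ToyQuantLinear(fp_linear), 64, 32, r=8, alpha=16.0)

x = torch.randn(4, 64)
y = layer(x)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
```

Because lora_B starts at zero in this toy, the wrapped layer initially reproduces the quantized base exactly, which is a convenient sanity check before any adapter is loaded.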

Block B: Output correctness check

Load the trained adapter from lab 02 (experiments/28-lora-finetune/adapter_final.pt).
Reconstruct two models:
Model A: fp16 base + LoRA adapter.
Model B: NF4 base + LoRA adapter (same adapter weights).
Run a fixed batch of test prompts through both. Measure mean_abs_diff of output logits.
Expected: diff is small but non-zero (NF4 quantization adds ~1-3% error on logit magnitudes). Document the actual value.
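One way to compute the comparison is sketched below. The function name logit_diff is hypothetical, and reading the "1-3% error on logit magnitudes" as mean_abs_diff divided by the mean absolute logit of Model A is an assumption, not a definition taken from the lab. The tensors here are random placeholders standing in for real model outputs.

```python
import torch


def logit_diff(logits_a: torch.Tensor, logits_b: torch.Tensor) -> dict:
    """Compare two logit tensors of identical shape.

    Casts to fp32 first so the statistics are not themselves
    limited by fp16 rounding.
    """
    a = logits_a.float()
    b = logits_b.float()
    abs_diff = (a - b).abs()
    mean_abs_diff = abs_diff.mean().item()
    mean_abs_mag = a.abs().mean().item()
    return {
        "mean_abs_diff": mean_abs_diff,
        "max_abs_diff": abs_diff.max().item(),
        # Relative error against Model A's typical logit magnitude.
        "relative": mean_abs_diff / max(mean_abs_mag, 1e-12),
    }


# Placeholder data: Model B is simulated as Model A plus small noise.
torch.manual_seed(0)
logits_a = torch.randn(2, 5, 100)
logits_b = logits_a + 0.02 * torch.randn_like(logits_a)
report = logit_diff(logits_a, logits_b)
```

Recording max_abs_diff alongside the mean helps spot a single badly quantized group that the mean would hide.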

Block C: One epoch of QLoRA training

From the lab 02 starting checkpoint, run one additional epoch of QLoRA training.
The point isn't accuracy — it's that the training loop runs without NaN/Inf and the loss decreases.
Log: the per-step train loss plus a per-step NaN/Inf check on it.
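A minimal sketch of the per-step logging, assuming the loss is available as a Python float (for example via loss.item() in a PyTorch loop). The helper name log_step and the record layout are hypothetical, and the loss values are made-up placeholders, not results.

```python
import math


def log_step(history: list, step: int, loss: float) -> list:
    """Append one log record and fail fast on the first non-finite loss."""
    finite = math.isfinite(loss)
    history.append({"step": step, "train_loss": loss, "finite": finite})
    if not finite:
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
    return history


history = []
# Placeholder losses; in the real loop this would be loss.item() per step.
for step, loss in enumerate([2.31, 2.10, 1.95]):
    log_step(history, step, loss)

loss_decreased = history[-1]["train_loss"] < history[0]["train_loss"]
```

Raising immediately, rather than only recording the flag, keeps the failing step number in the traceback and avoids training on after the run is already invalid.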

Block D: Memory measurement

For both regimes (fp16-base LoRA, NF4-base QLoRA):

Use psutil.Process().memory_info().rss for CPU. (Or torch.cuda.max_memory_allocated() if you're on GPU — Phase 28 doesn't require GPU.)
Measure right after model construction (weights only).
Measure mid-training (weights + grads + Adam state, all allocated).
Report the delta as the "training footprint".
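The two measurement points and the delta can be organized as follows. This is a sketch under assumptions: rss_bytes and footprint_row are hypothetical helper names, psutil is imported lazily so the table logic can be used without it, and the byte counts in the example are placeholders rather than measurements.

```python
def rss_bytes() -> int:
    """Current process resident set size in bytes (CPU path)."""
    import psutil  # third-party; imported lazily
    return psutil.Process().memory_info().rss


def footprint_row(regime: str, after_build: int, mid_training: int) -> dict:
    """Turn two RSS readings (in bytes) into one table row in MiB."""
    mib = 1024 ** 2
    return {
        "regime": regime,
        "after_build_mib": after_build / mib,
        "mid_training_mib": mid_training / mib,
        # The delta is what the lab calls the "training footprint".
        "training_footprint_mib": (mid_training - after_build) / mib,
    }


# Intended call pattern (not executed here):
#   after_build = rss_bytes()      # right after model construction
#   ... run a few training steps ...
#   mid_training = rss_bytes()     # weights + grads + Adam state allocated
# Placeholder numbers, only to show the row layout:
row = footprint_row("fp16-base LoRA", 100 * 1024 ** 2, 160 * 1024 ** 2)
```

RSS covers the whole process, including the Python interpreter and imported libraries, so only the differences between readings and between regimes are meaningful.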

Block E: Adapter swap demo (optional)

Save the QLoRA-trained adapter.
Load the lab 02 fp16-trained adapter into the same NF4-base model.
Confirm both forward passes work; outputs differ. (Demonstrates the adapter-swap pattern that QLoRA enables at scale.)

Constraints

NF4 quantization happens before LoRA wrapping. The reverse order breaks the LoRA branch's gradient flow, because LoRA expects to wrap a Linear-like module, not raw quantized weights.
LoRA params stay fp16. Don't quantize them. Their tiny size means quantization buys nothing and complicates gradients.
No new training data. Use lab 02's splits.
Manifest mandatory. As always: seed_everything, manifest.json with versions + seed + base hash + adapter hash.
Single seed. Don't sweep — preview only.
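The manifest constraint can be sketched with the standard library alone. The field names and the build_manifest helper are assumptions for illustration. The lab's own seed_everything and manifest schema are not reproduced here, and the files hashed below are throwaway placeholders.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(seed: int, base_path, adapter_path, versions: dict) -> dict:
    """Collect seed, library versions, and content hashes in one dict."""
    return {
        "seed": seed,
        "versions": versions,
        "base_sha256": sha256_file(base_path),
        "adapter_sha256": sha256_file(adapter_path),
    }


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for the real checkpoints.
    base = Path(tmp) / "base.pt"
    adapter = Path(tmp) / "adapter_final.pt"
    base.write_bytes(b"placeholder base weights")
    adapter.write_bytes(b"placeholder adapter weights")
    manifest = build_manifest(
        seed=0,
        base_path=base,
        adapter_path=adapter,
        versions={"python": platform.python_version()},
    )
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Hashing the base checkpoint before quantization, and noting the quantization scheme separately, keeps the manifest comparable between the fp16-base and NF4-base runs.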

Stop conditions

You're done when:

Model B (NF4 base + adapter) produces outputs within tolerated NF4 quantization error of Model A (fp16 base + adapter). Document the tolerance and the measured diff.
One epoch of QLoRA training completes without NaN/Inf.
Memory measurements committed: a 2-row table (fp16-base LoRA vs NF4-base QLoRA) with weights / grads / Adam / total. Ratio computed.
REPORT.md notes: "absolute savings at MiniGPT scale are X MiB; at 7B scale the same recipe would save Y GiB" with the Y derived from theory 03 numbers.
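The 7B extrapolation is simple arithmetic once the storage cost per weight is fixed. The sketch below assumes exactly 7 billion quantized parameters, 4-bit codes, and one fp16 scale per group of 64 weights (suggested by the nf4_per_group_64 scheme name). These are assumptions for illustration: theory/03-memory-footprint.md may count differently, for example by excluding embeddings or compressing the scales, so derive the reported Y from its numbers.

```python
GIB = 1024 ** 3


def base_weight_bytes(n_params: int, bits_per_weight: int,
                      group_size=None, scale_bytes: int = 2) -> float:
    """Bytes needed to store the base weights.

    group_size=None means no per-group scales (plain fp16 storage).
    """
    total = n_params * bits_per_weight / 8
    if group_size is not None:
        # One scale per group of weights.
        total += (n_params / group_size) * scale_bytes
    return total


n_params = 7_000_000_000  # assumed parameter count
fp16_bytes = base_weight_bytes(n_params, bits_per_weight=16)
nf4_bytes = base_weight_bytes(n_params, bits_per_weight=4,
                              group_size=64, scale_bytes=2)
saved_gib = (fp16_bytes - nf4_bytes) / GIB
ratio = fp16_bytes / nf4_bytes
```

Under these assumptions the base-weight saving comes out at roughly 9.6 GiB. LoRA parameters, their gradients, and Adam state are the same in both regimes, so they cancel in the difference.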

Pitfalls (specific to this lab)

NaN in NF4 dequant. If QuantizedLinear.forward dequantizes to a tensor with NaN, the LoRA gradient explodes. Add a torch.isfinite assertion in the dequant path and another in the LoRA path. Phase 26 should have already debugged this; re-verify.
Loading the adapter into a QuantizedLinear-wrapped model. Names will differ (q_proj.base.weight vs q_proj.qweight). The load_lora_state_dict from src/minituner/lora.py should only target LoRA params — but verify.
Memory measurement under PyTorch lazy allocation. RSS measurements taken before the first backward pass under-report, because the allocator only commits memory when it is needed. Measure after at least 2 training steps.
requires_grad=True on the dequantized intermediate. Some QuantizedLinear implementations create a fresh fp16 dequant tensor in forward — make sure it's not added to the parameter set; it's a transient. The trainable set should still be LoRA only.
Mixed precision interactions. If MiniGPT's forward path uses some mix of fp32/fp16, the NF4 dequant intermediate must match the precision of the LoRA forward output. Mismatch → silent type promotion → memory bug.
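For the adapter-loading pitfall, a quick way to verify that only LoRA parameters are targeted is to split the checkpoint keys before loading. This is a sketch, not the behavior of load_lora_state_dict: the helper name split_lora_keys and the "lora_A" / "lora_B" key markers are assumptions, so substitute whatever naming src/minituner/lora.py actually uses.

```python
def split_lora_keys(state_dict: dict, markers=("lora_A", "lora_B")):
    """Separate LoRA entries from everything else in a state dict.

    Returns (lora_only_dict, sorted_list_of_other_keys).
    """
    lora = {k: v for k, v in state_dict.items()
            if any(m in k for m in markers)}
    other = sorted(k for k in state_dict if k not in lora)
    return lora, other


# Placeholder checkpoint: values would be tensors in a real adapter file.
checkpoint = {
    "attn.q_proj.lora_A": "tensor-A",
    "attn.q_proj.lora_B": "tensor-B",
    "attn.q_proj.base.weight": "fp16-weight",
}
lora_only, skipped = split_lora_keys(checkpoint)
```

If skipped is non-empty, the adapter file contains base weights, and loading it strictly into the NF4-base model would fail on the q_proj.base.weight versus q_proj.qweight name mismatch.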

When to consult solutions

After QLoRA training succeeds for one epoch and the memory table is filled. Compare your Y-at-7B-scale calculation to the reference solution's.

Estimated time

3-5 hours.

Phase 28 ends here. Write PHASE_28_REPORT.md and learners/borja/phase-28/reflections.md. Open Phase 29 only after explicit approval.