Lab 03 — QLoRA Preview

Combine what you have learned: quantize MiniGPT to NF4 with src/miniquant/, put LoRA on top with src/minituner/, and train for one epoch. The goal is NOT to gain accuracy; it is to verify that the pattern composes correctly and to measure the memory savings (modest on MiniGPT, dramatic at 7B+ scale).

Anchors: src/miniquant/BLUEPRINT.md (Phase 26), src/minituner/BLUEPRINT.md, theory/03-memory-footprint.md.


What you produce

A single experiment directory, experiments/28-qlora-preview/, containing:

  • A training script that composes NF4 quantization + LoRA on MiniGPT.
  • A memory measurement table comparing fp16-base LoRA vs NF4-base QLoRA.
  • A correctness check: outputs of QLoRA agree with fp16-base LoRA (same trained adapter) within NF4 quantization error.

Prerequisites

  • Lab 02 complete: LoRA r=8 training run with experiments/28-lora-finetune/adapter_final.pt saved.
  • src/miniquant/quantize.py working: quantize_symmetric_per_group and QuantizedLinear (from Phase 26).
  • src/minituner/qlora.py implemented per BLUEPRINT.

TODOs (sketch)

Block A — Composing NF4 base + LoRA

  1. Load base MiniGPT (fp16).
  2. Apply NF4 quantization in-place via src/miniquant/: each nn.Linear becomes a QuantizedLinear with frozen NF4 weights + per-group scales.
  3. Apply LoRA on top via src/minituner/qlora.py::wrap_minigpt_qlora(model, r=8, alpha=16.0, quant_scheme="nf4_per_group_64").
  4. Confirm: model.layer.attn.q_proj is now a QuantizedLinear wrapped in a LoRALinear. The forward path is lora_branch(x) + qlinear_dequant(x).
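
The wiring in steps 2 to 4 can be sketched with toy stand-ins. ToyQuantLinear and ToyLoRALinear below are hypothetical and are not the classes in src/miniquant/ or src/minituner/. The base uses a plain 4-bit absmax grid instead of the NF4 codebook, runs in fp32 for CPU portability, and ignores bias. Only the structure is the point: a frozen quantized base, a trainable low-rank branch, and the two outputs summed.

```python
import torch
import torch.nn as nn


class ToyQuantLinear(nn.Module):
    """Hypothetical stand-in for QuantizedLinear: frozen 4-bit codes + per-group scales."""

    def __init__(self, linear: nn.Linear, group_size: int = 64):
        super().__init__()
        w = linear.weight.detach().float()
        self.shape = w.shape
        groups = w.reshape(-1, group_size)  # assumes numel is divisible by group_size
        scale = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
        codes = torch.round(groups / scale).clamp(-8, 7).to(torch.int8)
        # Buffers, not Parameters: the quantized base never receives gradients.
        self.register_buffer("codes", codes)
        self.register_buffer("scale", scale)

    def dequant(self) -> torch.Tensor:
        # Transient tensor rebuilt on every call; it is never registered as a parameter.
        return (self.codes.float() * self.scale).reshape(self.shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.dequant().t()


class ToyLoRALinear(nn.Module):
    """Hypothetical stand-in for LoRALinear wrapping a frozen base module."""

    def __init__(self, base: nn.Module, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no-op at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lora = (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling
        return self.base(x) + lora


torch.manual_seed(0)
fp_linear = nn.Linear(64, 64, bias=False)
q_proj = ToyLoRALinear(ToyQuantLinear(fp_linear), 64, 64, r=8, alpha=16.0)

x = torch.randn(4, 64)
out = q_proj(x)
trainable = [name for name, p in q_proj.named_parameters() if p.requires_grad]
print(trainable)
```

Because lora_B starts at zero, the wrapped module initially reproduces the quantized base exactly, which gives a cheap sanity check before any adapter is loaded.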

Block B — Output correctness check

  1. Load the trained adapter from lab 02 (experiments/28-lora-finetune/adapter_final.pt).
  2. Reconstruct two models:
       • Model A: fp16 base + LoRA adapter.
       • Model B: NF4 base + LoRA adapter (same adapter weights).
  3. Run a fixed batch of test prompts through both. Measure mean_abs_diff of output logits.
  4. Expected: diff is small but non-zero (NF4 quantization adds ~1-3% error on logit magnitudes). Document the actual value.
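
One possible way to compute the comparison is sketched below. The helper name logit_diff and the returned keys are hypothetical. Normalising by the mean absolute logit of Model A is one reading of "error on logit magnitudes", not necessarily the one the reference solution uses. The tensors here are synthetic placeholders for the two models' outputs.

```python
import torch


@torch.no_grad()
def logit_diff(logits_a: torch.Tensor, logits_b: torch.Tensor) -> dict:
    """Compare two logit tensors of identical shape; upcast so fp16 rounding does not dominate."""
    a = logits_a.float()
    b = logits_b.float()
    abs_diff = (a - b).abs()
    mean_abs_diff = abs_diff.mean().item()
    ref_scale = a.abs().mean().item()  # typical logit magnitude of the reference model
    return {
        "mean_abs_diff": mean_abs_diff,
        "max_abs_diff": abs_diff.max().item(),
        "relative": mean_abs_diff / max(ref_scale, 1e-12),
    }


# Synthetic stand-ins: (batch, sequence, vocab) logits from Model A and Model B.
logits_a = torch.full((2, 3, 5), 2.0)
logits_b = logits_a + 0.02
stats = logit_diff(logits_a, logits_b)
print(stats)
```

Reporting the maximum alongside the mean helps distinguish uniform quantization noise from a single badly quantized layer.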

Block C — One epoch of QLoRA training

  1. From the lab 02 starting checkpoint, run one additional epoch of QLoRA training.
  2. The point isn't accuracy — it's that the training loop runs without NaN/Inf and the loss decreases.
  3. Log the per-step train loss and a per-step NaN/Inf check.
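
A minimal sketch of the logging side, assuming the training loop hands over one Python float per step. The function names check_finite and summarize are hypothetical, and comparing the first and last tenth of the run is an arbitrary choice for deciding whether the loss decreased.

```python
import math


def check_finite(step: int, loss: float) -> float:
    """Fail fast on the first NaN/Inf instead of training on garbage."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
    return loss


def summarize(losses: list) -> dict:
    """Check every step, then compare the average of the first and last ~10% of steps."""
    for step, loss in enumerate(losses):
        check_finite(step, loss)
    k = max(1, len(losses) // 10)
    first_avg = sum(losses[:k]) / k
    last_avg = sum(losses[-k:]) / k
    return {
        "steps": len(losses),
        "first_avg": first_avg,
        "last_avg": last_avg,
        "decreased": last_avg < first_avg,
    }


# Placeholder values standing in for a real run's per-step losses.
summary = summarize([2.0, 1.8, 1.5, 1.4])
print(summary)
```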

Block D — Memory measurement

For both regimes (fp16-base LoRA, NF4-base QLoRA):

  1. Use psutil.Process().memory_info().rss for CPU. (Or torch.cuda.max_memory_allocated() if you're on GPU — Phase 28 doesn't require GPU.)
  2. Measure right after model construction (weights only).
  3. Measure mid-training (weights + grads + Adam state, all allocated).
  4. Report the delta as the "training footprint".

Block E — Adapter swap demo (optional)

  1. Save the QLoRA-trained adapter.
  2. Load the lab 02 fp16-trained adapter into the same NF4-base model.
  3. Confirm both forward passes work; outputs differ. (Demonstrates the adapter-swap pattern that QLoRA enables at scale.)

Constraints

  • NF4 quantization happens before LoRA wrapping. Reverse order breaks the LoRA branch's gradient flow (LoRA expects to be applied to a Linear-like module, not raw quantized weights).
  • LoRA params stay fp16. Don't quantize them. Their tiny size means quantization buys nothing and complicates gradients.
  • No new training data. Use lab 02's splits.
  • Manifest mandatory. As always: seed_everything, manifest.json with versions + seed + base hash + adapter hash.
  • Single seed. Don't sweep — preview only.
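
The manifest requirement could be met along these lines. This is a sketch only: write_manifest, sha256_of, and the JSON field names are hypothetical, and the project's own seed_everything and manifest schema take precedence if they differ. The demo writes placeholder bytes to a temporary directory rather than touching real checkpoints.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(out_dir: Path, seed: int, base_ckpt: Path,
                   adapter_ckpt: Path, versions: dict) -> dict:
    """Write manifest.json with versions, seed, base hash, and adapter hash."""
    manifest = {
        "seed": seed,
        "versions": {"python": platform.python_version(), **versions},
        "base_sha256": sha256_of(base_ckpt),
        "adapter_sha256": sha256_of(adapter_ckpt),
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    base_path = tmp_dir / "base.pt"
    adapter_path = tmp_dir / "adapter_final.pt"
    base_path.write_bytes(b"placeholder base weights")
    adapter_path.write_bytes(b"placeholder adapter weights")
    # In a real run, pass e.g. {"torch": torch.__version__} here.
    manifest = write_manifest(tmp_dir / "run", seed=0, base_ckpt=base_path,
                              adapter_ckpt=adapter_path, versions={"torch": "x.y.z"})
    reloaded = json.loads((tmp_dir / "run" / "manifest.json").read_text())
```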

Stop conditions

You're done when:

  1. Model B (NF4 base + adapter) produces outputs within tolerated NF4 quantization error of Model A (fp16 base + adapter). Document the tolerance and the measured diff.
  2. One epoch of QLoRA training completes without NaN/Inf.
  3. Memory measurements committed: a 2-row table (fp16-base LoRA vs NF4-base QLoRA) with weights / grads / Adam / total. Ratio computed.
  4. REPORT.md notes: "absolute savings at MiniGPT scale are X MiB; at 7B scale the same recipe would save Y GiB" with the Y derived from theory 03 numbers.
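
The shape of the Y calculation in item 4 is sketched below. Every number in it is an assumption, not a value from theory/03-memory-footprint.md: exactly 7e9 parameters, all of them in quantized linear layers, 4 bits per NF4 weight, and one fp16 scale per group of 64. LoRA parameters, their gradients, and the Adam state are identical in both regimes, so under these assumptions the saving comes entirely from the frozen base weights; activations are ignored.

```python
from typing import Optional

GIB = 1024 ** 3


def base_weight_bytes(n_params: float, bits: float,
                      group_size: Optional[int] = None,
                      scale_bytes: int = 2) -> float:
    """Bytes for the frozen base weights, plus per-group scales if quantized."""
    total = n_params * bits / 8
    if group_size is not None:
        total += (n_params / group_size) * scale_bytes  # one scale per group
    return total


N_7B = 7e9  # assumption: all 7B parameters live in quantized linear layers

fp16_gib = base_weight_bytes(N_7B, bits=16) / GIB
nf4_gib = base_weight_bytes(N_7B, bits=4, group_size=64) / GIB
saved_gib = fp16_gib - nf4_gib

print(f"fp16 base: {fp16_gib:.2f} GiB")
print(f"NF4 base:  {nf4_gib:.2f} GiB")
print(f"saved:     {saved_gib:.2f} GiB")
```

Swapping in MiniGPT's actual parameter count gives the corresponding X in MiB, which should be cross-checked against the measured table rather than substituted for it.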

Pitfalls (specific to this lab)

  1. NaN in NF4 dequant. If QuantizedLinear.forward dequantizes to a tensor with NaN, the LoRA gradient explodes. Add a torch.isfinite assertion in the dequant path and another in the LoRA path. Phase 26 should have already debugged this; re-verify.
  2. Loading the adapter into a QuantizedLinear-wrapped model. Names will differ (q_proj.base.weight vs q_proj.qweight). The load_lora_state_dict from src/minituner/lora.py should only target LoRA params — but verify.
  3. Memory measurement under PyTorch lazy allocation. RSS measured before the first backward pass under-reports, because the allocator only commits memory when it is needed. Measure after at least 2 training steps.
  4. requires_grad=True on the dequantized intermediate. Some QuantizedLinear implementations create a fresh fp16 dequant tensor in forward — make sure it's not added to the parameter set; it's a transient. The trainable set should still be LoRA only.
  5. Mixed precision interactions. If MiniGPT's forward path uses some mix of fp32/fp16, the NF4 dequant intermediate must match the precision of the LoRA forward output. A mismatch triggers silent type promotion (fp16 + fp32 yields fp32), which inflates memory without raising an error.
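
For pitfall 2, a name-level check before loading can make the mismatch visible. The sketch below is not load_lora_state_dict from src/minituner/lora.py. The helper split_adapter_keys is hypothetical, and so is the assumption that LoRA parameter names contain "lora_". Values are placeholders, since only the keys matter here.

```python
def split_adapter_keys(adapter_state: dict, model_keys: set, marker: str = "lora_"):
    """Keep only LoRA entries and report anything that would be skipped or would not fit."""
    lora_items = {k: v for k, v in adapter_state.items() if marker in k}
    skipped = sorted(k for k in adapter_state if marker not in k)   # base weights etc.
    missing = sorted(k for k in lora_items if k not in model_keys)  # no target in the model
    return lora_items, skipped, missing


# Placeholder checkpoint saved from an fp16-base model (names assumed for illustration).
adapter_state = {
    "q_proj.lora_A": "tensor A",
    "q_proj.lora_B": "tensor B",
    "q_proj.base.weight": "fp16 base weight",
}
# Placeholder key set of the NF4-base model, where the base weight is stored under another name.
model_keys = {"q_proj.lora_A", "q_proj.lora_B", "q_proj.qweight"}

lora_items, skipped, missing = split_adapter_keys(adapter_state, model_keys)
print(sorted(lora_items), skipped, missing)
```

A non-empty missing list means the adapter and the wrapped model disagree on module paths and the load should stop there.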

When to consult solutions

After QLoRA training succeeds for one epoch and the memory table is filled. Compare your Y-at-7B-scale calculation to the reference solution's.

Estimated time

3-5 hours.


Phase 28 ends here. Write PHASE_28_REPORT.md and learners/borja/phase-28/reflections.md. Open Phase 29 only after explicit approval.