
Lab 00 — LoRALinear By Hand

Implement LoRALinear from scratch. Three non-negotiable invariants: (1) at init the output is identical to that of the base Linear; (2) only A and B receive gradients; (3) merge_ produces numerically equivalent weights (within 1e-5). Without these three, everything else fails silently.

Anchors: src/minituner/BLUEPRINT.md; theory/02-parameter-count.md.


What you produce

A working LoRALinear module in src/minituner/lora.py that wraps an existing nn.Linear. Failing tests in tests/test_minituner.py come pre-scaffolded by Claude; your job is to fill in lora.py until they pass.

Deliverables on disk:

  • src/minituner/lora.py — LoRALinear, wrap_minigpt_with_lora, lora_state_dict, load_lora_state_dict.
  • tests/test_minituner.py — green.
  • A short markdown note experiments/28-lora-by-hand/notes.md recording: initial output diff vs base, gradient flow check, merge equivalence.

TODOs (sketch)

Block A — Identity-at-init LoRALinear

  1. Construct LoRALinear(base: nn.Linear, r=8, alpha=16.0, dropout=0.0, freeze_base=True).
  2. Allocate two parameter matrices: A ∈ ℝ^{r × in_features}, B ∈ ℝ^{out_features × r}.
  3. Initialize: A via kaiming_uniform_(A, a=math.sqrt(5)) (matches PyTorch default for Linear weights); B = zeros.
  4. Optionally apply nn.Dropout(p=dropout) to the input of the LoRA branch.
  5. Forward: y = base(x) + (alpha / r) * F.linear(F.linear(dropout(x), A), B). (Equivalent: (α/r) · x @ Aᵀ @ Bᵀ.)
  6. If freeze_base: set base.weight.requires_grad = False and base.bias.requires_grad = False (if bias exists).
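
The forward formula in step 5 and the identity-at-init invariant can be checked on bare tensors before any module is written. The following is a minimal sketch, not the reference solution: the shapes (64 → 32), the batch size, and the variable names are illustrative assumptions.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

in_features, out_features, r, alpha = 64, 32, 8, 16.0  # illustrative sizes
base = nn.Linear(in_features, out_features)

# LoRA factors as bare parameters: A is (r, in_features), B is (out_features, r).
A = nn.Parameter(torch.empty(r, in_features))
B = nn.Parameter(torch.zeros(out_features, r))
nn.init.kaiming_uniform_(A, a=math.sqrt(5))

x = torch.randn(4, in_features)
scale = alpha / r

# Two chained F.linear calls compute (x @ A.T) @ B.T.
lora_branch = scale * F.linear(F.linear(x, A), B)
explicit = scale * (x @ A.T @ B.T)

y = base(x) + lora_branch
init_diff = (y - base(x)).abs().max().item()          # B is zero, so the branch adds nothing
branch_diff = (lora_branch - explicit).abs().max().item()
```

Because B starts at zero, the LoRA branch is exactly zero and `init_diff` is expected to be 0.0 rather than merely small.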

Block B — Wrapping a MiniGPT

  1. wrap_minigpt_with_lora(model, r=8, alpha=16.0, target_modules=("q_proj", ..., "mlp.fc2"), dropout=0.05) walks model.named_modules().
  2. For each nn.Linear whose dotted name ends with one of target_modules, replace it in-place with a LoRALinear wrapping the original.
  3. After wrapping: assert sum(p.requires_grad for name, p in model.named_parameters() if "lora_" in name) > 0 and that every parameter without "lora_" in its name has requires_grad=False.
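
One way to do the in-place replacement is to collect the matching dotted names first and then call setattr on each parent module. The sketch below assumes a hypothetical `TaggedLinear` stand-in (so it does not give away LoRALinear itself), a hypothetical `ToyBlock` model, and a helper name `replace_linears` that is not part of the lab's API.

```python
from __future__ import annotations

import torch.nn as nn


class TaggedLinear(nn.Module):
    """Hypothetical stand-in for LoRALinear: it only holds the original layer."""

    def __init__(self, base: nn.Linear) -> None:
        super().__init__()
        self.base = base

    def forward(self, x):
        return self.base(x)


def replace_linears(model: nn.Module, target_modules: tuple[str, ...]) -> list[str]:
    # Collect matches first so the module tree is not mutated while iterating.
    matches = [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and name.endswith(target_modules)
    ]
    for name in matches:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, TaggedLinear(getattr(parent, child_name)))
    return matches


class ToyBlock(nn.Module):
    """Hypothetical model with the kind of names a MiniGPT block might use."""

    def __init__(self) -> None:
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.mlp = nn.Sequential()
        self.mlp.add_module("fc2", nn.Linear(8, 8))
        self.head = nn.Linear(8, 4)


model = ToyBlock()
replaced = replace_linears(model, ("q_proj", "mlp.fc2"))
```

Note that plain suffix matching would also match a name such as "xq_proj"; a stricter version could compare whole dotted components.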

Block C — Adapter checkpoint round-trip

  1. lora_state_dict(model) returns a dict mapping fully-qualified parameter names → tensors, restricted to the LoRA A, B (and any per-module scale if you stored it as a buffer).
  2. load_lora_state_dict(model, state) does the inverse, validating shape compatibility.
  3. Round-trip test: save → reload → forward output matches original to 1e-5.
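
The save and load steps can be sketched as a filter over `state_dict()` plus a shape check before loading. This assumes the LoRA parameters are registered under names starting with "lora_" (the `lora_A` / `lora_B` names, the `ToyAdapter` module, and the helper names are hypothetical).

```python
from __future__ import annotations

import torch
import torch.nn as nn


class ToyAdapter(nn.Module):
    """Hypothetical module; the lora_A / lora_B names are an assumption."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(16, 16)
        self.lora_A = nn.Parameter(torch.randn(4, 16))
        self.lora_B = nn.Parameter(torch.zeros(16, 4))


def adapter_state(model: nn.Module) -> dict[str, torch.Tensor]:
    # Keep only entries whose last name component starts with "lora_".
    return {
        k: v.detach().clone()
        for k, v in model.state_dict().items()
        if k.rpartition(".")[2].startswith("lora_")
    }


def load_adapter_state(model: nn.Module, state: dict[str, torch.Tensor]) -> None:
    own = model.state_dict()
    for k, v in state.items():
        if k not in own:
            raise KeyError(f"unexpected adapter key: {k}")
        if own[k].shape != v.shape:
            raise ValueError(f"shape mismatch for {k}")
    # strict=False because the base weights are intentionally absent from `state`.
    model.load_state_dict(state, strict=False)


torch.manual_seed(0)
src = nn.Sequential(ToyAdapter())
dst = nn.Sequential(ToyAdapter())
state = adapter_state(src)
load_adapter_state(dst, state)
```

Validating shapes by hand gives a clearer error than letting `load_state_dict` fail, and it keeps `strict=False` from hiding a misspelled key.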

Block D — Merge for inference

  1. merge_(self) computes W_new = base.weight + (alpha/r) * B @ A in-place, then zeros A, B.
  2. Post-merge: forward(x) matches the un-merged forward(x) to 1e-5 (the only difference is float rounding).
  3. After merging, the LoRA matrices can be re-randomized for a new adapter (do not require this — just don't error if it happens).
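
The merge identity can be verified numerically on bare tensors. This is a sketch under assumptions: the sizes and the 0.1 scaling of A and B are arbitrary choices that mimic factors after some training, and the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

in_features, out_features, r, alpha = 64, 64, 8, 16.0
scale = alpha / r
base = nn.Linear(in_features, out_features)
A = torch.randn(r, in_features) * 0.1
B = torch.randn(out_features, r) * 0.1  # nonzero, as after some training

x = torch.randn(4, in_features)
with torch.no_grad():
    unmerged = base(x) + scale * F.linear(F.linear(x, A), B)

    # B @ A has shape (out_features, in_features), the same as base.weight.
    base.weight.add_(scale * (B @ A))
    A.zero_()
    B.zero_()  # the LoRA branch now contributes nothing

    merged = base(x) + scale * F.linear(F.linear(x, A), B)

merge_diff = (merged - unmerged).abs().max().item()
```

The two results differ only because the merged path multiplies by one summed matrix while the unmerged path adds two separately rounded products, so `merge_diff` should be tiny but need not be exactly zero.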

Constraints

  • PyTorch is allowed (Phase 24+ unlocked it). Use torch.nn, torch.nn.functional. No peft import.
  • Forward must remain numerically identical to base at init — within machine epsilon. The test B = 0 ⇒ LoRA(x) == base(x) is non-negotiable.
  • No use of nn.Linear as the LoRA matrices — instantiate nn.Parameter directly. Why: makes the parameter set explicit; avoids surprise biases.
  • Do not use register_module("base", base) blindly — make sure the base's requires_grad=False is preserved through saving/loading.
  • Reproducibility: seed via the conftest fixture before initializing A.

Stop conditions

You're done when:

  1. pytest tests/test_minituner.py -k "lora_init_identity or param_count or merge or freeze or roundtrip" is green.
  2. mypy --strict src/minituner/lora.py passes.
  3. notes.md records measured values: initial output diff (should be 0.0), pre-/post-merge diff (should be < 1e-5), trainable param count for a Linear(64, 64) with r=8 (should be 1024).
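
The expected count of 1024 follows from the shapes of A and B alone. A small helper (the name `lora_param_count` is hypothetical) makes the arithmetic explicit:

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    # A has r * in_features entries, B has out_features * r entries.
    return r * in_features + out_features * r


count = lora_param_count(64, 64, 8)  # 8 * 64 + 64 * 8
base_count = 64 * 64 + 64            # frozen weight plus bias of Linear(64, 64)
```

For this layer the adapter holds 1024 trainable values against 4160 frozen ones.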

Pitfalls (specific to this lab)

  1. B = zeros, not B ~ small_random. Random B makes step 0 non-identity; this is the most common LoRA implementation bug. Test it explicitly: with lora = LoRALinear(base), assert (lora(x) - base(x)).abs().max() == 0 right after __init__.
  2. Forgetting the α/r scale. Without it, the size of the LoRA update grows with r (doubling r roughly doubles the effective LR), which confounds experiments. Hardcode alpha=16.0 as the default; never silently default to alpha=r.
  3. In-place modification of base weights. merge_ should either write the merged weight into the base layer (for example via base.weight.data.copy_(W_new)) or build a fresh LoRALinear around a new frozen base. Either is fine; don't accidentally keep nonzero LoRA params after the merge, or the update is applied twice.
  4. requires_grad leak. You set requires_grad = False on the base weight and bias and create the LoRA nn.Parameter objects, but a later model.train() or .to(device) call could change what is trainable without any error; this is easy to miss. Test the freeze property explicitly post-.to(device).
  5. Dropout in the LoRA path with model.eval(). Dropout should be off in eval; if you implemented it via raw F.dropout(x, p, training=self.training), double-check that the training flag is passed by keyword. An nn.Dropout instance handles this for you.
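
Pitfalls 4 and 5 are both cheap to test directly. The sketch below uses a dtype change as a stand-in for `.to(device)` so that it runs without a GPU (that substitution is an assumption), and it checks the dropout behaviour on a standalone nn.Dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pitfall 4: freeze, then move/convert and toggle train mode, then re-check.
base = nn.Linear(8, 8)
for p in base.parameters():
    p.requires_grad = False
base = base.to(torch.float64).train()
frozen_after_move = all(not p.requires_grad for p in base.parameters())

# Pitfall 5: nn.Dropout is the identity in eval mode.
drop = nn.Dropout(p=0.5)
x = torch.ones(4, 8)
drop.eval()
eval_identity = torch.equal(drop(x), x)

# In train mode each element is either dropped (0) or rescaled by 1 / (1 - p) = 2.
drop.train()
out = drop(x)
train_values_ok = bool(((out == 0) | (out == 2.0)).all())
```

The freeze check is expected to pass here; keeping it in the test suite guards against a future refactor that re-creates parameters and silently unfreezes them.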

When to consult solutions

After the failing tests stop telling you anything new (you've stared at one for >20 minutes), open solutions/00-lora-by-hand-ref.md. Solutions are written after Borja's first attempt.

Estimated time

3-5 hours. The conceptual difficulty is low (you've read theory 02); the implementation difficulty is moderate (PyTorch's nn.Module and parameter-registration subtleties).


Next: lab/01-lora-counts.md.