Lab 00: LoRALinear By Hand
Implement LoRALinear from scratch. Three non-negotiable invariants: (1) at init the output is identical to that of the base Linear; (2) only A and B receive gradients; (3) merge_ produces weights that are equivalent to within 1e-5. Without these three, everything else fails silently.
Anchors: src/minituner/BLUEPRINT.md; theory/02-parameter-count.md.
What you produce
A working LoRALinear module in src/minituner/lora.py that wraps an existing nn.Linear. Failing tests in tests/test_minituner.py come pre-scaffolded by Claude; your job is to fill in lora.py until they pass.
Deliverables on disk:
- src/minituner/lora.py: LoRALinear, wrap_minigpt_with_lora, lora_state_dict, load_lora_state_dict.
- tests/test_minituner.py: green.
- A short markdown note, experiments/28-lora-by-hand/notes.md, recording: initial output diff vs base, gradient flow check, merge equivalence.
TODOs (sketch)
Block A: Identity-at-init LoRALinear
- Construct LoRALinear(base: nn.Linear, r=8, alpha=16.0, dropout=0.0, freeze_base=True).
- Allocate two parameter matrices: A ∈ ℝ^{r × in_features}, B ∈ ℝ^{out_features × r}.
- Initialize: A via kaiming_uniform_(A, a=math.sqrt(5)) (matches PyTorch default for Linear weights); B = zeros.
- Optionally apply nn.Dropout(p=dropout) to the input of the LoRA branch.
- Forward: y = base(x) + (alpha / r) * F.linear(F.linear(dropout(x), A), B). (The LoRA term is equivalent to (α/r) · x @ Aᵀ @ Bᵀ.)
- If freeze_base: set base.weight.requires_grad = False and base.bias.requires_grad = False (if bias exists).
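The Block A steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reference solution: the attribute names lora_A, lora_B, and scale are choices made for this sketch (the "lora_" prefix is only suggested by the Block B assertion), and device/dtype handling is omitted.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Sketch: a (frozen) base Linear plus a scaled low-rank branch."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0,
                 dropout: float = 0.0, freeze_base: bool = True) -> None:
        super().__init__()
        self.base = base
        self.scale = alpha / r
        # Assumed attribute names; A is (r x in_features), B is (out_features x r).
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.dropout = nn.Dropout(p=dropout)
        if freeze_base:
            base.weight.requires_grad = False
            if base.bias is not None:
                base.bias.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # B is all zeros at init, so this branch contributes exact zeros.
        low_rank = F.linear(F.linear(self.dropout(x), self.lora_A), self.lora_B)
        return self.base(x) + self.scale * low_rank


torch.manual_seed(0)
base = nn.Linear(64, 64)
lora = LoRALinear(base, r=8, alpha=16.0)
x = torch.randn(4, 64)

# Invariant 1: identical output at init. Invariant 2: only A and B train.
max_diff = (lora(x) - base(x)).abs().max().item()
trainable = sorted(n for n, p in lora.named_parameters() if p.requires_grad)
print(max_diff, trainable)
```

Using nn.Parameter directly (rather than two nn.Linear layers) keeps the trainable set to exactly two tensors with no hidden bias.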
Block B: Wrapping a MiniGPT
- wrap_minigpt_with_lora(model, r=8, alpha=16.0, target_modules=("q_proj", ..., "mlp.fc2"), dropout=0.05) walks model.named_modules().
- For each nn.Linear whose dotted name ends with one of target_modules, replace it in-place with a LoRALinear wrapping the original.
- After wrapping: assert sum(p.requires_grad for name, p in model.named_parameters() if "lora_" in name) > 0 and all base params have requires_grad=False.
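One detail worth pinning down is what "ends with" means for dotted names. The sketch below assumes matching should respect "." boundaries, so that "q_proj" does not accidentally select a module called "xq_proj"; that interpretation, the helper name matches_target, and the module names in the example are all hypothetical.

```python
from typing import Iterable, List, Tuple


def matches_target(name: str, target_modules: Iterable[str]) -> bool:
    """True if the dotted module name ends with a target on a '.' boundary."""
    return any(name == t or name.endswith("." + t) for t in target_modules)


# Hypothetical module names, for illustration only.
names: List[str] = [
    "blocks.0.attn.q_proj",
    "blocks.0.attn.xq_proj",  # a bare str.endswith("q_proj") would match this
    "blocks.0.mlp.fc2",
    "blocks.0.other.fc2",     # does not end with "mlp.fc2"
    "lm_head",
]
targets: Tuple[str, ...] = ("q_proj", "mlp.fc2")

selected = [n for n in names if matches_target(n, targets)]
print(selected)  # ['blocks.0.attn.q_proj', 'blocks.0.mlp.fc2']
```

When doing the actual replacement, it is usually safer to collect the matching names first and swap modules afterwards (for example with setattr on the parent module), rather than mutating the tree while iterating over named_modules().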
Block C: Adapter checkpoint round-trip
- lora_state_dict(model) returns a dict mapping fully-qualified parameter names → tensors, restricted to the LoRA A, B (and any per-module scale if you stored it as a buffer).
- load_lora_state_dict(model, state) does the inverse, validating shape compatibility.
- Round-trip test: save → reload → forward output matches original to 1e-5.
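A rough sketch of the two checkpoint helpers, assuming LoRA parameters are identified by a "lora_" prefix on the last component of their name. TinyLoRA is a hypothetical stand-in for a wrapped layer, used only so the example is self-contained; buffers and missing-key checks are left out.

```python
from typing import Dict

import torch
import torch.nn as nn


class TinyLoRA(nn.Module):
    """Hypothetical stand-in for a wrapped layer; only the names matter here."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(4, 3)
        self.lora_A = nn.Parameter(torch.randn(2, 4))
        self.lora_B = nn.Parameter(torch.randn(3, 2))


def lora_state_dict(model: nn.Module) -> Dict[str, torch.Tensor]:
    # Keep only parameters whose final name component starts with "lora_".
    return {n: p.detach().clone() for n, p in model.named_parameters()
            if n.rsplit(".", 1)[-1].startswith("lora_")}


def load_lora_state_dict(model: nn.Module, state: Dict[str, torch.Tensor]) -> None:
    params = dict(model.named_parameters())
    # Validate everything first so a bad checkpoint never half-loads.
    for name, tensor in state.items():
        if name not in params:
            raise KeyError(f"unexpected adapter key: {name}")
        if params[name].shape != tensor.shape:
            raise ValueError(f"shape mismatch for {name}")
    with torch.no_grad():
        for name, tensor in state.items():
            params[name].copy_(tensor)


src, dst = nn.Sequential(TinyLoRA()), nn.Sequential(TinyLoRA())
state = lora_state_dict(src)
load_lora_state_dict(dst, state)
print(sorted(state))  # ['0.lora_A', '0.lora_B']
```

Copying under torch.no_grad() leaves each parameter's requires_grad flag untouched, so loading an adapter does not change what is trainable.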
Block D: Merge for inference
- merge_(self) computes W_new = base.weight + (alpha/r) * B @ A in-place, then zeros A, B.
- Post-merge: forward(x) matches the un-merged forward(x) to 1e-5 (the only difference is float rounding).
- After merging, the LoRA matrices can be re-randomized for a new adapter (do not require this; just don't error if it happens).
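Why the merge is exact up to rounding can be seen by expanding the forward pass with dropout inactive (p = 0 or eval mode). Here W and b stand for base.weight and base.bias, and x is treated as a row vector (or a batch of rows):

```latex
\begin{aligned}
y &= x W^{\top} + b + \frac{\alpha}{r}\, x A^{\top} B^{\top} \\
  &= x \left( W + \frac{\alpha}{r}\, B A \right)^{\top} + b
  && \text{since } (BA)^{\top} = A^{\top} B^{\top} \\
  &= x W_{\text{new}}^{\top} + b,
  \qquad W_{\text{new}} = W + \frac{\alpha}{r}\, B A .
\end{aligned}
```

The shapes agree: B is out_features × r and A is r × in_features, so BA has the same out_features × in_features shape as W. The two sides differ only in the order of floating-point operations, which is where the 1e-5 tolerance comes from.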
Constraints
- PyTorch is allowed (Phase 24+ unlocked it). Use torch.nn, torch.nn.functional. No peft import.
- Forward must remain numerically identical to base at init, within machine epsilon (with B = 0 the LoRA branch adds exact zeros, so the measured diff should be 0.0). The test B = 0 ⇒ LoRA(x) == base(x) is non-negotiable.
- No use of nn.Linear as the LoRA matrices; instantiate nn.Parameter directly. Why: makes the parameter set explicit; avoids surprise biases.
- Do not use register_module("base", base) blindly: make sure the base's requires_grad=False is preserved through saving/loading.
- Reproducibility: seed via the conftest fixture before initializing A.
Stop conditions
You're done when:
- pytest tests/test_minituner.py -k "lora_init_identity or param_count or merge or freeze or roundtrip" is green.
- mypy --strict src/minituner/lora.py passes.
- notes.md records measured values: initial output diff (should be 0.0), pre-/post-merge diff (should be < 1e-5), trainable param count for a Linear(64, 64) with r=8 (should be 1024).
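The expected trainable count follows directly from the shapes of A and B. A small arithmetic check, with lora_trainable_params as a hypothetical helper name:

```python
def lora_trainable_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in A (r x in_features) plus B (out_features x r)."""
    return r * in_features + out_features * r


count = lora_trainable_params(64, 64, r=8)
base_count = 64 * 64 + 64  # frozen weight plus bias of Linear(64, 64)
print(count, base_count)  # 1024 4160
```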
Pitfalls (specific to this lab)
- B = zeros, not B ~ small_random. Random B makes step 0 non-identity; this is the most common LoRA implementation bug. Test it explicitly: with lora = LoRALinear(base), assert (lora(x) - base(x)).abs().max() == 0 right after __init__.
- Forgetting the α/r scale. Without it, doubling r doubles the effective LR, which confuses experiments. Hardcode alpha=16.0 as the default; never silently default to alpha=r.
- In-place modification of base weights. merge_ should produce a new nn.Linear weight via data.copy_ or build a fresh LoRALinear with frozen new base. Either is fine; don't accidentally retain LoRA params after merge.
- requires_grad leak. You freeze the base (base.requires_grad_(False)) and create the LoRA nn.Parameter afterwards; if a later model.train() or .to(device) call then changes any of those flags, it is easy to miss. Test the freeze property explicitly post-.to(device).
- Dropout in the LoRA path with model.eval(). Dropout should be off in eval; if you implemented it via raw F.dropout(..., self.training), double-check that the flag is passed. This is easy to get right with an nn.Dropout instance.
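The last two pitfalls can be checked in a few lines. This is a small sanity sketch on a plain nn.Linear and a standalone nn.Dropout rather than on the lab's own module, so it shows the PyTorch behaviour to expect, not the test suite's actual assertions:

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)
layer.requires_grad_(False)  # freeze every parameter of the base
drop = nn.Dropout(p=0.5)

# Freeze flags are expected to survive mode switches and device moves.
layer.train()
layer.to("cpu")
frozen = all(not p.requires_grad for p in layer.parameters())

# nn.Dropout follows its own training flag: it is the identity in eval mode.
drop.eval()
x = torch.randn(2, 8)
same_in_eval = torch.equal(drop(x), x)
print(frozen, same_in_eval)
```

With raw F.dropout the training flag has to be passed by hand on every call, which is why the nn.Dropout instance is the less error-prone choice.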
When to consult solutions
After the failing tests stop telling you anything new (you've stared at one for >20 minutes), open solutions/00-lora-by-hand-ref.md. Solutions are written after Borja's first attempt.
Estimated time
3-5 hours. The conceptual difficulty is low (you've read theory 02); the implementation difficulty is moderate (PyTorch's nn.Module and parameter-registration subtleties).
Next: lab/01-lora-counts.md.