05 — LoRA on the actual Mini-GPT: ΔW = BA, exact parameter count, rank trade-off¶
The theory file `02-parameter-count.md` uses `h=768` as a generic example (GPT-2 scale). This file applies the same derivation to the actual Mini-GPT (d_model=64, d_ff=256, n_layers=2) and computes the exact LoRA parameter count for the ranks we are going to train. It also derives the rank-vs-accuracy regime with concrete figures.
Anchors: Phase 17 lab/02-parameter-inventory.md (Mini-GPT total = 103 680), this phase theory/02-parameter-count.md (the generic derivation).
ΔW = BA, written out¶
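LoRA reparameterizes the weight update for a Linear layer W ∈ ℝ^(out × in):

$$h = W x + \frac{\alpha}{r}\, B A\, x, \qquad B \in \mathbb{R}^{\text{out} \times r},\quad A \in \mathbb{R}^{r \times \text{in}}$$

with B initialized to zero and A initialized to small random values. Three invariants are non-negotiable: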
- Rank constraint. `rank(BA) ≤ r`. The update's column space has dimension at most `r`, so it can only reach a low-rank slice of the full `out × in` weight space.
- Step-0 invariance. `B = 0` at init means `BA = 0`, so the LoRA-equipped model produces outputs identical to the base on the first forward pass.
- Scaling decoupling. `α/r` keeps the effective update magnitude roughly constant as `r` varies, so tuning `α` doesn't need to follow `r`.
The non-LoRA path keeps the full out × in parameters; the LoRA path adds r × (in + out) trainable params and freezes everything else.
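The invariants above can be checked on a toy layer. The following is a minimal pure-Python sketch, not the lab's `LoRALinear`: the helper names (`matmul`, `lora_forward`), the toy sizes, and the init scale of 0.01 are all illustrative assumptions.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x) for a column vector x."""
    base = matmul(W, x)                    # frozen path
    update = matmul(B, matmul(A, x))       # low-rank path, never forms B A explicitly
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

random.seed(0)
out_dim, in_dim, r, alpha = 6, 5, 2, 4     # toy sizes (assumption), not Mini-GPT's

W = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]   # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(r)]      # small random init
B = [[0.0] * r for _ in range(out_dim)]                                     # zero init
x = [[random.gauss(0, 1)] for _ in range(in_dim)]                           # one input column

# Step-0 invariance: with B = 0 the LoRA layer reproduces the base layer.
step0_identical = lora_forward(W, A, B, x, alpha, r) == matmul(W, x)

trainable = r * (in_dim + out_dim)         # entries of A plus entries of B
frozen = out_dim * in_dim                  # entries of W
print(step0_identical, trainable, frozen)
```

Computing `B (A x)` rather than `(B A) x` keeps every intermediate at size `r`, which is why the adapter's cost scales with `r × (in + out)` and not with `out × in`.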
Mini-GPT's actual Linear inventory¶
From Phase 17's parameter inventory:
| Layer | Shape (out, in) | Params | Notes |
|---|---|---|---|
| `block_0.attn.W_q` | (64, 64) | 4 096 | no bias |
| `block_0.attn.W_k` | (64, 64) | 4 096 | no bias |
| `block_0.attn.W_v` | (64, 64) | 4 096 | no bias |
| `block_0.attn.W_o` | (64, 64) | 4 096 | no bias |
| `block_0.ffn.W_1` | (256, 64) | 16 384 + 256 bias | |
| `block_0.ffn.W_2` | (64, 256) | 16 384 + 64 bias | |
| `block_1.attn.{Q,K,V,O}` | (64, 64) ×4 | 16 384 | |
| `block_1.ffn.W_1` | (256, 64) | 16 384 + 256 bias | |
| `block_1.ffn.W_2` | (64, 256) | 16 384 + 64 bias | |
| Linear total | | 98 304 + 640 bias = 98 944 | |
| Embedding (tied) | (64, 64) | 4 096 | not LoRA-fied |
| LayerNorms | various | 640 | not LoRA-fied |
| Model total | | 103 680 | matches Phase 17 |
We target all 12 Linear layers for LoRA — the convention from QLoRA. Embeddings and LayerNorms stay frozen (standard practice).
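The inventory totals can be re-derived from the three hyperparameters. This is a small sketch under the shapes listed in the table; the helper name `linear_params` is hypothetical, and the LayerNorm figure is taken from the table as given, not recomputed.

```python
D_MODEL, D_FF, N_LAYERS = 64, 256, 2

def linear_params(out_dim, in_dim, bias):
    """Weights of one Linear layer, plus one bias per output unit if present."""
    return out_dim * in_dim + (out_dim if bias else 0)

# Per block: four bias-free attention projections and two FFN layers with bias.
attn = 4 * linear_params(D_MODEL, D_MODEL, bias=False)
ffn = linear_params(D_FF, D_MODEL, bias=True) + linear_params(D_MODEL, D_FF, bias=True)

linear_total = N_LAYERS * (attn + ffn)
embedding = 64 * 64          # tied embedding, shape (64, 64) from the table
layernorm_total = 640        # table value, not recomputed here
model_total = linear_total + embedding + layernorm_total

print(linear_total, model_total)
```

If the printed totals ever disagree with Phase 17's inventory, the LoRA counts derived from these shapes are suspect too.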
Exact LoRA parameter count per rank¶
For a Linear of shape (out, in), LoRA adds r × (in + out) trainable params.
Attention Linears (out = in = 64)¶
Per Linear at rank r: r × (64 + 64) = 128 r. Eight attention Linears total (4 per block × 2 blocks). Sum: 8 × 128 r = 1024 r.
FFN W_1 Linears (out = 256, in = 64)¶
Per Linear at rank r: r × (64 + 256) = 320 r. Two total (one per block). Sum: 2 × 320 r = 640 r.
FFN W_2 Linears (out = 64, in = 256)¶
Per Linear at rank r: r × (256 + 64) = 320 r. Two total. Sum: 640 r.
Total trainable LoRA parameters¶
$$N_{\text{LoRA}}(r) = 1024\,r + 640\,r + 640\,r = 2304\,r$$

Plugging in the ranks of interest:
| Rank `r` | LoRA params | Fraction of 103 680 | Fraction of 98 944 (Linears only) |
|---|---|---|---|
| 1 | 2 304 | 2.22% | 2.33% |
| 2 | 4 608 | 4.44% | 4.66% |
| 4 | 9 216 | 8.89% | 9.31% |
| 8 | 18 432 | 17.78% | 18.63% |
| 16 | 36 864 | 35.56% | 37.26% |
| 32 | 73 728 | 71.11% | 74.51% |
| 64 | 147 456 | 142.22% (exceeds!) | 149.03% |
At r = 64, LoRA has more parameters than full FT: once r ≥ min(in, out), BA can represent any out × in update, so the decomposition has the same capacity as full FT but stores it as two matrices. For a square h × h layer the break-even point is 2hr = h², i.e. r = h/2, so the "LoRA is parameter-efficient" claim holds only for r below roughly min(in, out) / 2.
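The per-rank counts and the break-even rank can be hand-checked with a few lines of Python. This sketch assumes the three layer shapes from the inventory; `lora_params` and `break_even_rank` are illustrative names, not functions from the course code.

```python
# (out, in, count) for every Linear targeted by LoRA.
LINEAR_SHAPES = [(64, 64, 8), (256, 64, 2), (64, 256, 2)]
MODEL_TOTAL = 103_680

def lora_params(r):
    """Trainable LoRA parameters at rank r, summed over all targeted Linears."""
    return sum(count * r * (in_dim + out_dim) for out_dim, in_dim, count in LINEAR_SHAPES)

def break_even_rank(out_dim, in_dim):
    """Largest r for which r * (in + out) is still strictly below out * in."""
    return (out_dim * in_dim - 1) // (in_dim + out_dim)

# (rank, LoRA params, percent of the whole model) for each rank in the table.
rows = [(r, lora_params(r), round(100 * lora_params(r) / MODEL_TOTAL, 2))
        for r in (1, 2, 4, 8, 16, 32, 64)]
for row in rows:
    print(row)

print(break_even_rank(64, 64), break_even_rank(256, 64))
```

For the square attention layers the last efficient rank is 31; the rectangular FFN layers tolerate a somewhat higher rank because their `out × in` product is larger relative to `in + out`.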
For the §A13 grammar-tutor task, the canonical choices are r = 4 or r = 8 — small enough to be efficient, large enough to express irregular-verb-specific updates.
Why bigger models benefit more (Mini-GPT is unusually unflattering to LoRA)¶
Recall the formula from the theory file 02-parameter-count.md: ratio per transformer layer = 1.5 r / h (18 h r LoRA parameters against 12 h² Linear weights when d_ff = 4h). At r = 8:
- Mini-GPT (`h = 64`): ratio = `1.5 × 8 / 64 = 0.19` → 19% of each layer's Linear weights. Huge by LoRA standards.
- LLaMA-7B (`h = 4096`): ratio = `1.5 × 8 / 4096 = 0.003` → 0.3%.
- LLaMA-70B (`h = 8192`): ratio = `1.5 × 8 / 8192 = 0.0015` → 0.15%.
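The ratios in this list can be recomputed from layer shapes instead of from the shortcut formula. The sketch below assumes a Mini-GPT-style block (four square attention projections, two FFN matrices with `d_ff = 4h`, biases ignored); real LLaMA blocks use a gated FFN with different widths, so the larger hidden sizes are only plugged into the same idealized layout. The function name `per_block_ratio` is hypothetical.

```python
def per_block_ratio(h, r, ffn_mult=4):
    """LoRA params divided by frozen Linear weights for one transformer block."""
    d_ff = ffn_mult * h
    full = 4 * h * h + 2 * h * d_ff                  # 4 attention + 2 FFN matrices = 12 h^2
    lora = 4 * r * (h + h) + 2 * r * (h + d_ff)      # 8 h r + 10 h r = 18 h r
    return lora / full

r = 8
for h in (64, 4096, 8192):
    ratio = per_block_ratio(h, r)
    # The closed form 1.5 r / h should agree with the shape-based count.
    print(h, f"{100 * ratio:.2f}%", f"{100 * 1.5 * r / h:.2f}%")
```

Because the LoRA count grows linearly in `h` while the frozen weights grow quadratically, the ratio falls as `1/h`, which is the whole reason larger models look so much better here.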
Mini-GPT is small enough that LoRA's relative efficiency is mediocre. We use it anyway because:
- Pedagogy: the rank trade-off is measurable on a tiny model.
- CPU-feasibility: even rank-8 LoRA on Mini-GPT trains in seconds on Borja's i5-8250U; rank-8 on LLaMA-7B does not.
- Conceptual clarity: the formula `2304 r` is closed-form and can be hand-checked.
For Phase 28's lab we use r = 8 as the default. Phase 32's grammar tutor will use the same rank.
The rank-vs-accuracy regime (predictions, not measurements)¶
LoRA accuracy on §A13 irregular-verb fine-tuning, expected:
| Rank `r` | LoRA params | Expected eval accuracy (irreg verbs) | Notes |
|---|---|---|---|
| 0 | 0 | same as base | degenerate: no update possible (see /break exercise) |
| 1 | 2 304 | 60-70% | one-dimensional update; not enough for 8 irregulars |
| 2 | 4 608 | 75-85% | starting to fit |
| 4 | 9 216 | 88-93% | sweet spot |
| 8 | 18 432 | 92-95% | sweet spot; default |
| 16 | 36 864 | 93-95% | diminishing returns |
| 32 | 73 728 | 93-95% | overfits the small irreg set |
Empirical pattern from the LoRA paper (Hu et al., 2021): the accuracy curve saturates quickly around r = 8. The Phase 28 DoD asks you to reproduce this saturation shape on Mini-GPT — the absolute numbers will shift, but the shape will not.
The "sweet spot at r = 8" is task-specific. For more complex updates (e.g., a domain shift from English to Spanish, not just irregular-verb correction), bigger r helps until you're effectively doing full FT.
How big the checkpoint is¶
A LoRA checkpoint at rank r in fp16 is 2 × N_LoRA(r) bytes:
| Rank `r` | LoRA params | fp16 bytes | vs Mini-GPT base (103 680 × 4 = 414 720) |
|---|---|---|---|
| 4 | 9 216 | 18 432 (≈ 18 KiB) | 4.4% |
| 8 | 18 432 | 36 864 (≈ 36 KiB) | 8.9% |
| 16 | 36 864 | 73 728 (≈ 72 KiB) | 17.8% |
Even a rank-16 adapter is under a fifth of the fp32 base checkpoint; at rank 4 you could ship about 22 distinct adapters for the same disk cost as one Mini-GPT copy.
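The size table can be reproduced from the closed-form count, along with how many whole adapters fit in the space of one base checkpoint. This is a rough sketch: it counts raw tensor bytes only and ignores any file-format overhead, and `adapter_bytes` is an illustrative name.

```python
BASE_PARAMS = 103_680
BYTES_FP32, BYTES_FP16 = 4, 2

def adapter_bytes(r):
    """Raw fp16 size of a rank-r adapter over all 12 Linears (2304 * r params)."""
    return BYTES_FP16 * 2304 * r

base_bytes = BASE_PARAMS * BYTES_FP32   # fp32 base checkpoint, raw bytes

# Whole adapters that fit in the disk footprint of one base copy.
adapters_per_base = {r: base_bytes // adapter_bytes(r) for r in (4, 8, 16)}

for r, n in adapters_per_base.items():
    pct = 100 * adapter_bytes(r) / base_bytes
    print(f"r={r}: {adapter_bytes(r)} bytes, {pct:.1f}% of base, {n} adapters per base copy")
```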
Citations¶
- Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.
- Dettmers, Pagnoni, Holtzman, Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314.
One-paragraph recap¶
For Mini-GPT, the closed-form LoRA parameter count over all 12 Linear layers is exactly 2304 × r trainable parameters; at the default r = 8 that's 18 432 params, ≈ 18% of the model. The relative efficiency is unflattering because Mini-GPT's hidden dim is small — production transformers see 0.1-0.3% trainable fractions for the same rank. The rank-vs-accuracy curve saturates around r = 8 on the §A13 irregular-verb fine-tune; that saturation shape is the empirical signature you should reproduce. The B = 0 initialization ensures the model starts equal to the base; the α/r scaling decouples update magnitude from rank choice. The result: a robust, predictable parameter knob whose math you can verify by hand.
Next: lab/00-lora-by-hand.md to implement LoRALinear and verify the count.