05 — LoRA on the actual Mini-GPT: ΔW = BA, exact parameter count, rank trade-off¶

The theory file 02-parameter-count.md uses h = 768 as a generic example (GPT-2 scale). This file applies the same derivation to the actual Mini-GPT (d_model=64, d_ff=256, n_layers=2) and computes the exact LoRA parameter count for the ranks we are going to train. It also lays out the rank-vs-accuracy regime with concrete figures.

Anchors: Phase 17 lab/02-parameter-inventory.md (Mini-GPT total = 103 680), this phase theory/02-parameter-count.md (the generic derivation).

ΔW = BA, written out¶

LoRA reparameterizes the weight update for a Linear layer W ∈ ℝ^(out × in):

\[ W_{\text{eff}} = W + \frac{\alpha}{r} \, B A, \qquad B \in \mathbb{R}^{\text{out} \times r}, \quad A \in \mathbb{R}^{r \times \text{in}} \]

with B initialized to zero and A initialized to small random values. Three invariants are non-negotiable:

Rank constraint. rank(BA) ≤ r. Every update has row and column spaces of dimension at most r, so it stays within the rank-≤-r matrices instead of ranging over the full out × in weight space.
Step-0 invariance. B = 0 at init means BA = 0 and the LoRA-equipped model produces identical outputs to the base for the first forward pass.
Scaling decoupling. α/r keeps the effective update magnitude roughly constant as r varies, so tuning α doesn't need to follow r.

The non-LoRA path keeps the full out × in parameters; the LoRA path adds r × (in + out) trainable params and freezes everything else.
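The definition and two of the invariants can be checked on a toy layer. The following is a minimal pure-Python sketch, not the lab's `LoRALinear`: the helper names (`matmul`, `matvec`, `lora_effective_weight`) and the toy sizes are assumptions made for illustration, and the rank constraint is not checked here.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_effective_weight(W, B, A, alpha, r):
    """Return W_eff = W + (alpha / r) * B A."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

random.seed(0)
out_dim, in_dim, r, alpha = 6, 4, 2, 16.0  # toy sizes (assumed, not Mini-GPT's)

W = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]  # frozen base
A = [[random.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(r)]     # small random
B = [[0.0] * r for _ in range(out_dim)]                                    # zeros at init

x = [random.gauss(0, 1) for _ in range(in_dim)]
base_out = matvec(W, x)
lora_out = matvec(lora_effective_weight(W, B, A, alpha, r), x)

# Step-0 invariance: B = 0 makes BA = 0, so the outputs match exactly.
step0_identical = base_out == lora_out

# Trainable parameters: entries of A (r * in) plus entries of B (out * r).
trainable = len(A) * len(A[0]) + len(B) * len(B[0])
```

With these toy sizes `trainable` is 2 × (4 + 6) = 20, against 24 frozen entries in `W`. Only `A` and `B` would receive gradients in a real training loop.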

Mini-GPT's actual Linear inventory¶

From Phase 17's parameter inventory:

Layer	Shape (out, in)	Params	Notes
`block_0.attn.W_q`	(64, 64)	4 096	no bias
`block_0.attn.W_k`	(64, 64)	4 096	no bias
`block_0.attn.W_v`	(64, 64)	4 096	no bias
`block_0.attn.W_o`	(64, 64)	4 096	no bias
`block_0.ffn.W_1`	(256, 64)	16 384 + 256 bias
`block_0.ffn.W_2`	(64, 256)	16 384 + 64 bias
`block_1.attn.W_{q,k,v,o}`	(64, 64) ×4	16 384	no bias
`block_1.ffn.W_1`	(256, 64)	16 384 + 256 bias
`block_1.ffn.W_2`	(64, 256)	16 384 + 64 bias
Linear total		98 304 + 640 bias = 98 944
Embedding (tied)	(64, 64)	4 096	not LoRA-fied
LayerNorms	various	640	not LoRA-fied
Model total		103 680	matches Phase 17

We target all 12 Linear layers for LoRA — the convention from QLoRA. Embeddings and LayerNorms stay frozen (standard practice).
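The totals in the inventory can be re-derived from the three hyperparameters. This is a small sketch under the assumption that each block has exactly the six Linears listed in the table, with biases only on the FFN layers; the embedding and LayerNorm counts are copied from the table, not derived.

```python
D_MODEL, D_FF, N_LAYERS = 64, 256, 2

def block_linears(d_model, d_ff):
    """Return (name, out, in, has_bias) for the six Linears in one block."""
    attn = [(f"attn.W_{n}", d_model, d_model, False) for n in "qkvo"]
    ffn = [("ffn.W_1", d_ff, d_model, True), ("ffn.W_2", d_model, d_ff, True)]
    return attn + ffn

linears = [(f"block_{i}.{name}", out_dim, in_dim, has_bias)
           for i in range(N_LAYERS)
           for name, out_dim, in_dim, has_bias in block_linears(D_MODEL, D_FF)]

linear_weights = sum(o * i for _, o, i, _ in linears)       # weight matrices only
linear_biases = sum(o for _, o, _, b in linears if b)       # one bias entry per output
EMBEDDING = 4_096    # tied embedding, taken from the table
LAYERNORMS = 640     # taken from the table
model_total = linear_weights + linear_biases + EMBEDDING + LAYERNORMS

print(len(linears), linear_weights, linear_biases, model_total)
```

The printed values should be 12 Linears, 98 304 weights, 640 biases, and a model total of 103 680.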

Exact LoRA parameter count per rank¶

For a Linear of shape (out, in), LoRA adds r × (in + out) trainable params.

Attention Linears (`out = in = 64`)¶

Per Linear at rank r: r × (64 + 64) = 128 r. Eight attention Linears total (4 per block × 2 blocks). Sum: 8 × 128 r = 1024 r.

FFN `W_1` Linears (`out = 256, in = 64`)¶

Per Linear at rank r: r × (64 + 256) = 320 r. Two total (one per block). Sum: 2 × 320 r = 640 r.

FFN `W_2` Linears (`out = 64, in = 256`)¶

Per Linear at rank r: r × (256 + 64) = 320 r. Two total. Sum: 640 r.

Total trainable LoRA parameters¶

\[ N_{\text{LoRA}}(r) = 1024 r + 640 r + 640 r = 2304 \, r \]

Plug in:

Rank `r`	LoRA params	Fraction of 103 680	Fraction of 98 944 (Linears only)
1	2 304	2.22%	2.33%
2	4 608	4.44%	4.66%
4	9 216	8.89%	9.31%
8	18 432	17.78%	18.63%
16	36 864	35.56%	37.26%
32	73 728	71.11%	74.51%
64	147 456	142.22% (exceeds!)	149.03%

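At r = 64, LoRA has more parameters than full FT. Once r ≥ min(in, out), BA can represent any out × in update, so the decomposition has the same capacity as full FT but stores it as two matrices. The parameter saving disappears earlier than that: r × (in + out) < in × out requires r < in · out / (in + out), which is 32 = min(in, out) / 2 for the square 64 × 64 layers and 51.2 for the FFN layers. Over all 12 Linears the break-even is 98 304 / 2304 ≈ 42.7, so the "LoRA is parameter-efficient" claim holds only for ranks well below that.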
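The table and the break-even ranks can be recomputed from the layer shapes alone. A short sketch, with the function names (`lora_params`, `pct`, `break_even_rank`) chosen here for illustration:

```python
from fractions import Fraction

# (out, in, count) for Mini-GPT's Linear layers, from the inventory table
LINEAR_SHAPES = [(64, 64, 8), (256, 64, 2), (64, 256, 2)]
MODEL_TOTAL = 103_680      # all parameters
LINEAR_TOTAL = 98_944      # Linear weights + biases
LINEAR_WEIGHTS = 98_304    # Linear weights only

def lora_params(r):
    """Trainable LoRA parameters at rank r: sum of r * (in + out) over all Linears."""
    return sum(count * r * (in_dim + out_dim) for out_dim, in_dim, count in LINEAR_SHAPES)

def pct(part, whole):
    """Percentage rounded to two decimals, as in the table."""
    return round(100 * part / whole, 2)

rows = [(r, lora_params(r), pct(lora_params(r), MODEL_TOTAL), pct(lora_params(r), LINEAR_TOTAL))
        for r in (1, 2, 4, 8, 16, 32, 64)]

def break_even_rank(out_dim, in_dim):
    """Rank at which r * (in + out) equals out * in for a single Linear."""
    return Fraction(out_dim * in_dim, out_dim + in_dim)

for row in rows:
    print(row)
print(break_even_rank(64, 64), float(break_even_rank(256, 64)))
```

`Fraction` keeps the break-even ranks exact, so the 51.2 for the FFN layers is not a rounding artifact.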

For the §A13 grammar-tutor task, the canonical choices are r = 4 or r = 8 — small enough to be efficient, large enough to express irregular-verb-specific updates.

Why bigger models benefit more (Mini-GPT is unusually unflattering to LoRA)¶

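Recall the formula from the theory file 02-parameter-count.md: the ratio of LoRA parameters to base weights per transformer layer (all six Linears, d_ff = 4h) is 1.5 r / h. At r = 8:

Mini-GPT (h = 64): ratio = 1.5 × 8 / 64 = 0.1875 → about 19% per transformer block (a single square Linear alone is 2r / h = 25%). Huge by LoRA standards.
LLaMA-7B (h = 4096): ratio = 1.5 × 8 / 4096 ≈ 0.003 → roughly 0.3%.
LLaMA-70B (h = 8192): ratio ≈ 0.0015 → roughly 0.15%.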
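One way to recover the 1.5 r / h ratio is to count one transformer block directly: four h × h attention Linears plus two FFN Linears with d_ff = 4h give 12h² base weights and 18rh LoRA parameters. The sketch below makes those assumptions explicit; `block_lora_ratio` and `ff_mult` are names introduced here, and biases are ignored.

```python
def block_lora_ratio(h, r, ff_mult=4):
    """LoRA-to-base weight ratio for one transformer block, all six Linears adapted.

    Assumes four h x h attention Linears and two FFN Linears of shape
    (ff_mult * h, h) and (h, ff_mult * h). Biases are ignored.
    """
    d_ff = ff_mult * h
    base = 4 * h * h + 2 * h * d_ff                 # 12 h^2 when ff_mult = 4
    lora = 4 * r * (h + h) + 2 * r * (h + d_ff)     # 18 r h when ff_mult = 4
    return lora / base

ratios = {h: block_lora_ratio(h, r=8) for h in (64, 4096, 8192)}
for h, ratio in ratios.items():
    print(h, f"{100 * ratio:.2f}%")
```

For Mini-GPT this agrees with the exact count: 2304 × 8 / 98 304 = 0.1875. Real LLaMA blocks use a gated FFN with three projections and a different d_ff, so the two LLaMA figures are order-of-magnitude estimates only.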

Mini-GPT is small enough that LoRA's relative efficiency is mediocre. We use it anyway because:

Pedagogy: the rank trade-off is measurable on a tiny model.
CPU-feasibility: even rank-8 LoRA on Mini-GPT trains in seconds on Borja's i5-8250U; rank-8 on LLaMA-7B does not.
Conceptual clarity: the formula 2304 r is closed-form and can be hand-checked.

For Phase 28's lab we use r = 8 as the default. Phase 32's grammar tutor will use the same rank.

The rank-vs-accuracy regime (predictions, not measurements)¶

LoRA accuracy on §A13 irregular-verb fine-tuning, expected:

Rank `r`	LoRA params	Expected eval accuracy (irreg verbs)	Notes
0	0	= base accuracy	degenerate; no update possible (see `/break` exercise)
1	2 304	60-70%	one-dimensional update; not enough for 8 irregulars
2	4 608	75-85%	starting to fit
4	9 216	88-93%	sweet spot
8	18 432	92-95%	sweet spot; default
16	36 864	93-95%	diminishing returns
32	73 728	93-95%	no further gain; risk of overfitting the small irregular-verb set

Empirical pattern from the LoRA paper (Hu et al., 2021): accuracy saturates at low rank, with little gain beyond r ≈ 8 in their experiments. The Phase 28 DoD asks you to reproduce this saturation shape on Mini-GPT; the absolute numbers will shift, but the shape is expected to hold.

The "sweet spot at r = 8" is task-specific. For more complex updates (e.g., a domain shift from English to Spanish, not just irregular-verb correction), bigger r helps until you're effectively doing full FT.
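To make "saturation" concrete when you run the sweep, one option is to report the smallest rank whose accuracy is within a tolerance of the best one. The helper below is a hypothetical sketch (`saturation_rank` and `tol` are not part of the lab code), and the input dictionary holds midpoints of the predicted ranges from the table, not measurements.

```python
def saturation_rank(acc_by_rank, tol=1.0):
    """Smallest rank whose accuracy is within `tol` points of the best accuracy."""
    best = max(acc_by_rank.values())
    return min(r for r, acc in acc_by_rank.items() if acc >= best - tol)

# Placeholder inputs: midpoints of the predicted ranges in the table,
# NOT measured results. Replace with your own eval accuracies.
predicted_midpoints = {1: 65.0, 2: 80.0, 4: 90.5, 8: 93.5, 16: 94.0, 32: 94.0}

knee = saturation_rank(predicted_midpoints, tol=1.0)
print(knee)
```

On the predicted midpoints the knee lands at r = 8. A measured sweep may place it elsewhere, which is exactly what the lab is meant to reveal.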

How big the checkpoint is¶

A LoRA checkpoint at rank r in fp16 takes 2 × N_LoRA(r) bytes of raw tensor data (file-format overhead not counted):

Rank `r`	LoRA params	fp16 bytes	Fraction of fp32 Mini-GPT base (`103 680 × 4 = 414 720` bytes)
4	9 216	18 432 (≈ 18 KiB)	4.4%
8	18 432	36 864 (≈ 36 KiB)	8.9%
16	36 864	73 728 (≈ 72 KiB)	17.8%

Even the rank-16 adapter is under a fifth of the base checkpoint. At the default r = 8 you could store 11 distinct adapters (22 at r = 4) in the disk space of one fp32 Mini-GPT copy.
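The size arithmetic is simple enough to script. A minimal sketch, assuming raw tensor bytes only (no serialization overhead) and an fp32 base checkpoint; the function names are illustrative:

```python
LORA_PARAMS_PER_RANK = 2304       # N_LoRA(r) = 2304 r
BASE_BYTES_FP32 = 103_680 * 4     # 414 720 bytes

def adapter_bytes(r, bytes_per_param=2):
    """Raw tensor bytes of a rank-r adapter (fp16 by default)."""
    return bytes_per_param * LORA_PARAMS_PER_RANK * r

def adapters_per_base(r):
    """Whole rank-r fp16 adapters that fit in the bytes of one fp32 base checkpoint."""
    return BASE_BYTES_FP32 // adapter_bytes(r)

sizes = {r: (adapter_bytes(r), adapters_per_base(r)) for r in (4, 8, 16)}
for r, (n_bytes, n_fit) in sizes.items():
    print(r, n_bytes, n_fit)
```

This gives 22, 11 and 5 whole adapters per base copy at r = 4, 8 and 16 respectively.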

Citations¶

Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.
Dettmers, Pagnoni, Holtzman, Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314.

One-paragraph recap¶

For Mini-GPT, the closed-form LoRA parameter count over all 12 Linear layers is exactly 2304 × r trainable parameters; at the default r = 8 that's 18 432 params, ≈ 18% of the model. The relative efficiency is unflattering because Mini-GPT's hidden dim is small — production transformers see 0.1-0.3% trainable fractions for the same rank. The rank-vs-accuracy curve saturates around r = 8 on the §A13 irregular-verb fine-tune; that saturation shape is the empirical signature you should reproduce. The B = 0 initialization ensures the model starts equal to the base; the α/r scaling decouples update magnitude from rank choice. The result: a robust, predictable parameter knob whose math you can verify by hand.

Next: lab/00-lora-by-hand.md to implement LoRALinear and verify the count.