05 — LoRA on the actual Mini-GPT: ΔW = BA, exact parameter count, rank trade-off

The theory file 02-parameter-count.md uses h = 768 as a generic example (GPT-2 scale). This file applies the same derivation to the actual Mini-GPT (d_model = 64, d_ff = 256, n_layers = 2) and computes the exact LoRA parameter count for the ranks we are going to train. It also derives the rank-vs-accuracy regime with concrete figures.

Anchors: Phase 17 lab/02-parameter-inventory.md (Mini-GPT total = 103 680), this phase theory/02-parameter-count.md (the generic derivation).


ΔW = BA, written out

LoRA reparameterizes the weight update for a Linear layer W ∈ ℝ^(out × in):

\[ W_{\text{eff}} = W + \frac{\alpha}{r} \, B A, \qquad B \in \mathbb{R}^{\text{out} \times r}, \quad A \in \mathbb{R}^{r \times \text{in}} \]

with B initialized to zero and A initialized to small random values. Three invariants are non-negotiable:

  1. Rank constraint. rank(BA) ≤ r: the columns of every update lie in the span of B's r columns, and its rows in the span of A's r rows.
  2. Step-0 invariance. B = 0 at init means BA = 0, so the LoRA-equipped model produces outputs identical to the base model's until the first optimizer step.
  3. Scaling decoupling. The α/r factor keeps the effective update magnitude roughly constant as r varies, so α needs little retuning when r changes.

Full fine-tuning trains all out × in weights of the layer; LoRA freezes them (and everything else in the model) and trains only the r × (in + out) entries of A and B.
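A minimal pure-Python sketch of the reparameterization, assuming toy sizes (out = 4, in = 6, r = 2, α = 16) rather than Mini-GPT's real shapes. The helper names `matmul` and `lora_effective_weight` are hypothetical and are not taken from the lab code. It checks step-0 invariance and the r × (in + out) trainable count.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    BA = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

random.seed(0)
out_dim, in_dim, r, alpha = 4, 6, 2, 16.0   # toy sizes (assumption), not Mini-GPT's

W = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(r)]     # small random init
B = [[0.0] * r for _ in range(out_dim)]                                    # zero init

W_eff = lora_effective_weight(W, A, B, alpha)
step0_invariant = (W_eff == W)       # B = 0, so the update is exactly zero

trainable = r * (in_dim + out_dim)   # entries of A plus entries of B
print(step0_invariant, trainable)
```

Once B receives its first gradient update, `W_eff` starts to differ from `W`, but the difference is always a product of an (out × r) and an (r × in) matrix, so its rank cannot exceed r.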

Mini-GPT's actual Linear inventory

From Phase 17's parameter inventory:

| Layer | Shape (out, in) | Params | Notes |
|---|---|---|---|
| block_0.attn.W_q | (64, 64) | 4 096 | no bias |
| block_0.attn.W_k | (64, 64) | 4 096 | no bias |
| block_0.attn.W_v | (64, 64) | 4 096 | no bias |
| block_0.attn.W_o | (64, 64) | 4 096 | no bias |
| block_0.ffn.W_1 | (256, 64) | 16 384 | + 256 bias |
| block_0.ffn.W_2 | (64, 256) | 16 384 | + 64 bias |
| block_1.attn.{Q,K,V,O} | (64, 64) ×4 | 16 384 | |
| block_1.ffn.W_1 | (256, 64) | 16 384 | + 256 bias |
| block_1.ffn.W_2 | (64, 256) | 16 384 | + 64 bias |
| Linear total | | 98 304 + 640 bias = 98 944 | |
| Embedding (tied) | (64, 64) | 4 096 | not LoRA-fied |
| LayerNorms | various | 640 | not LoRA-fied |
| Model total | | 103 680 | matches Phase 17 |
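The totals in the inventory can be re-derived from the shapes alone. The sketch below assumes the per-block layout listed in the table; the constant names are illustrative, and the embedding and LayerNorm counts are copied from the table rather than derived.

```python
# (name, out_features, in_features, has_bias) for one transformer block
BLOCK_LINEARS = [
    ("attn.W_q", 64, 64, False),
    ("attn.W_k", 64, 64, False),
    ("attn.W_v", 64, 64, False),
    ("attn.W_o", 64, 64, False),
    ("ffn.W_1", 256, 64, True),
    ("ffn.W_2", 64, 256, True),
]
N_LAYERS = 2
EMBEDDING_PARAMS = 4_096   # tied embedding, value taken from the table
LAYERNORM_PARAMS = 640     # all LayerNorms combined, value taken from the table

# Weight matrices contribute out * in each; a bias contributes out entries.
weights = N_LAYERS * sum(o * i for _, o, i, _ in BLOCK_LINEARS)
biases = N_LAYERS * sum(o for _, o, _, has_bias in BLOCK_LINEARS if has_bias)
linear_total = weights + biases
model_total = linear_total + EMBEDDING_PARAMS + LAYERNORM_PARAMS

print(weights, biases, linear_total, model_total)
```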

We target all 12 Linear layers for LoRA — the convention from QLoRA. Embeddings and LayerNorms stay frozen (standard practice).

Exact LoRA parameter count per rank

For a Linear of shape (out, in), LoRA adds r × (in + out) trainable params.

Attention Linears (out = in = 64)

Per Linear at rank r: r × (64 + 64) = 128 r. Eight attention Linears total (4 per block × 2 blocks). Sum: 8 × 128 r = 1024 r.

FFN W_1 Linears (out = 256, in = 64)

Per Linear at rank r: r × (64 + 256) = 320 r. Two total (one per block). Sum: 2 × 320 r = 640 r.

FFN W_2 Linears (out = 64, in = 256)

Per Linear at rank r: r × (256 + 64) = 320 r. Two total. Sum: 640 r.

Total trainable LoRA parameters

\[ N_{\text{LoRA}}(r) = 1024 r + 640 r + 640 r = 2304 \, r \]

Plug in:

| Rank r | LoRA params | Fraction of 103 680 (whole model) | Fraction of 98 944 (Linears only) |
|---|---|---|---|
| 1 | 2 304 | 2.22% | 2.33% |
| 2 | 4 608 | 4.44% | 4.66% |
| 4 | 9 216 | 8.89% | 9.31% |
| 8 | 18 432 | 17.78% | 18.63% |
| 16 | 36 864 | 35.56% | 37.26% |
| 32 | 73 728 | 71.11% | 74.51% |
| 64 | 147 456 | 142.22% (exceeds!) | 149.03% |
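The closed form can be cross-checked by summing r × (in + out) over the twelve Linears directly. This is a small sketch; `LINEAR_SHAPES` and `lora_param_count` are illustrative names, not part of the lab code.

```python
# (out_features, in_features, how many such Linears exist in Mini-GPT)
LINEAR_SHAPES = [
    (64, 64, 8),    # attention W_q, W_k, W_v, W_o in both blocks
    (256, 64, 2),   # ffn.W_1 in both blocks
    (64, 256, 2),   # ffn.W_2 in both blocks
]

def lora_param_count(r):
    """Trainable LoRA parameters when every Linear gets a rank-r adapter."""
    return sum(count * r * (in_f + out_f) for out_f, in_f, count in LINEAR_SHAPES)

for r in (1, 2, 4, 8, 16, 32, 64):
    n = lora_param_count(r)
    print(f"r={r:>2}  params={n:>7}  "
          f"of model={100 * n / 103_680:.2f}%  of Linears={100 * n / 98_944:.2f}%")
```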

At r = 64, LoRA has more trainable parameters than full FT. For a square layer, r ≥ min(in, out) lets BA represent any update, so LoRA has the same capacity as full FT but stores it as two matrices. The "LoRA is parameter-efficient" claim therefore holds only while r × (in + out) < in × out, which for a square layer means r < min(in, out) / 2.
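The break-even rank follows directly from the two parameter counts. A rank-r adapter is smaller than the layer it adapts only while

\[ r \, (\text{in} + \text{out}) < \text{in} \cdot \text{out} \quad \Longleftrightarrow \quad r < \frac{\text{in} \cdot \text{out}}{\text{in} + \text{out}} \]

Applied to Mini-GPT's shapes (biases ignored):

\[ \text{attention } (64, 64): \; r < \frac{4096}{128} = 32, \qquad \text{FFN } (256, 64): \; r < \frac{16\,384}{320} = 51.2, \qquad \text{all 12 Linears}: \; r < \frac{98\,304}{2304} \approx 42.7 \]

So r = 32 is exactly break-even for the attention layers, and the full adapter set overtakes the frozen Linear weights between r = 42 and r = 43.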

For the §A13 grammar-tutor task, the canonical choices are r = 4 or r = 8 — small enough to be efficient, large enough to express irregular-verb-specific updates.

Why bigger models benefit more (Mini-GPT is unusually unflattering to LoRA)

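Recall the formula from the theory file 02-parameter-count.md: ratio per layer = 1.5 r / h. Here "layer" means one transformer layer; the factor 1.5 corresponds to adapting all six of its Linears with d_ff = 4h. For:

  • Mini-GPT (h = 64): ratio = 1.5 × 8 / 64 = 0.19 → 19% per layer at r = 8 (a single square Linear on its own is 2r / h = 25%). Huge by LoRA standards.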
  • LLaMA-7B (h = 4096): ratio = 1.5 × 8 / 4096 = 0.003 → 0.3%.
  • LLaMA-70B (h = 8192): ratio = 0.0015 → 0.15%.
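The 1.5 r / h figure can be checked against an exact count. The sketch below assumes a GPT-style layer with four h × h attention Linears and an FFN of width 4h, which is Mini-GPT's layout; `block_lora_ratio` is a hypothetical helper name. Real LLaMA layers use a gated FFN with different widths, so for those models the formula is best read as an order-of-magnitude estimate.

```python
def block_lora_ratio(h, r, ff_mult=4):
    """LoRA params divided by frozen weight params for one transformer layer.

    Assumes four h x h attention Linears and two FFN Linears of shapes
    (ff_mult*h, h) and (h, ff_mult*h); biases are ignored.
    """
    d_ff = ff_mult * h
    full = 4 * h * h + 2 * h * d_ff               # 12 h^2 when ff_mult = 4
    lora = 4 * r * (h + h) + 2 * r * (h + d_ff)   # 18 r h when ff_mult = 4
    return lora / full

for name, h in [("Mini-GPT", 64), ("h = 4096", 4096), ("h = 8192", 8192)]:
    exact = block_lora_ratio(h, 8)
    print(f"{name:>9}: exact = {exact:.4%}, 1.5 r / h = {1.5 * 8 / h:.4%}")
```

With ff_mult = 4 the two counts reduce to 18 r h and 12 h², whose quotient is exactly 1.5 r / h.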

Mini-GPT is small enough that LoRA's relative efficiency is mediocre. We use it anyway because:

  1. Pedagogy: the rank trade-off is measurable on a tiny model.
  2. CPU-feasibility: even rank-8 LoRA on Mini-GPT trains in seconds on Borja's i5-8250U; rank-8 on LLaMA-7B does not.
  3. Conceptual clarity: the formula 2304 r is closed-form and can be hand-checked.

For Phase 28's lab we use r = 8 as the default. Phase 32's grammar tutor will use the same rank.

The rank-vs-accuracy regime (predictions, not measurements)

LoRA accuracy on §A13 irregular-verb fine-tuning, expected:

| Rank r | LoRA params | Expected eval accuracy (irreg verbs) | Notes |
|---|---|---|---|
| 0 | 0 | base accuracy | degenerate: no update possible (see /break exercise) |
| 1 | 2 304 | 60-70% | one-dimensional update; not enough for 8 irregulars |
| 2 | 4 608 | 75-85% | starting to fit |
| 4 | 9 216 | 88-93% | sweet spot |
| 8 | 18 432 | 92-95% | sweet spot; default |
| 16 | 36 864 | 93-95% | diminishing returns |
| 32 | 73 728 | 93-95% | overfits the small irreg set |

Empirical pattern from the LoRA paper (Hu et al., 2021): accuracy saturates quickly at small ranks, and for this task we expect the curve to flatten around r = 8. The Phase 28 DoD asks you to reproduce this saturation shape on Mini-GPT; the absolute numbers will shift, but the shape should not.

The "sweet spot at r = 8" is task-specific. For more complex updates (e.g., a domain shift from English to Spanish, not just irregular-verb correction), bigger r helps until you're effectively doing full FT.
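One way to make "saturation" checkable in the lab is to define it as the smallest rank whose accuracy is within a tolerance of the best observed. The sketch below is an assumption about how one might score a sweep, not the Phase 28 DoD's definition. `saturation_rank` is a hypothetical helper, and the inputs are the midpoints of the predicted ranges above, not measurements.

```python
def saturation_rank(accuracy_by_rank, tolerance=1.0):
    """Smallest rank whose accuracy is within `tolerance` points of the best."""
    best = max(accuracy_by_rank.values())
    return min(r for r, acc in accuracy_by_rank.items() if acc >= best - tolerance)

# Midpoints of the predicted ranges (predictions, not measurements).
predicted = {1: 65.0, 2: 80.0, 4: 90.5, 8: 93.5, 16: 94.0, 32: 94.0}

print(saturation_rank(predicted))        # smallest rank within 1 point of the best
print(saturation_rank(predicted, 5.0))   # a looser tolerance picks a smaller rank
```

Replacing `predicted` with accuracies measured in the lab gives a single number to compare against the r = 8 prediction.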

How big the checkpoint is

A LoRA checkpoint at rank r in fp16 is 2 × N_LoRA(r) bytes:

| Rank r | LoRA params | fp16 bytes | Fraction of fp32 Mini-GPT base (103 680 × 4 = 414 720 bytes) |
|---|---|---|---|
| 4 | 9 216 | 18 432 (= 18 KiB) | 4.4% |
| 8 | 18 432 | 36 864 (= 36 KiB) | 8.9% |
| 16 | 36 864 | 73 728 (= 72 KiB) | 17.8% |

Even the rank-16 adapter is under a fifth of the size of the fp32 base checkpoint. At the default r = 8, about 11 distinct adapters fit in the disk space of one Mini-GPT copy (about 22 at r = 4).
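The size arithmetic can be sketched in a few lines, assuming raw weights only (no file-format overhead), fp16 adapters, and an fp32 base as in the table. `adapter_bytes` is an illustrative name.

```python
BASE_PARAMS = 103_680          # Mini-GPT total from Phase 17
BASE_BYTES = BASE_PARAMS * 4   # fp32 base checkpoint, 4 bytes per parameter

def adapter_bytes(r, bytes_per_param=2):
    """Raw size of a rank-r adapter over all 12 Linears (fp16 by default)."""
    return 2304 * r * bytes_per_param

for r in (4, 8, 16):
    size = adapter_bytes(r)
    print(f"r={r:>2}: {size} bytes = {size / 1024:.0f} KiB, "
          f"{100 * size / BASE_BYTES:.1f}% of base, "
          f"{BASE_BYTES // size} whole adapters per base copy")
```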

Citations

  • Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.
  • Dettmers, Pagnoni, Holtzman, Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314.

One-paragraph recap

For Mini-GPT, the closed-form LoRA parameter count over all 12 Linear layers is exactly 2304 × r trainable parameters; at the default r = 8 that's 18 432 params, ≈ 18% of the model. The relative efficiency is unflattering because Mini-GPT's hidden dim is small — production transformers see 0.1-0.3% trainable fractions for the same rank. The rank-vs-accuracy curve saturates around r = 8 on the §A13 irregular-verb fine-tune; that saturation shape is the empirical signature you should reproduce. The B = 0 initialization ensures the model starts equal to the base; the α/r scaling decouples update magnitude from rank choice. The result: a robust, predictable parameter knob whose math you can verify by hand.

Next: lab/00-lora-by-hand.md to implement LoRALinear and verify the count.