02 — Parameter Count: LoRA vs Full Fine-Tuning

The arithmetic is simple, but the result matters: a Linear(768, 768) has about 590K trainable parameters under full fine-tuning; with LoRA at r=8, about 12K. When you scale this up to an entire transformer (hundreds of Linears), the ratio holds. This section formalizes that count.

The setup

A standard Linear(in, out) layer has:

Weight matrix W ∈ ℝ^{out × in}: out × in parameters.
Bias b ∈ ℝ^{out}: out parameters. (Often omitted in modern transformers; we keep it for generality.)

Total: out × in + out parameters.

With LoRA at rank r:

The base W, b are frozen — they exist but don't update.
Two new matrices: A ∈ ℝ^{r × in} and B ∈ ℝ^{out × r}.

Trainable LoRA parameters: r × in + out × r = r × (in + out).
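The two counts above can be checked with a few lines of Python. This is a minimal sketch; the helper names `linear_params` and `lora_params` are illustrative and not part of any library.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Trainable parameters of a Linear layer under full fine-tuning."""
    weight = out_features * in_features
    return weight + (out_features if bias else 0)


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters of a rank-r adapter: A is (r, in), B is (out, r)."""
    return r * in_features + out_features * r


full = linear_params(768, 768)       # 768 * 768 + 768
lora = lora_params(768, 768, r=8)    # 8 * (768 + 768)
print(full, lora)                    # 590592 12288
```

The frozen W and b do not appear in `lora_params` because they contribute nothing to the trainable count.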

The ratio

The ratio of trainable parameters (LoRA over full):

\[ \text{ratio} = \frac{r(in + out)}{out \cdot in + out} \approx \frac{r(in + out)}{out \cdot in} \quad \text{for } in \gg 1 \]

For a square layer in = out = h:

\[ \text{ratio} \approx \frac{2rh}{h^2} = \frac{2r}{h} \]

For h = 768, r = 8: ratio = 16/768 ≈ 2.1%. For h = 4096, r = 8: ratio = 16/4096 ≈ 0.4%. For h = 4096, r = 16: ratio = 0.8%.

Bigger models benefit more. With r fixed, the ratio 2r/h falls in inverse proportion to h, because the rank r is chosen independently of model size and does not need to grow with it.
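To see how close the 2r/h approximation is to the exact ratio (bias included), the following sketch evaluates both for the three cases quoted above. The function name `lora_ratio` is an assumption made for illustration.

```python
def lora_ratio(in_features: int, out_features: int, r: int, bias: bool = True) -> float:
    """Exact ratio of LoRA trainable parameters to full fine-tuning parameters."""
    full = out_features * in_features + (out_features if bias else 0)
    return r * (in_features + out_features) / full


for h, r in [(768, 8), (4096, 8), (4096, 16)]:
    exact = lora_ratio(h, h, r)   # denominator includes the bias term
    approx = 2 * r / h            # square-layer approximation
    print(f"h={h:5d} r={r:2d} exact={exact:.4%} approx={approx:.4%}")
```

The exact value is always slightly below 2r/h because the bias adds to the denominator; the gap is on the order of 1/h in relative terms.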

At the full-model scale

Let's compute for a generic transformer:

L layers.
Per layer: 4 attention Linears of shape (h, h) + 2 MLP Linears of shape (h, 4h) and (4h, h).

Full FT trainable params per layer (excluding biases, embeddings, layer-norms — all small):

Attention: 4 × h².
MLP: 2 × 4h² = 8h².
Total per layer: 12 h².

LoRA params per layer at rank r, assuming we LoRA-fy all 6 Linears:

Attention (4 Linears with in=out=h): 4 × 2rh = 8rh.
MLP fc1 (h, 4h): r(h + 4h) = 5rh.
MLP fc2 (4h, h): same, 5rh.
Total per layer: 8rh + 10rh = 18 rh.

Ratio per layer: 18rh / 12h² = 1.5 r / h.
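The per-layer totals can be re-derived by listing the six Linear shapes and applying the single-layer formulas to each. This is a sketch under the same simplifications as the text (weights only, no biases); the function names are hypothetical.

```python
def block_linear_shapes(h: int) -> list:
    """(in, out) shapes of the six Linears in the generic block described above."""
    attention = [(h, h)] * 4           # four (h, h) attention Linears
    mlp = [(h, 4 * h), (4 * h, h)]     # fc1 and fc2
    return attention + mlp


def per_layer_counts(h: int, r: int) -> tuple:
    """Return (full fine-tuning params, LoRA params) for one block."""
    shapes = block_linear_shapes(h)
    full = sum(i * o for i, o in shapes)          # weights only, no biases
    lora = sum(r * (i + o) for i, o in shapes)    # A and B for every Linear
    return full, lora


h, r = 768, 8
full, lora = per_layer_counts(h, r)
assert full == 12 * h * h    # matches the 12 h² total
assert lora == 18 * r * h    # matches the 18 rh total
print(full, lora, lora / full)
```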

For a typical transformer:

Model	h	L	Full params (~)	LoRA r=8 params (~)	LoRA / Full
MiniGPT (Phase 17)	768	12	`12 × 12 × 768² = 85 M`	`12 × 18 × 8 × 768 = 1.3 M`	1.6%
LLaMA-7B	4096	32	`32 × 12 × 4096² ≈ 6.4 B`	`32 × 18 × 8 × 4096 ≈ 19 M`	0.29%
LLaMA-70B	8192	80	`80 × 12 × 8192² ≈ 64 B`	`80 × 18 × 8 × 8192 ≈ 94 M`	0.15%

(Numbers are approximate — they ignore embeddings, biases, layer-norm, and exact head structure. The order-of-magnitude is correct.)
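The table rows follow directly from the closed forms L × 12h² and L × 18rh. The sketch below recomputes them; the (h, L) values are taken from the table, and `model_counts` is an illustrative helper rather than anything from the course code.

```python
MODELS = {  # name: (hidden size h, number of layers L), as in the table above
    "MiniGPT": (768, 12),
    "LLaMA-7B": (4096, 32),
    "LLaMA-70B": (8192, 80),
}


def model_counts(h: int, n_layers: int, r: int) -> tuple:
    """Return (full fine-tuning params, LoRA params), Linear weights only."""
    full = n_layers * 12 * h * h
    lora = n_layers * 18 * r * h
    return full, lora


for name, (h, n_layers) in MODELS.items():
    full, lora = model_counts(h, n_layers, r=8)
    print(f"{name:10s} full={full / 1e6:9.1f} M  lora={lora / 1e6:6.2f} M  "
          f"ratio={lora / full:.2%}")
```

Because embeddings, biases, and layer-norms are left out, the totals land somewhat below the published model sizes, as the note above already cautions.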

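For LLaMA-70B, LoRA at r=8 trains about 0.15% of the parameters. A trainable fraction this small is a large part of what makes fine-tuning giant models on limited hardware tractable, although the frozen base weights still have to fit in memory.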

Where the LoRA params actually go

Most published LoRA configurations target a subset of the Linears. The original LoRA paper put adapters only on W_q and W_v (the query and value attention projections). QLoRA's default targets all Linears.

We default to all Linears in Phase 28: this gives more capacity for the same per-Linear rank, and it is pedagogically cleaner (one rule applies everywhere).
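To make the difference between the two targeting choices concrete, the sketch below counts adapter parameters per layer for an arbitrary subset of the six Linears. The labels `q`, `k`, `v`, `o`, `fc1`, `fc2` are hypothetical names chosen for this example; real codebases name their modules differently.

```python
def lora_params_per_layer(h: int, r: int, targets: list) -> int:
    """LoRA parameters per block when only the named Linears get adapters."""
    shapes = {  # (in, out) for each Linear in the generic block
        "q": (h, h), "k": (h, h), "v": (h, h), "o": (h, h),
        "fc1": (h, 4 * h), "fc2": (4 * h, h),
    }
    return sum(r * sum(shapes[name]) for name in targets)


h, r = 4096, 8
qv_only = lora_params_per_layer(h, r, ["q", "v"])                                # 4rh
all_linears = lora_params_per_layer(h, r, ["q", "k", "v", "o", "fc1", "fc2"])    # 18rh
print(qv_only, all_linears, all_linears / qv_only)
```

At equal rank, adapting all six Linears uses 4.5 times the parameters of the query-and-value-only configuration (18rh versus 4rh).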

What this gains in storage

A LoRA checkpoint is just A, B for each adapted Linear. Sizes:

For MiniGPT r=8: ~1.3 M params × 2 bytes (fp16) ≈ 2.6 MB.
For LLaMA-7B r=8: ~19 M × 2 bytes ≈ 38 MB.

Compared with the base model checkpoint (MiniGPT 85 M × 2 = 170 MB; LLaMA-7B 7 B × 2 = 14 GB), LoRA adapters are roughly 65× and 370× smaller, about two orders of magnitude.

This enables adapter zoos: a single base model loaded once into GPU memory, plus dozens of fine-tuned adapters available on disk. Serving different "personalities" or "skills" means hot-swapping a 38 MB file in and out, not loading a new 14 GB model.
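The storage arithmetic can be reproduced as follows. This is a rough sketch assuming 2 bytes per parameter (fp16) and no file-format overhead; it prints both decimal (MB) and binary (MiB) units, which differ by about 5%.

```python
def checkpoint_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Raw size of a checkpoint, ignoring any container or metadata overhead."""
    return num_params * bytes_per_param


adapter = checkpoint_bytes(19_000_000)       # LLaMA-7B, LoRA r=8 (approximate count)
base = checkpoint_bytes(7_000_000_000)       # LLaMA-7B base weights (approximate count)

print(f"adapter: {adapter / 1e6:.0f} MB = {adapter / 2**20:.1f} MiB")
print(f"base:    {base / 1e9:.0f} GB = {base / 2**30:.1f} GiB")
print(f"base / adapter: {base / adapter:.0f}x")
```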

How `α/r` works

LoRA's forward:

\[ y = Wx + b + \frac{\alpha}{r} BAx \]

The scaling α/r decouples the update magnitude from the rank choice. Concretely: each entry of BAx is a sum over r rank-one contributions, so if you double r (more capacity) while keeping the entries of A and B at the same scale, the raw update BAx grows. Dividing by r counteracts that growth and keeps the effective update magnitude similar across rank choices.

α is then a single "update strength" hyperparameter, independent of r. Typical: α = 16.

If you forget the α/r scaling and just use BA, changing r also changes the effective learning rate of the update (doubling r roughly doubles it), which confounds experiments. The convention is standard across LoRA implementations; follow it.
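A small deterministic illustration of the scaling, in plain Python. It deliberately uses an idealized case where every entry of A and B has the same value, so all r contributions line up and the raw update grows linearly with r; with independent random entries the growth would be closer to √r. The function `lora_update` and the toy sizes are assumptions made for this sketch.

```python
def lora_update(x, A, B, alpha=None):
    """Compute B(Ax); if alpha is given, scale the result by alpha / r."""
    r = len(A)                                                 # A has r rows
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]   # shape (r,)
    BAx = [sum(b * z for b, z in zip(row, Ax)) for row in B]   # shape (out,)
    scale = 1.0 if alpha is None else alpha / r
    return [scale * v for v in BAx]


in_features, out_features = 4, 3
x = [1.0] * in_features
for r in (2, 4, 8):
    A = [[0.1] * in_features for _ in range(r)]    # (r, in), idealized equal entries
    B = [[0.1] * r for _ in range(out_features)]   # (out, r), idealized equal entries
    raw = lora_update(x, A, B)                     # grows in proportion to r
    scaled = lora_update(x, A, B, alpha=16)        # stays the same across r
    print(r, round(raw[0], 4), round(scaled[0], 4))
```

In this toy setting the unscaled output doubles each time r doubles, while the α/r-scaled output does not change.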

Initialization

Standard LoRA initialization:

A ~ N(0, 1/r) (or scaled-uniform, depends on implementation).
B = 0 (exact zero).

So at step 0: BA = 0, and y = Wx + b + 0. The model behaves exactly like the pretrained model. The first gradient step moves only B: the gradient with respect to A is proportional to Bᵀ, which is still zero. Once B is small but non-zero, A starts receiving gradient as well, and the product BA becomes a small low-rank update.

Why not initialize both randomly? If both A and B start random with the same scale, BA is non-zero with a wild distribution, and step 0 already jolts the model away from W. Slower convergence, worse forgetting profile.

Why B = 0 and not A = 0? Symmetric — either works. The convention is B = 0; we follow.
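The behaviour at step 0 can be checked numerically with tiny matrices. The sketch below uses plain Python lists, an arbitrary Gaussian scale for A, and a made-up vector `g` standing in for the upstream gradient dL/dy; all of these are illustrative assumptions. It uses the standard gradients of y = Wx + b + (α/r)BAx, namely dL/dB = (α/r) g (Ax)ᵀ and dL/dA = (α/r) (Bᵀg) xᵀ.

```python
import random


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]


random.seed(0)
in_features, out_features, r, alpha = 4, 3, 2, 16
scale = alpha / r
A = [[random.gauss(0.0, 1.0) for _ in range(in_features)] for _ in range(r)]  # random A
B = [[0.0] * r for _ in range(out_features)]                                  # B = 0

x = [1.0, -2.0, 0.5, 3.0]
delta = [scale * v for v in matvec(B, matvec(A, x))]   # (alpha / r) * B A x
assert all(v == 0.0 for v in delta)                    # adapter adds nothing at step 0

g = [0.3, -0.1, 0.7]                                   # stand-in for dL/dy
grad_B = [[scale * v for v in row] for row in outer(g, matvec(A, x))]
grad_A = [[scale * v for v in row] for row in outer(matvec(transpose(B), g), x)]

print("grad_B non-zero:", any(v != 0.0 for row in grad_B for v in row))
print("grad_A all zero:", all(v == 0.0 for row in grad_A for v in row))
```

With B = 0 the output is untouched and B receives a non-zero gradient, while the gradient reaching A is exactly zero until B has moved.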

A pitfall: parameter-count traps

When reporting "LoRA trains 0.7% of parameters", be careful: this is the fraction of trainable parameters, not the fraction of memory. The total footprint still includes the frozen base model (and the activations), so saying "LoRA reduces memory by 99.3%" is wrong. The real saving comes from not allocating gradients, Adam state, and master copies for the frozen weights. Theory 03 derives the memory footprint precisely.
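A back-of-the-envelope estimate shows why the memory saving is much smaller than the trainable-parameter ratio suggests. The byte counts here are assumptions, not figures from this course: 2 bytes for each frozen fp16 weight, and 16 bytes for each trainable parameter under mixed-precision Adam (fp16 weight, fp16 gradient, fp32 master copy, two fp32 moments). Activations are ignored entirely.

```python
def training_bytes(total_params: int, trainable_params: int,
                   frozen_bytes: int = 2, trainable_bytes: int = 16) -> int:
    """Weights + gradients + optimizer state, under the assumed byte counts."""
    frozen = total_params - trainable_params
    return frozen * frozen_bytes + trainable_params * trainable_bytes


base = 6_442_450_944     # LLaMA-7B Linear weights: 32 * 12 * 4096**2
adapter = 18_874_368     # LoRA r=8 adapters: 32 * 18 * 8 * 4096

full_ft = training_bytes(base, base)               # every weight is trainable
lora = training_bytes(base + adapter, adapter)     # base frozen, adapters trainable

print(f"full FT:   {full_ft / 1e9:.1f} GB")
print(f"LoRA:      {lora / 1e9:.1f} GB")
print(f"reduction: {1 - lora / full_ft:.1%}")
```

Under these assumptions the reduction comes out near 87%, well short of the roughly 99.7% that the trainable-parameter ratio alone would suggest, because the frozen weights still occupy memory.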

Drill problems

Solutions are published when the phase opens, in solutions/02-parameter-count-ref.md.

For a Linear (in=4096, out=11008) (LLaMA MLP fc1 shape) with LoRA r=16, compute the trainable parameter count and the ratio vs full FT.
The MLP layers contribute 2 × 4h² parameters per layer; attention contributes 4 × h². What's the fraction of attention params in the LoRA budget vs MLP params? Does targeting only attention (original LoRA) give more or less capacity than targeting only MLP?
If you fix the LoRA budget at M total trainable parameters, what's the optimal rank for a square Linear (h, h)? (Hint: it's bounded by r ≤ h; for M < h^2, every choice is valid.)
Show that for r = h, LoRA can in principle represent any update ΔW (since BA can be any (out, in) matrix of rank ≤ h = min(out, in)). In what sense is LoRA r=h identical to full FT, and in what sense is it still different?

One-paragraph recap

LoRA replaces a Linear's out × in trainable parameters with two matrices totalling r × (in + out) — typically two orders of magnitude fewer. The ratio scales as r/h for square layers, so bigger models benefit more (fixed r, growing h). Across a transformer, applying LoRA to all Linears gives ~0.15–2% trainable ratio. The pedagogical takeaway: rank r is a single small dial that bounds how much you change the model, decoupled from model size. Combined with the α/r scaling convention and B=0 initialization, LoRA gives a clean, robust knob for the parameter budget. The next theory file translates this into memory footprint — the practically dominant cost during training.

Next: theory/03-memory-footprint.md.