
00 — Why Fine-Tuning Needed Reform¶

Full fine-tuning wastes resources: you adjust every weight to teach one small skill. LoRA exploits the fact that pretrained models are over-parameterized: the useful changes live in a low-rank subspace. This is an elegant and empirically sound idea, not an engineering trick.

The full fine-tuning problem¶

Suppose you have a 7B-parameter pretrained model and you want to teach it one new skill — say, "be especially good at catching irregular-verb conjugation errors in English, like goed → went". You have ~500 training examples drawn from the irregular-verb subset of the Phase 12 verb corpus.

The straightforward approach: full supervised fine-tuning (SFT). Update every weight in the model with gradient descent on cross-entropy loss against the new examples. After enough epochs, the model learns the skill.

The straightforward approach has three problems:

Memory. Training must hold the weights (7B × 2 bytes = 14 GB in fp16), their gradients (7B × 2 bytes = 14 GB in fp16), and Adam's two fp32 moment buffers m and v (7B × 4 bytes × 2 = 56 GB). Total: ~84 GB before counting activations, more than any single consumer GPU holds.
Compute. The backward pass costs roughly twice the forward pass, so each training step is about three times the work of inference on the same tokens, repeated for hundreds of steps.
Catastrophic forgetting. Updating all 7B parameters on a narrow task can degrade unrelated capabilities. The model that perfectly corrects goed → went now mis-conjugates regular -ed verbs that it handled fine before.
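The memory figure can be reproduced with a few lines of arithmetic. This is a minimal sketch under the byte widths stated above (fp16 weights and gradients, fp32 Adam moments); the function name `full_ft_state_bytes` is illustrative, and activations and framework overhead are deliberately left out.

```python
def full_ft_state_bytes(n_params, weight_bytes=2, grad_bytes=2, adam_moment_bytes=4):
    """Bytes held by weights, gradients, and Adam's m and v during full fine-tuning.

    Assumption: activations and framework overhead are not counted.
    """
    weights = n_params * weight_bytes
    grads = n_params * grad_bytes
    adam = n_params * adam_moment_bytes * 2  # two moment buffers: m and v
    return {"weights": weights, "grads": grads, "adam": adam,
            "total": weights + grads + adam}


budget = full_ft_state_bytes(7_000_000_000)
for name, n_bytes in budget.items():
    # Show both decimal GB and binary GiB, since the two differ by about 7%.
    print(f"{name:>8}: {n_bytes / 1e9:6.1f} GB  ({n_bytes / 2**30:6.1f} GiB)")
```

The total is 84 × 10⁹ bytes, which is about 78 GiB in binary units; either way it is well beyond a single consumer GPU.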

We need a method that:

Updates few parameters (low memory).
Preserves the pretrained weights (minimal forgetting).
Is mathematically clean enough to reason about.

LoRA delivers all three. Here's why it works.

The over-parameterization observation¶

Pretrained models have far more parameters than any single downstream task needs. The reason: pretraining objectives (next-token prediction on internet-scale text) reward the capacity to model a very wide range of distributions. A specific downstream task (a chatbot persona, a code style, a domain) uses only a small slice of that capacity.

Empirically: if you ask "how much do the weights actually change during fine-tuning?", the answer is "very little, and along very few directions". Specifically, the update matrix ΔW = W_finetuned − W_pretrained is approximately low-rank — even though W itself has full rank.

Hu et al. (LoRA, 2021) built their method on this observation. They hypothesize that ΔW has a low "intrinsic rank" and report that, on the tasks they tested, ranks in the single digits were enough even when W has thousands of rows and columns. Most of the remaining singular directions of ΔW contribute little.

This is a strong empirical claim. It says: the useful information added by fine-tuning lives in an r-dimensional subspace with r ≪ min(in, out).

The LoRA construction¶

If ΔW is approximately rank-r, parametrize it directly as rank-r:

\[ \Delta W = B A, \quad B \in \mathbb{R}^{out \times r}, \quad A \in \mathbb{R}^{r \times in} \]

The forward computation:

\[ y = W x + b + \frac{\alpha}{r} \cdot B (A x) \]

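W, b are frozen at pretrained values. A, B are trained. Hu et al. initialize A randomly and B to zero, so ΔW = 0 at the start of training and the model begins exactly at the pretrained function. α is a scaling constant (typically 16 or 32); dividing by r keeps the size of the update roughly comparable across ranks, so changing r needs less learning-rate retuning.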

Training: gradients flow to A, B only. Optimizer state for A, B only. The frozen W doesn't need its gradient buffer.
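To make "gradients flow to A, B only" concrete, here is a short derivation sketch from the forward formula above. The symbols s = α/r, h = A x, and g = ∂L/∂y (the gradient of the loss L arriving at the layer output) are shorthand introduced for this sketch only.

\[ \frac{\partial L}{\partial B} = s \, g \, h^{\top}, \qquad \frac{\partial L}{\partial A} = s \, B^{\top} g \, x^{\top}, \qquad \frac{\partial L}{\partial x} = W^{\top} g + s \, A^{\top} B^{\top} g \]

No ∂L/∂W is ever formed, which is where the gradient and optimizer memory goes away. The backward pass still multiplies by Wᵀ to pass the gradient on to earlier layers, so the compute saving is smaller than the memory saving.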

Inference: optionally merge A, B into W at serving time: W_eff = W + (α/r) B A. Same compute as plain Linear. Or keep them separate to swap adapters per request.
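The forward formula and the merge can be checked against each other numerically. What follows is a minimal pure-Python sketch, not the LoRALinear from src/minituner/; the helper names (`matvec`, `matmul`, `lora_forward`, `merge`) and the tiny shapes are assumptions chosen for illustration.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def lora_forward(W, b, A, B, x, alpha, r):
    """y = W x + b + (alpha / r) * B (A x), with the adapter kept separate."""
    scale = alpha / r
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # two thin products; B A is never formed
    return [w + bi + scale * lr for w, bi, lr in zip(base, b, low_rank)]


def merge(W, A, B, alpha, r):
    """W_eff = W + (alpha / r) * B A, for serving without adapter overhead."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


rng = random.Random(0)
n_in, n_out, r, alpha = 6, 4, 2, 16  # toy sizes, hypothetical


def rand_matrix(rows, cols, std):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


W = rand_matrix(n_out, n_in, 1.0)
b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
A = rand_matrix(r, n_in, 0.1)   # r x in
B = rand_matrix(n_out, r, 0.1)  # out x r; nonzero here to mimic a trained adapter
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]

y_adapter = lora_forward(W, b, A, B, x, alpha, r)
y_merged = [wx + bi for wx, bi in zip(matvec(merge(W, A, B, alpha, r), x), b)]
max_diff = max(abs(p - q) for p, q in zip(y_adapter, y_merged))

# With B = 0 (the initialization used by Hu et al.) the adapter is a no-op.
B_zero = [[0.0] * r for _ in range(n_out)]
y_init = lora_forward(W, b, A, B_zero, x, alpha, r)
y_base = [wx + bi for wx, bi in zip(matvec(W, x), b)]

print(f"max |adapter - merged| = {max_diff:.2e}")
```

The separate path computes A x first and then B (A x), so it never materializes the out × in matrix B A. The merged path pays that cost once and then behaves like a plain Linear.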

What you save¶

For one Linear(768, 768):

Full FT: 590K trainable params; 590K grads; 2 × 590K Adam moments. That is four buffers of 590K values each; at 2 bytes per value, ~4.5 MiB total.
LoRA r=8: 768 × 8 × 2 = 12K trainable params; the same four buffers come to ~96 KiB. About 48× less.

Across a full model with hundreds of Linears, the savings multiply.
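The per-layer ratio follows directly from the shapes of A and B. A short derivation, using the same in, out, and r as in the construction above and ignoring the bias:

\[ \frac{\text{full params}}{\text{LoRA params}} = \frac{in \cdot out}{r \, (in + out)}, \qquad \text{and for } in = out = d: \quad \frac{d^{2}}{2 r d} = \frac{d}{2r} \]

For d = 768 and r = 8 this gives 768 / 16 = 48. Gradients and Adam moments are allocated per trainable parameter, so the training-state memory shrinks by the same factor.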

What you don't lose¶

A standard concern: a rank-r update is strictly less expressive than a full update. Doesn't accuracy suffer?

Empirically, no, at least for tasks within the pretrained model's general capabilities. Hu et al. report LoRA with ranks as small as 4 to 8 matching or slightly exceeding full FT on the benchmarks they tested, including GLUE. The reasoning: task-specific information is low-rank, so a low-rank parametrization is sufficient.

For tasks outside the pretrained model's distribution (for example, teaching an English-trained model to do well on Swahili), LoRA can underperform full FT because the required ΔW isn't low-rank for such a large distributional shift. Most production fine-tuning isn't this scenario.

Our Phase 28 specialization — "bias the model toward catching irregular-verb errors" — is clearly within distribution: the pretrained MiniGPT has already seen all 8 irregular verbs (be, have, do, go, come, see, eat, write) during Phase 17 training. We're just nudging the distribution mass for "what does go → past simple look like" away from the regular-form prior. That's a textbook low-rank perturbation.

QLoRA: combining ideas¶

QLoRA (Dettmers et al., 2023) noticed that the base weights W in LoRA never need full precision either, because they're frozen. Quantize them aggressively (NF4 — see Phase 26 theory 03). The LoRA matrices A, B stay in fp16 (small enough that the storage doesn't matter).

The forward becomes:

\[ y = \text{dequant}(W_{\text{NF4}}) x + b + \frac{\alpha}{r} B (A x) \]

The dequantization happens on the fly per layer. Memory savings: 4× on the base weights — which dominate the total footprint.

For a 7B model with attention-only adapters at r=8 (~8.4 M LoRA params, fp16 adapters, fp32 Adam moments):

LoRA fp16: 14 GB base + ~17 MB LoRA + ~67 MB Adam (for LoRA only) ≈ 14.1 GB.
QLoRA NF4: 3.5 GB base + ~17 MB LoRA + ~67 MB Adam ≈ 3.6 GB.

With activations and framework overhead on top, a 7B model can fine-tune in <6 GiB of GPU RAM at small batch sizes. Consumer-grade. This is why QLoRA became a default choice for fine-tuning experiments in the open-source LLM community from mid-2023 onward.
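The comparison between fp16 LoRA and QLoRA comes down to the bit width of the frozen base. The sketch below is illustrative only: it assumes attention-only adapters at r=8 on a 32-layer model with 4096-wide projections (the adapter terms change if more layers carry adapters), fp16 adapter weights, and fp32 Adam moments. It ignores activations, adapter gradients, and the per-block scale constants that 4-bit formats also store. The function name `adapter_finetune_bytes` is hypothetical.

```python
def adapter_finetune_bytes(n_base, n_lora, base_bits, lora_bytes=2, adam_moment_bytes=4):
    """Bytes for a frozen base plus trainable LoRA weights and their Adam moments."""
    base = n_base * base_bits // 8           # frozen weights at the given bit width
    lora = n_lora * lora_bytes               # adapter weights (fp16 assumed)
    adam = n_lora * adam_moment_bytes * 2    # m and v, kept for adapters only
    return {"base": base, "lora": lora, "adam": adam, "total": base + lora + adam}


N_BASE = 7_000_000_000
# Assumption: 32 layers x 4 attention projections, each with A (8 x 4096) and B (4096 x 8).
N_LORA = 32 * 4 * (4096 * 8 + 8 * 4096)

lora_fp16 = adapter_finetune_bytes(N_BASE, N_LORA, base_bits=16)
qlora_nf4 = adapter_finetune_bytes(N_BASE, N_LORA, base_bits=4)

for label, mem in (("LoRA fp16", lora_fp16), ("QLoRA 4-bit", qlora_nf4)):
    print(f"{label:>11}: base {mem['base'] / 1e9:5.2f} GB, "
          f"adapters+Adam {(mem['lora'] + mem['adam']) / 1e6:5.1f} MB, "
          f"total {mem['total'] / 1e9:5.2f} GB")
```

In both rows the adapter and optimizer terms are a rounding error next to the base weights, which is why quantizing the frozen base is where the remaining savings are.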

What Phase 28 actually does¶

Our MiniGPT is small — millions of params, not billions. LoRA's memory benefit at this scale is modest; the parameter-count argument still holds (LoRA's trainable set is much smaller than the full weight set), but the absolute byte savings are marginal.

That's fine. Phase 28 is about understanding, not about achieving the largest possible savings. We:

Implement LoRALinear from scratch (lives in src/minituner/).
Derive the param/memory math symbolically.
Fine-tune MiniGPT on the irregular-verb subset of the Phase 12 corpus — specifically, training pairs of the form (prompt = "He __ to school yesterday", target = "went") for the 8 irregular verbs.
Show empirically that the base task PPL (held-out regular-verb sentences) is preserved (no forgetting).
Sweep rank to find the diminishing-returns elbow.
Preview QLoRA by combining src/miniquant/'s NF4 with our LoRALinear.

The QLoRA preview won't show dramatic memory savings because MiniGPT is too small. It will show the integration — the engineering pattern that scales to 7B+ in the same code.

One concrete numerical example¶

A standard Linear(in=4096, out=4096):

Full FT trainable params: 4096² = 16.78 M.
LoRA r=8 trainable params: 4096 × 8 + 8 × 4096 = 65 K. 256× less.
LoRA r=16 trainable params: 4096 × 16 × 2 = 131 K. 128× less.

For a 32-layer transformer with 4 Linears per layer (Q, K, V, output projection) all of size (4096, 4096):

Full FT: 32 × 4 × 16.78M = 2.15 G trainable params.
LoRA r=8: 32 × 4 × 65K = 8.4 M trainable params. Less than 0.4% of full.

Add the MLP Linears (typically (4096, 16384) and (16384, 4096) per layer):

Per layer MLP params: 4096 × 16384 + 16384 × 4096 = 134 M.
Per layer MLP LoRA r=8: (4096 + 16384) × 8 × 2 = 328 K. ~410× less.

Across 32 layers MLP + attention: full FT ~6.4 G params; LoRA ~19 M params. More than two orders of magnitude difference.

For a 7B model, the corresponding numbers are similar in ratio: LoRA r=8 across all Linears is on the order of 20 M trainable params, out of 7 G total. ~0.3% trainable.
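The counts in this section can be recomputed directly from the layer shapes. This is a small sketch for the configuration described above (32 layers, four 4096 × 4096 attention projections and two MLP Linears of 4096 × 16384 and 16384 × 4096 per layer, r=8). Biases, embeddings, and norms are left out, and the helper names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weight count of a dense Linear, bias ignored."""
    return n_in * n_out


def lora_params(n_in, n_out, r):
    """A is r x in and B is out x r, so the adapter has r * (in + out) weights."""
    return r * (n_in + n_out)


LAYERS, D, D_FF, R = 32, 4096, 16384, 8
attn_shapes = [(D, D)] * 4                # Q, K, V, output projection
mlp_shapes = [(D, D_FF), (D_FF, D)]       # up and down projections


def totals(shapes):
    """Return (full fine-tuning params, LoRA params) summed over all layers."""
    full = LAYERS * sum(linear_params(i, o) for i, o in shapes)
    lora = LAYERS * sum(lora_params(i, o, R) for i, o in shapes)
    return full, lora


attn_full, attn_lora = totals(attn_shapes)
mlp_full, mlp_lora = totals(mlp_shapes)
all_full, all_lora = attn_full + mlp_full, attn_lora + mlp_lora

for label, full, lora in (("attention", attn_full, attn_lora),
                          ("MLP", mlp_full, mlp_lora),
                          ("both", all_full, all_lora)):
    print(f"{label:>9}: full {full / 1e9:5.2f} G, LoRA {lora / 1e6:5.2f} M, "
          f"ratio {full / lora:6.1f}x, trainable {100 * lora / full:.2f}%")
```

The ratio depends only on the layer shapes and r, not on the number of layers, so the same helper can be reused when sweeping rank.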

Catastrophic forgetting in one line¶

When you train all 2.15 G attention parameters on a 500-example task, every gradient step nudges them. Across 1000 steps, the cumulative drift is enough to noticeably degrade unrelated capabilities — the original distribution the weights were "good for" is no longer the distribution you're optimizing.

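LoRA's trick: the frozen 2.15 G parameters can't drift. Only the 8.4 M new LoRA params absorb the gradient. The pretrained weights are preserved structurally: the effective change (α/r) B A is confined to rank r, and removing the adapter recovers the base model exactly.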

In lab 02 we measure this: PPL on the regular-verb control split before and after fine-tuning, for both full FT (drifts up) and LoRA (stays flat). Concretely: train on the 8 irregular verbs; eval on sentences with the 12 regular verbs (work, play, walk, talk, …). LoRA should leave the regular-verb PPL flat (within 5%); full FT will visibly drift.
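The forgetting check reduces to comparing two perplexities on the control split. Below is a minimal sketch of that comparison, assuming perplexity is computed as the exponential of the mean per-token negative log-likelihood in nats. The function names and the NLL values are placeholders made up for illustration; they are not measurements from lab 02.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_drift(ppl_before, ppl_after):
    """Fractional change in perplexity; positive means the model got worse."""
    return (ppl_after - ppl_before) / ppl_before


def is_preserved(ppl_before, ppl_after, tolerance=0.05):
    """True if control-split perplexity moved by at most `tolerance` (5% here)."""
    return abs(relative_drift(ppl_before, ppl_after)) <= tolerance


# Synthetic per-token NLLs on the regular-verb control split (NOT real results).
nll_before = [2.0, 2.2, 1.8, 2.0]
nll_after_lora = [2.0, 2.25, 1.8, 2.03]
nll_after_full = [2.5, 2.7, 2.2, 2.6]

ppl_before = perplexity(nll_before)
ppl_lora = perplexity(nll_after_lora)
ppl_full = perplexity(nll_after_full)

for label, ppl in (("LoRA", ppl_lora), ("full FT", ppl_full)):
    print(f"{label:>7}: drift {relative_drift(ppl_before, ppl):+.1%}, "
          f"preserved={is_preserved(ppl_before, ppl)}")
```

Reporting the drift as a fraction of the starting perplexity keeps the 5% threshold meaningful regardless of how large the base perplexity is.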

What this phase does NOT cover¶

RLHF / DPO / PPO implementations. These are surveyed only in theory 04.
Adapter methods other than LoRA (prefix tuning, prompt tuning, IA³). Mentioned, not implemented.
Multi-adapter management (loading/unloading adapters at serve time). Phase 31 territory.
Fine-tuning a 7B+ model. Our model is MiniGPT; the math scales, the experiment doesn't need to.
Hyperparameter search at scale. One LR + one batch size + one decay schedule.
Reward model training. Out of scope.
Spanish-pair-specific fine-tuning. The bilingual signal is already in the corpus; LoRA naturally absorbs it. No special-case handling.

One-paragraph recap¶

Full fine-tuning updates all of a pretrained model's parameters — wasteful in memory and compute, and prone to catastrophic forgetting. The key empirical observation enabling LoRA is that fine-tuning's useful update lives in a low-rank subspace of weight space. LoRA parametrizes this directly: two small matrices A, B whose product BA is added to each frozen W. Trainable parameter count drops by 50–400×, memory drops correspondingly, and the frozen base preserves pretrained capability. QLoRA pushes further by quantizing the frozen base to NF4 — combining Phase 26's quantization with this phase's adapters. The rest of Phase 28 derives the math precisely and runs the recipe on MiniGPT.

Next: theory/01-sft-and-forgetting.md.