
03 — Memory Footprint: Full FT vs LoRA vs QLoRA¶

Training memory goes into four buckets: weights, gradients, optimizer state, and activations. Full FT pays for all four, with the first three scaling with the parameter count N. LoRA pays for gradients and optimizer state only on the few trainable params; the base weights are still there, but they carry no gradient and no Adam state. QLoRA additionally squeezes the frozen weights down to NF4.

The four buckets¶

During training, GPU memory is consumed by:

Weights — the model parameters themselves.
Gradients — one buffer per parameter, accumulated during backward.
Optimizer state — Adam's first and second moments (m, v) per parameter.
Activations — intermediate tensors saved during forward for use in backward.

Plus a small amount of overhead (workspace, allocator metadata) — ignored.

For each bucket, the total bytes scale with the number of parameters or the activation footprint. The dominant scalers are:

Per parameter: weights (2B fp16 or 4B fp32), grads (same as weights), Adam (2 × 4B fp32 = 8B in mixed precision because Adam states stay fp32 for stability).
Per token of context per layer: activations (depends heavily on architecture and gradient checkpointing).

Full FT footprint¶

For a model with N parameters in mixed-precision training:

Weights: 2N bytes (fp16) + 4N bytes (fp32 master copy for Adam) = 6N.
Gradients: 2N bytes (fp16 grads).
Adam state: 8N bytes (m and v in fp32).
Activations: depends — typically ~ batch × seq_len × hidden × layers × small_const bytes.

Total per-parameter cost (excluding activations): 16 N bytes.

For a 7B model: 16 × 7e9 bytes = 112 GB (about 104 GiB). Plus activations: easily another 20–40 GB for sensible batch sizes.

This is why training 7B+ models takes multi-GPU setups (4-8 A100s typical for full FT).
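The per-parameter arithmetic above can be sketched as a small helper. The function and constant names are illustrative, not taken from the course code, and activations are deliberately left out.

```python
# Per-parameter byte costs for mixed-precision full fine-tuning with Adam,
# as listed in this section. Activations are excluded.
FP16_WEIGHT = 2   # working copy used in forward/backward
FP32_MASTER = 4   # master copy updated by the optimizer
FP16_GRAD = 2     # gradient buffer
ADAM_STATE = 8    # m and v, 4 bytes each in fp32


def full_ft_bytes(n_params: int) -> int:
    """Bytes for weights + gradients + Adam state (no activations)."""
    per_param = FP16_WEIGHT + FP32_MASTER + FP16_GRAD + ADAM_STATE  # 16 bytes
    return per_param * n_params


if __name__ == "__main__":
    total = full_ft_bytes(7_000_000_000)
    # Report both decimal (GB) and binary (GiB) units to avoid mixing them up.
    print(f"{total / 1e9:.0f} GB = {total / 2**30:.1f} GiB")
```

Printing both units is a deliberate choice: 16 × 7e9 bytes is 112 decimal gigabytes, which is a little less when expressed in GiB.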

LoRA footprint¶

LoRA freezes the base. The base contributes only to bucket 1 (weights), once, in inference precision.

Frozen base weights: 2N bytes (fp16). No fp32 master copy, no gradient buffer, no Adam state. Just the weights, used for forward pass.
LoRA trainable params: call it N_lora. In mixed-precision: 6 N_lora (fp16 + fp32 master) + 2 N_lora (grads) + 8 N_lora (Adam) = 16 N_lora.
Activations: same as full FT — depends on architecture. LoRA adds a couple of small extra forward ops but doesn't change activation memory materially.

Total: 2N + 16 N_lora bytes + activations.

For LLaMA-7B with LoRA r=8 (N_lora ≈ 19M from theory 02):

Base: 2 × 7e9 bytes = 14 GB.
LoRA buckets: 16 × 19e6 bytes = 304 MB.
Total (no activations): ~14.3 GB.

Compare to full FT's 112 GB: ~8× less.
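A minimal sketch of the LoRA accounting, assuming the standard LoRA factorization in which a Linear of shape d_in → d_out gains two matrices of rank r. The helper names are hypothetical; the 19M figure is the one quoted above from theory 02, not something this code derives.

```python
def lora_params_for_linear(d_in: int, d_out: int, r: int) -> int:
    """Trainable params added to one Linear: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_footprint_bytes(n_base: int, n_lora: int) -> int:
    """Frozen fp16 base plus fully trained LoRA params (no activations)."""
    frozen_base = 2 * n_base   # fp16 weights only: no master copy, grads, or Adam
    trainable = 16 * n_lora    # 6 (fp16 + fp32 master) + 2 (grads) + 8 (Adam)
    return frozen_base + trainable


if __name__ == "__main__":
    total = lora_footprint_bytes(7_000_000_000, 19_000_000)
    print(f"LoRA total: {total / 1e9:.1f} GB")
    print("one 4096x4096 Linear at r=8:", lora_params_for_linear(4096, 4096, 8))
```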

QLoRA footprint¶

QLoRA quantizes the base to NF4 (4 bits per weight). The frozen base now occupies:

0.5N bytes (NF4 packing: 2 weights per byte).
Plus the NF4 scales: typically 4 × N / 64 = N/16 bytes (one fp32 scale per block of 64 weights, before double quantization).

Total frozen base: ~0.56 N bytes. 3.6× smaller than fp16.

LoRA trainable params stay fp16: 16 N_lora bytes total. Unchanged from LoRA.

For LLaMA-7B with QLoRA r=8:

Base (NF4): 0.56 × 7e9 bytes = 3.9 GB.
LoRA: 304 MB.
Total (no activations): ~4.2 GB.

Compare to full FT's 112 GB: ~27× less. Compare to LoRA fp16's 14 GB: ~3.3× less.
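The NF4 arithmetic can be checked with a short sketch. It assumes one 4-byte scale per block of 64 weights (which is what the 4 × N / 64 term implies) and no double quantization; the function names are illustrative only.

```python
def nf4_base_bytes(n_base: int, block_size: int = 64, scale_bytes: int = 4) -> int:
    """Frozen base in NF4: packed 4-bit weights plus one scale per block."""
    packed = (n_base + 1) // 2               # two 4-bit weights per byte
    n_blocks = -(-n_base // block_size)      # ceiling division
    return packed + scale_bytes * n_blocks


def qlora_footprint_bytes(n_base: int, n_lora: int) -> int:
    """NF4 frozen base plus LoRA params trained as in plain LoRA (no activations)."""
    return nf4_base_bytes(n_base) + 16 * n_lora


if __name__ == "__main__":
    base = nf4_base_bytes(7_000_000_000)
    total = qlora_footprint_bytes(7_000_000_000, 19_000_000)
    print(f"bytes per base param: {base / 7_000_000_000:.4f}")
    print(f"QLoRA total: {total / 1e9:.2f} GB")
```

The per-parameter cost comes out at 0.5625 bytes, which the text rounds to 0.56.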

A 24 GB GPU (a consumer RTX 4090 or a cloud A10G) can fit QLoRA fine-tuning of a 13B model with room for activations. This is the unlock that made 2023's open-source LLM fine-tuning explosion possible.

What about gradient checkpointing?¶

Gradient checkpointing trades compute for memory: instead of saving every layer's activations for backward, save only some and recompute the rest on demand during backward. The standard setup, which checkpoints roughly every sqrt(L)-th layer of an L-layer model, cuts activation memory from O(L) to O(sqrt(L)) at the cost of about one extra forward pass per backward (~30% more compute).

Universal across full FT, LoRA, QLoRA. Doesn't change the relative footprint of the three regimes — it just shrinks the activation bucket in all of them.
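To see where the sqrt(L) comes from, here is a toy cost model, not a real profiler. It assumes every layer's activation costs one unit, that one checkpoint is kept per segment, and that only one segment is recomputed at a time. The function names are hypothetical.

```python
import math


def stored_units(num_layers: int, segment: int) -> int:
    """Peak number of layer activations held at once (unit-cost model).

    One checkpoint is kept per segment, plus the activations of the single
    segment being recomputed during backward.
    """
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment


def best_segment(num_layers: int) -> tuple[int, int]:
    """Return (segment length, peak units) minimising stored activations."""
    return min(
        ((k, stored_units(num_layers, k)) for k in range(1, num_layers + 1)),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    for layers in (16, 36):
        k, units = best_segment(layers)
        print(f"L={layers}: segment={k}, peak={units} units (vs {layers} without)")
```

Under these assumptions the best segment length lands at sqrt(L) and the peak at about 2·sqrt(L) units, which is the scaling the text refers to.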

What about paged optimizer states?¶

QLoRA's full recipe also uses paged optimizer states: NVIDIA's CUDA Unified Memory lets you allocate Adam state in CPU RAM, and the driver pages it in/out as needed. The result: Adam state's 8 N_lora bytes don't have to be in GPU memory simultaneously.

For our small MiniGPT, Adam state is small enough that paging is unnecessary. For 70B QLoRA, paging can be the difference between riding out a memory spike on one A100 and running out of memory.

We mention but don't implement paged Adam in Phase 28.

Memory ablation table¶

For MiniGPT (Phase 17, ~85M params) with r=8 (N_lora ≈ 1.3M, as implied by the gradient column) and fp16 mixed precision:

Setting	Trainable weights (fp16 + fp32 master)	Grads	Adam	Total trainable	Frozen base	Approx. total
Full FT	510 MB	170 MB	680 MB	1.36 GB	n/a	~1.4 GB
LoRA r=8	7.8 MB	2.6 MB	10.4 MB	20.8 MB	170 MB	~190 MB
QLoRA r=8	7.8 MB	2.6 MB	10.4 MB	20.8 MB	48 MB	~70 MB

(Numbers rounded; doesn't include activations. For a real benchmark on Borja's hardware, lab 03 measures actual values via torch.cuda.max_memory_allocated.)
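A hedged sketch of how such a measurement might look, not lab 03's actual harness. `peak_cuda_bytes` and `step_fn` are hypothetical names; the only PyTorch calls used are `torch.cuda.is_available`, `torch.cuda.reset_peak_memory_stats`, `torch.cuda.synchronize`, and `torch.cuda.max_memory_allocated`.

```python
from typing import Callable, Optional


def peak_cuda_bytes(step_fn: Callable[[], None]) -> Optional[int]:
    """Run step_fn once and return the peak bytes PyTorch allocated for tensors.

    Returns None when PyTorch is not installed or no CUDA device is available,
    so the helper can be imported on a CPU-only machine.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()   # forget peaks from earlier work
    step_fn()                              # e.g. one forward + backward + optimizer step
    torch.cuda.synchronize()               # make sure queued kernels have finished
    return torch.cuda.max_memory_allocated()


if __name__ == "__main__":
    peak = peak_cuda_bytes(lambda: None)
    print("no GPU available" if peak is None else f"peak: {peak / 2**20:.1f} MiB")
```

Note that `max_memory_allocated` counts tensor allocations, including activations, so measured values will sit above the table's activation-free estimates.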

The reduction is modest in absolute terms for MiniGPT, but the relative ratios are in the same range as the LLaMA-scale numbers (somewhat lower, because the LoRA parameters are a larger fraction of a small model). The point of Phase 28's QLoRA preview isn't to save memory we don't need; it's to demonstrate the recipe and measure the ratios.

What scales when N is bigger¶

The table above follows from fixed per-parameter costs, so buckets 1–3 scale linearly with N. The 7B numbers cited earlier follow directly from the same per-parameter costs.

Activation memory scales differently: per token and per layer (batch × seq_len × hidden × layers), not with the parameter count directly. This is why gradient checkpointing is a separate axis from LoRA: LoRA shrinks buckets 2–3 and drops the base model's fp32 master copy, but activations are bucket 4 and need their own treatment.
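The activation estimate from the full-FT section can be written as a function to make the point concrete. This is only a rough sketch: `small_const` is the same unspecified architecture-dependent factor as in the text, so it is a required argument rather than a guessed default.

```python
def activation_bytes(
    batch: int, seq_len: int, hidden: int, layers: int, small_const: float
) -> float:
    """Rough activation footprint: batch x seq_len x hidden x layers x small_const.

    small_const bundles bytes per value and the number of saved tensors per
    layer; it depends on the architecture and must be supplied by the caller.
    """
    return batch * seq_len * hidden * layers * small_const


if __name__ == "__main__":
    one = activation_bytes(1, 1024, 768, 12, small_const=2)
    eight = activation_bytes(8, 1024, 768, 12, small_const=2)
    print(f"batch 1: {one / 1e6:.1f} MB, batch 8: {eight / 1e6:.1f} MB")
```

Neither the LoRA rank nor the quantization format appears in the signature, which is the sense in which activations are a separate axis.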

QLoRA's three-piece innovation¶

Dettmers et al. (QLoRA, 2023) combined three things:

NF4 quantization of frozen base. Theory: Phase 26 file 03. Reduces bucket 1.
Double quantization of NF4 scales. Quantizes the fp32 per-block scales themselves to 8-bit floats, saving roughly 0.37 bits per weight on average. Negligible for our model size; covered in Phase 26.
Paged optimizer states. Adam state in CPU RAM, paged in/out. Reduces bucket 3 dynamically.

Together: a 7B model fine-tunes in <6 GiB.

For Phase 28 we use (1) only. (2) and (3) are conceptual additions; implementing them properly requires hardware-specific glue (bitsandbytes for paged optimizers) that distracts from the core algorithm.
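For reference, the overhead that double quantization removes can be worked out in a few lines. The sketch assumes the configuration described in the QLoRA paper: 32-bit scales over blocks of 64 weights, re-quantized to 8 bits with one 32-bit outer constant per 256 scales. Function names are illustrative.

```python
def scale_overhead_bits(block: int = 64, scale_bits: int = 32) -> float:
    """Bits of scale storage per weight with plain block-wise quantization."""
    return scale_bits / block


def double_quant_overhead_bits(
    block: int = 64, q_bits: int = 8, outer_block: int = 256, outer_bits: int = 32
) -> float:
    """Bits per weight after quantizing the scales themselves.

    Each weight block keeps a q_bits scale; every outer_block of those scales
    shares one outer_bits constant.
    """
    return q_bits / block + outer_bits / (block * outer_block)


if __name__ == "__main__":
    before = scale_overhead_bits()
    after = double_quant_overhead_bits()
    print(f"before: {before:.3f} bits/weight, after: {after:.3f} bits/weight")
    print(f"saved:  {before - after:.3f} bits/weight")
```

The saving is measured in bits, not bytes, per weight, which is why it is negligible for a model of MiniGPT's size.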

Borja's machine implications¶

learners/borja/profile.md notes: CPU-only at Phase 28. Wait — but training needs a GPU, right?

Yes. Phase 23+ assumed a move to cloud GPU. By Phase 28, the cloud platform is settled (decision made at Phase 23). The fine-tuning runs in lab 02 happen on cloud GPU. The CPU fallback is "compute the param-count math and run a 1-epoch dry run; full training is on cloud".

Memory measurements for the QLoRA preview need a real GPU; the synthetic numerical results can be approximated on CPU. The DoD requires the cloud run.

Drill problems¶

Solutions become available at phase open in solutions/03-memory-footprint-ref.md.

A 13B model with LoRA r=16 across all Linears. Compute N_lora. Then compute the total memory in three regimes (full FT, LoRA fp16, QLoRA NF4). Compare to a 24 GiB consumer GPU.
The "double quantization" trick of QLoRA quantizes the per-block scales (one fp32 per 64 weights) to FP8 with an outer fp32 scale per 256 blocks. Compute the per-weight overhead before and after. Show the ~0.37 bits/weight savings.
Adam state is 2 × 4N = 8N bytes in fp32. If Adam state is kept in fp16 (or quantized, as in 8-bit optimizers), how does the total memory change? Argue why fp32 is the conventional choice.
Gradient checkpointing trades 30% extra compute for ~sqrt(L) less activation memory. Estimate the wall-clock impact on a fine-tuning run if compute (not memory) is the bottleneck. Does checkpointing still make sense?

One-paragraph recap¶

Training memory has four buckets: weights, gradients, optimizer state, activations. Full fine-tuning pays for all four, with the first three scaling with total N. LoRA pays for buckets 2–3 only on the small N_lora of trainable params, leaving the frozen base in bucket 1 with no extra storage. QLoRA additionally compresses bucket 1 to ~0.56 bytes/param via NF4 quantization, plus optionally pages Adam state to CPU memory. The combined effect is order-of-magnitude memory savings: a 7B model that needs 112 GB for full FT fits in about 4.2 GB with QLoRA, in both cases before activations. Phase 28 demonstrates the recipe on MiniGPT (where absolute savings are small but the ratios are similar) and previews QLoRA. Next theory file surveys alignment training (DPO/RLHF) for cultural completeness.

Next: theory/04-alignment-survey.md.