Lab 01 — LoRA Parameter and Memory Counts

Pure arithmetic: count trainable parameters, compute memory. No training at all. What makes this lab revealing is comparing full FT vs LoRA r=4/8/16 vs QLoRA on the same model (MiniGPT with d_model=64, n_layer=4). The ratio matters more than the absolute value.

Anchors: theory/02-parameter-count.md, theory/03-memory-footprint.md.


What you produce

A script experiments/28-lora-counts/count.py that prints a parameter-count + memory-footprint table for MiniGPT under five training regimes:

  1. Full fine-tuning (fp16 mixed precision).
  2. LoRA r=4 (all Linears).
  3. LoRA r=8 (all Linears).
  4. LoRA r=16 (all Linears).
  5. QLoRA r=8 (all Linears, base in NF4).

Plus a markdown report experiments/28-lora-counts/REPORT.md discussing the ratios.

TODOs (sketch)

Block A — Parameter counts (pure arithmetic, no model load)

  1. Hard-code MiniGPT's shape from §A13: n_layer=4, n_head=4, d_model=64, vocab_size (read from Phase 13 artefact), d_ff = 4 * d_model = 256.
  2. Per layer (weights only; see pitfall 1 for biases):
       • Attention: 4 Linears of (d_model, d_model) = (64, 64).
       • MLP: (d_model, d_ff) = (64, 256) + (d_ff, d_model) = (256, 64).
       • Full FT trainable params per layer: 4 * 64 * 64 + 64 * 256 + 256 * 64 = 16384 + 16384 + 16384 = 49152.
       • LoRA r=8 per layer: attention 4 * (64+64) * 8 = 4096; MLP (64+256)*8 + (256+64)*8 = 2560 + 2560 = 5120; total 9216.
  3. Multiply by n_layer=4 for the model-wide totals (full FT: 4 * 49152 = 196608; LoRA r=8: 4 * 9216 = 36864), then account for the embedding/lm_head separately (excluded from LoRA per BLUEPRINT default, so it stays frozen in the LoRA regimes).
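The per-layer arithmetic above can be sketched in a few lines. This is a minimal sketch, not the reference count.py: the helper names (linear_shapes, full_ft_params, lora_params) are hypothetical, biases are ignored as in the hand computation, and the embedding/lm_head are left out because vocab_size comes from the Phase 13 artefact.

```python
# Minimal sketch of Block A. Helper names are illustrative, not part of the lab code.
D_MODEL = 64
N_LAYER = 4
D_FF = 4 * D_MODEL  # 256


def linear_shapes(d_model=D_MODEL, d_ff=D_FF):
    """Return the (in, out) shapes of the six Linears in one transformer layer."""
    attention = [(d_model, d_model)] * 4        # 4 attention Linears, per the lab text
    mlp = [(d_model, d_ff), (d_ff, d_model)]    # up- and down-projection
    return attention + mlp


def full_ft_params(shapes):
    """Weights only; biases are ignored, matching the hand computation."""
    return sum(d_in * d_out for d_in, d_out in shapes)


def lora_params(shapes, r):
    """Each adapted Linear gains A (r x in) and B (out x r): r * (in + out) params."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


shapes = linear_shapes()
print("full FT per layer:", full_ft_params(shapes))
for r in (4, 8, 16):
    per_layer = lora_params(shapes, r)
    print(f"LoRA r={r}: per layer {per_layer}, all layers {N_LAYER * per_layer}")
```

Keeping the shapes in one list means the full-FT and LoRA counts are derived from the same source, so a mistake in a shape shows up in both columns rather than silently skewing the ratio.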

Block B — Memory footprint (mixed-precision regime)

Per theory 03's four-bucket model. For each regime, compute the bytes for:

  • Weights (fp16 + fp32 master if full FT; NF4 if QLoRA; fp16 if LoRA base).
  • Gradients (fp16, only for trainable).
  • Adam state (fp32 m and v, only for trainable).
  • Activations: rough estimate batch_size × seq_len × d_model × n_layer × 16 bytes (use batch_size=16, seq_len=32).
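One possible way to turn the four buckets into code is sketched below. Several things here are assumptions of the sketch rather than statements from theory 03: the function name memory_bytes is hypothetical, LoRA adapters are assumed to be stored like full-FT weights (fp16 plus an fp32 master copy), and the parameter counts cover only the transformer blocks (embedding/lm_head omitted because vocab_size is not fixed here). Adjust the byte sizes to whatever theory 03 prescribes.

```python
# Minimal sketch of the four-bucket model. Byte sizes for the trainable
# parameters are ASSUMPTIONS of this sketch, not taken from theory 03.
FP16, FP32 = 2, 4
NF4_BYTES = 0.5 + 4 / 64  # packed 4-bit weight + per-block scale overhead (pitfall 5)


def memory_bytes(trainable, frozen, frozen_bytes_per_param,
                 batch_size=16, seq_len=32, d_model=64, n_layer=4):
    """Return a dict of bytes per bucket plus their total."""
    buckets = {
        # Trainable weights: fp16 working copy + fp32 master (assumed for adapters too).
        "weights": trainable * (FP16 + FP32) + frozen * frozen_bytes_per_param,
        "grads": trainable * FP16,        # fp16 gradients, trainable params only
        "adam": trainable * 2 * FP32,     # fp32 m and v, trainable params only
        # Rough estimate from the lab text: 16 bytes per (token, channel, layer).
        "activations": batch_size * seq_len * d_model * n_layer * 16,
    }
    buckets["total"] = sum(buckets.values())
    return buckets


BASE = 4 * 49152     # transformer-block weights only (embedding/lm_head omitted)
LORA_R8 = 4 * 9216   # LoRA r=8 adapter params across all layers

regimes = {
    "Full FT": memory_bytes(BASE, 0, 0),
    "LoRA r=8": memory_bytes(LORA_R8, BASE, FP16),
    "QLoRA r=8": memory_bytes(LORA_R8, BASE, NF4_BYTES),
}
for name, buckets in regimes.items():
    print(name, buckets)
```

Returning a dict per regime keeps the buckets separate, which is what both the table in Block C and the stacked bars in Block D need.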

Block C — The table

Produce a Markdown table like:

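| Regime | Trainable params | Frozen params | Weights | Grads | Adam | Activations | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Full FT (fp16+fp32) | ... | 0 | ... | ... | ... | ... | ... |
| LoRA r=4 | ... | ... | ... | ... | ... | ... | ... |
| ... | | | | | | | |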

Plus a "ratio vs full FT" column.
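A small formatter is enough to emit such a table from computed values, which also helps with the "don't hand-edit" constraint. The sketch below is only illustrative: markdown_table and with_ratio are hypothetical helpers, and the totals in the example are placeholder numbers, not results for MiniGPT.

```python
# Minimal sketch: render rows as a Markdown table and add a ratio column.
def markdown_table(headers, rows):
    """Build a pipe-delimited Markdown table from headers and row tuples."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def with_ratio(totals, baseline="Full FT"):
    """Turn {regime: total_bytes} into rows with a 'ratio vs full FT' cell."""
    base = totals[baseline]
    return [(name, total, f"{total / base:.2%}") for name, total in totals.items()]


# Placeholder totals for illustration only; real values come from count.py.
example_totals = {"Full FT": 1000, "LoRA r=8": 250}
print(markdown_table(["Regime", "Total", "Ratio vs full FT"], with_ratio(example_totals)))
```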

Block D — A bar chart

Matplotlib stacked bar (one bar per regime, stacks: weights / grads / Adam / activations). Save to experiments/28-lora-counts/memory_breakdown.png.
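A stacked bar in Matplotlib is built by drawing one ax.bar call per bucket and passing the running sum of the previous buckets as bottom. The sketch below assumes a hypothetical layout where buckets maps each bucket name to a list of values, one per regime; the function names are illustrative. Matplotlib is imported inside plot_memory so the offset arithmetic can be checked without it.

```python
# Minimal sketch of a stacked bar chart; names and data layout are illustrative.
def stack_bottoms(regimes, buckets):
    """For each bucket, the y-offset at which its segment starts in every bar."""
    running = [0.0] * len(regimes)
    bottoms = {}
    for name, values in buckets.items():
        bottoms[name] = list(running)
        running = [b + v for b, v in zip(running, values)]
    return bottoms


def plot_memory(regimes, buckets, path):
    """Draw one bar per regime, stacked by bucket, and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend: write the file without a display
    import matplotlib.pyplot as plt

    bottoms = stack_bottoms(regimes, buckets)
    fig, ax = plt.subplots(figsize=(7, 4))
    for name, values in buckets.items():
        ax.bar(regimes, values, bottom=bottoms[name], label=name)
    ax.set_ylabel("bytes")
    ax.set_title("Memory breakdown per regime")
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Example call (the output directory must already exist):
# plot_memory(["Full FT", "LoRA r=8"],
#             {"weights": [...], "grads": [...], "Adam": [...], "activations": [...]},
#             "experiments/28-lora-counts/memory_breakdown.png")
```

If the activation bucket dwarfs the others at this scale, a second chart without it (or a log axis) may make the weights / grads / Adam differences easier to read.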

Block E — Sanity check against theory

The ratios should match theory 02's per-layer ratio, which for square h × h layers works out to 2r/h (LoRA r=8 vs full FT on attention: ratio = (64+64)*8 / 64² = 1024/4096 = 25% = 2*8/64). That is far larger than the LLaMA-scale figure of ≈ 0.2% (for example r=8, h=8192 gives 16/8192) because our h=64 is tiny. Note this in the REPORT.
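The sanity check can be expressed as a one-line function. This is a sketch with a hypothetical helper name (lora_ratio); the larger hidden sizes 4096 and 8192 are illustrative values chosen here to show the trend, not figures taken from theory 02.

```python
# Minimal sketch: LoRA-to-full parameter ratio for a single Linear layer.
def lora_ratio(d_in, d_out, r):
    """r * (in + out) adapter params divided by in * out full params."""
    return r * (d_in + d_out) / (d_in * d_out)


# For a square layer the ratio reduces to 2r / h.
for h in (64, 4096, 8192):
    print(f"h={h}: r=8 ratio = {lora_ratio(h, h, 8):.4%}")
```

The same function applied to the MLP shapes (64, 256) and (256, 64) gives a different ratio from the attention layers, which is why the whole-model fraction is not simply 2r/h.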

Constraints

  • No training. Pure arithmetic.
  • Numbers in the REPORT must be reproducible from count.py — don't hand-edit.
  • Cite theory 02 and 03 where the formulas come from.

Stop conditions

You're done when:

  1. python experiments/28-lora-counts/count.py produces the table.
  2. memory_breakdown.png exists and is legible.
  3. REPORT.md includes: the table, a 2-paragraph discussion of the ratios at MiniGPT-scale vs production-scale, and the explicit numerical answer to "What fraction of params does LoRA r=8 train?" for MiniGPT.
  4. Hand-computed sanity check matches count.py output (do at least one regime by hand).

Pitfalls (specific to this lab)

  1. Forgetting biases. A typical nn.Linear(in, out) has out × in + out params (with bias). Whether we include biases in the LoRA-able set depends on target_modules. Default per BLUEPRINT: bias stays frozen, not LoRA'd. Be explicit.
  2. Counting the lm_head twice. Some implementations tie the embedding to the lm_head. Check tie_weights flag.
  3. Activations scale with batch × seq × d_model, not with the parameter count. Don't underestimate them.
  4. fp32 master copy for Adam. Full FT in mixed precision keeps an fp32 copy of weights for stability. Adds 4N bytes. Easy to miss.
  5. NF4 per-block scales. NF4 packs 2 weights/byte but also needs ~1 fp32 scale per 64-weight block. The overhead is 4/64 = 0.0625 byte/weight, for about 0.5625 byte/weight in total. Include it.
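Two of these pitfalls are easy to pin down numerically. The sketch below uses hypothetical helper names; scale_bytes=4 is chosen to match the 4/64 overhead quoted in pitfall 5, and block_size=64 follows the same pitfall.

```python
# Minimal sketch for pitfalls 1 and 5; helper names are illustrative.
def linear_param_count(d_in, d_out, bias=True):
    """Parameters of a Linear layer: out x in weights, plus out biases if present."""
    return d_in * d_out + (d_out if bias else 0)


def nf4_bytes(n_weights, block_size=64, scale_bytes=4):
    """Packed 4-bit weights (2 per byte) plus one scale per block."""
    n_blocks = -(-n_weights // block_size)  # ceiling division
    return n_weights / 2 + n_blocks * scale_bytes


print("Linear(64, 256) with bias:", linear_param_count(64, 256))
print("Linear(64, 256) without bias:", linear_param_count(64, 256, bias=False))
print("NF4 bytes per weight for a 64x64 layer:", nf4_bytes(64 * 64) / (64 * 64))
```

Whichever convention count.py adopts for biases, stating it once and using the same helper everywhere keeps the hand-check and the script comparable.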

When to consult solutions

If your hand-check disagrees with count.py, debug it yourself first. If you've spent more than 30 minutes and the discrepancy is still there, open solutions/01-lora-counts-ref.md.

Estimated time

2-3 hours.


Next: lab/02-lora-finetune.md.