English · Español
Lab 01 — LoRA Parameter and Memory Counts¶
🇪🇸 Pura aritmética: contar parámetros entrenables, calcular memoria. Sin entrenar nada. Lo que hace esta práctica reveladora es comparar full FT vs LoRA r=4/8/16 vs QLoRA al mismo modelo (MiniGPT con
d_model=64, n_layer=4). El ratio importa más que el valor absoluto.
Anchors: theory/02-parameter-count.md, theory/03-memory-footprint.md.
What you produce¶
A script experiments/28-lora-counts/count.py that prints a parameter-count + memory-footprint table for MiniGPT under five training regimes:
- Full fine-tuning (fp16 mixed precision).
- LoRA r=4 (all Linears).
- LoRA r=8 (all Linears).
- LoRA r=16 (all Linears).
- QLoRA r=8 (all Linears, base in NF4).
Plus a markdown report experiments/28-lora-counts/REPORT.md discussing the ratios.
TODOs (sketch)¶
Block A — Parameter counts (pure arithmetic, no model load)¶
- Hard-code MiniGPT's shape from §A13:
n_layer=4,n_head=4,d_model=64,vocab_size(read from Phase 13 artefact),d_ff = 4 * d_model = 256. - Per layer:
- Attention: 4 Linears of
(d_model, d_model) = (64, 64). - MLP:
(d_model, d_ff) = (64, 256)+(d_ff, d_model) = (256, 64). - Full FT trainable params per layer:
4 * 64 * 64 + 64 * 256 + 256 * 64 = 16384 + 16384 + 16384 = 49152. - LoRA r=8 per layer: attention
4 * (64+64) * 8 = 4096; MLP(64+256)*8 + (256+64)*8 = 2560 + 2560 = 5120; total9216. - Across all
n_layer=4layers, plus the embedding/lm_head (excluded from LoRA per BLUEPRINT default).
Block B — Memory footprint (mixed-precision regime)¶
Per theory 03's four-bucket model. For each regime, compute the bytes for:
- Weights (fp16 + fp32 master if full FT; NF4 if QLoRA; fp16 if LoRA base).
- Gradients (fp16, only for trainable).
- Adam state (fp32
mandv, only for trainable). - Activations: rough estimate
batch_size × seq_len × d_model × n_layer × 16 bytes(usebatch_size=16, seq_len=32).
Block C — The table¶
Produce a Markdown table like:
| Regime | Trainable params | Frozen params | Weights | Grads | Adam | Activations | Total |
|---|---|---|---|---|---|---|---|
| Full FT (fp16+fp32) | ... | 0 | ... | ... | ... | ... | ... |
| LoRA r=4 | ... | ... | ... | ... | ... | ... | ... |
| ... |
Plus a "ratio vs full FT" column.
Block D — A bar chart¶
Matplotlib stacked bar (one bar per regime, stacks: weights / grads / Adam / activations). Save to experiments/28-lora-counts/memory_breakdown.png.
Block E — Sanity check against theory¶
The ratios should match theory 02's r/h for square layers (LoRA r=8 vs full FT on attention: ratio = (64+64)*8 / 64² = 1024/4096 = 25%). Larger than the LLaMA-scale r/h ≈ 0.2% — that's because our h=64 is tiny. Note this in the REPORT.
Constraints¶
- No training. Pure arithmetic.
- Numbers in the REPORT must be reproducible from
count.py— don't hand-edit. - Cite theory 02 and 03 where the formulas come from.
Stop conditions¶
You're done when:
python experiments/28-lora-counts/count.pyproduces the table.memory_breakdown.pngexists and is legible.REPORT.mdincludes: the table, a 2-paragraph discussion of the ratios at MiniGPT-scale vs production-scale, and the explicit numerical answer to "What fraction of params does LoRA r=8 train?" for MiniGPT.- Hand-computed sanity check matches
count.pyoutput (do at least one regime by hand).
Pitfalls (specific to this lab)¶
- Forgetting biases. A typical
nn.Linear(in, out)hasout × in + outparams (with bias). Whether we include biases in the LoRA-able set depends ontarget_modules. Default per BLUEPRINT: bias stays frozen, not LoRA'd. Be explicit. - Counting the lm_head twice. Some implementations tie the embedding to the lm_head. Check
tie_weightsflag. - Activations dominated by batch × seq × d_model, not by params. Don't underestimate them.
- fp32 master copy for Adam. Full FT in mixed precision keeps an fp32 copy of weights for stability. Adds
4Nbytes. Easy to miss. - NF4 per-block scales. NF4 packs 2 weights/byte but also needs ~1 fp16 scale per 64-weight block. The overhead is
4/64 = 0.0625 byte/weight. Include it.
When to consult solutions¶
After your hand-check disagrees with count.py, debug yourself first. If you've spent >30 min and the discrepancy is still there, open solutions/01-lora-counts-ref.md.
Estimated time¶
2-3 hours.
Next: lab/02-lora-finetune.md.