Lab 02 — Parameter inventory: count every parameter
Read theory/02-ffn-and-activations.md (§"FFN parameter count") and PHASE_17_PLAN.md §6 (at repo root). Do not consult solutions/.
Objective
Walk through your Mini-GPT layer-by-layer and count every single parameter. Match the count to a closed-form formula. Plot the distribution as a stacked bar (where does the parameter mass live?). When you finish this lab you should be able to write down the parameter count of any Mini-GPT variant from its config alone.
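Background
The closed-form parameter count for Mini-GPT (tied embeddings, biases on FFN, no biases on Q/K/V projections, LayerNorm with scale + shift):
\[ N = |V| \, d_\text{model} + n_\text{layers} \left( 4 d_\text{model}^2 + 2 \, d_\text{model} \, d_\text{ff} + d_\text{ff} + d_\text{model} + 4 d_\text{model} \right) + 2 d_\text{model} \]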
For Mini-GPT (\(d_\text{model} = 64, n_\text{heads} = 4, n_\text{layers} = 2, d_\text{ff} = 256, |V| = 64\)):
- Tied embedding: \(64 \cdot 64 = 4{,}096\)
- Per block: \(4 \cdot 64^2 + 2 \cdot 64 \cdot 256 + 256 + 64 + 4 \cdot 64 = 16{,}384 + 32{,}768 + 256 + 64 + 256 = 49{,}728\)
- Two blocks: \(99{,}456\)
- Final LN: \(128\)
- Total: \(4{,}096 + 99{,}456 + 128 = 103{,}680\) parameters.
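If your earlier hand count was ~57k, re-derive. Omitting the FFN biases and every LayerNorm parameter removes only 1,280 parameters (leaving 102,400), so a count that low is missing whole weight matrices as well.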
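The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the conventions stated in this section; the function name `closed_form_params` and its argument names are illustrative and do not come from the repo.

```python
def closed_form_params(d_model: int, n_layers: int, d_ff: int, vocab_size: int) -> int:
    """Closed-form count: tied embeddings, FFN biases, no attention biases,
    LayerNorm with scale + shift, no learned positional embeddings."""
    embedding = vocab_size * d_model            # shared with the LM head (tied)
    attention = 4 * d_model ** 2                # W_q, W_k, W_v, W_o, no biases
    ffn = 2 * d_model * d_ff + d_ff + d_model   # W_1, W_2, b_1, b_2
    layer_norms = 2 * 2 * d_model               # two LayerNorms, gamma + beta each
    final_ln = 2 * d_model                      # gamma + beta
    return embedding + n_layers * (attention + ffn + layer_norms) + final_ln


if __name__ == "__main__":
    print(closed_form_params(d_model=64, n_layers=2, d_ff=256, vocab_size=64))  # 103680
```

Note that \(n_\text{heads}\) does not appear: splitting the projections into heads reshapes them but does not change the number of weights.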
Tasks
Task 1 — programmatic walk
In a script scripts/phase17_param_inventory.py:
def inventory(model: MiniGPT) -> dict[str, int]:
    """Return {component_name: param_count} for every parameter in the model."""
    ...
Components to enumerate (suggested keys):
- E (embedding, tied)
- block_0.attn.W_q, W_k, W_v, W_o
- block_0.ffn.W_1, b_1, W_2, b_2
- block_0.ln1.gamma, beta, ln2.gamma, beta
- (same for block_1)
- ln_final.gamma, beta
Sum all entries and assert that the total equals the closed-form prediction (103,680 for the locked config) exactly; this is integer counting, not a numerical computation.
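One way to structure the walk is to separate "collect (name, shape) pairs" from "count them". The sketch below is framework-agnostic and uses hand-written shapes for the locked config. The helper names `inventory_from_shapes` and `locked_config_shapes` are hypothetical, and the shapes are an assumption about how your model lays out its weights.

```python
from math import prod
from typing import Iterable


def inventory_from_shapes(named_shapes: Iterable[tuple[str, tuple[int, ...]]]) -> dict[str, int]:
    """Return {component_name: param_count} from (name, shape) pairs."""
    counts: dict[str, int] = {}
    for name, shape in named_shapes:
        if name in counts:
            # A repeated name would double-count, e.g. a tied LM head listed twice.
            raise ValueError(f"duplicate component name: {name}")
        counts[name] = prod(shape)
    return counts


def locked_config_shapes(d_model: int = 64, n_layers: int = 2,
                         d_ff: int = 256, vocab_size: int = 64):
    """Hand-written shapes for the locked config, using the suggested keys."""
    shapes = [("E", (vocab_size, d_model))]  # tied embedding: counted once
    for i in range(n_layers):
        p = f"block_{i}"
        shapes += [(f"{p}.attn.{w}", (d_model, d_model))
                   for w in ("W_q", "W_k", "W_v", "W_o")]
        shapes += [
            (f"{p}.ffn.W_1", (d_model, d_ff)), (f"{p}.ffn.b_1", (d_ff,)),
            (f"{p}.ffn.W_2", (d_ff, d_model)), (f"{p}.ffn.b_2", (d_model,)),
        ]
        shapes += [(f"{p}.{ln}.{g}", (d_model,))
                   for ln in ("ln1", "ln2") for g in ("gamma", "beta")]
    shapes += [("ln_final.gamma", (d_model,)), ("ln_final.beta", (d_model,))]
    return shapes


inv = inventory_from_shapes(locked_config_shapes())
assert sum(inv.values()) == 103_680  # integer counting, so exact equality
```

If MiniGPT is a `torch.nn.Module` (an assumption; this lab does not say), the pairs could instead come from `((name, tuple(p.shape)) for name, p in model.named_parameters())`. The keys would then follow your module's attribute names rather than the suggested ones.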
Task 2 — formula derivation in your notes
In learners/borja/phase-17/notes.md:
- Derive the per-component count for one transformer block from first principles. Show your work — e.g., "Q projection: maps \(d_\text{model}\) → \(d_\text{model}\), no bias, so \(d_\text{model}^2 = 4096\)."
- Sum to get the formula. Plug in. Match to your programmatic walk.
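- Repeat for the variant with biases on the attention projections (GPT-2 has them; LLaMA does not). New formula. New total. (Biases on Q/K/V alone add \(3 d_\text{model} n_\text{layers} = 384\) for Mini-GPT; also biasing the output projection, as GPT-2 does, adds \(4 d_\text{model} n_\text{layers} = 512\).)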
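To sanity-check the bias variant, a per-block count with explicit switches can help. This is a sketch only; `per_block_params` and its flags `qkv_bias` and `out_bias` are hypothetical names. It keeps the Q/K/V biases separate from the output-projection bias so the increment can be read off either way.

```python
def per_block_params(d_model: int, d_ff: int,
                     qkv_bias: bool = False, out_bias: bool = False) -> int:
    """Parameters in one transformer block under the chosen bias convention."""
    attention = 4 * d_model ** 2                # W_q, W_k, W_v, W_o
    if qkv_bias:
        attention += 3 * d_model                # b_q, b_k, b_v
    if out_bias:
        attention += d_model                    # b_o
    ffn = 2 * d_model * d_ff + d_ff + d_model   # W_1, W_2, b_1, b_2
    layer_norms = 4 * d_model                   # ln1 and ln2, gamma + beta each
    return attention + ffn + layer_norms


n_layers = 2
base = per_block_params(64, 256)
qkv_only = per_block_params(64, 256, qkv_bias=True)
all_four = per_block_params(64, 256, qkv_bias=True, out_bias=True)

print(n_layers * (qkv_only - base))   # 384
print(n_layers * (all_four - base))   # 512
```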
Task 3 — stacked bar visualisation
Produce a stacked bar plot with one segment per parameter group:
parameter group            params
─────────────────────────────────
embedding (tied)            4,096
block 0 attention          16,384
block 0 FFN                33,088
block 0 LayerNorms            256
block 1 attention          16,384
block 1 FFN                33,088
block 1 LayerNorms            256
final LayerNorm               128
─────────────────────────────────
TOTAL                     103,680
Save as experiments/<date>-phase-17-param-inventory/distribution.png. Use a horizontal bar chart for readability.
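A possible shape for the plotting code, assuming matplotlib is available in your environment. The names `GROUPS` and `plot_distribution` are illustrative, and the output path is left as an argument because the experiment directory depends on your date.

```python
from pathlib import Path

# Group totals for the locked config, copied from the table above.
GROUPS = {
    "embedding (tied)": 4_096,
    "block 0 attention": 16_384,
    "block 0 FFN": 33_088,
    "block 0 LayerNorms": 256,
    "block 1 attention": 16_384,
    "block 1 FFN": 33_088,
    "block 1 LayerNorms": 256,
    "final LayerNorm": 128,
}


def plot_distribution(groups: dict[str, int], out_path: str) -> None:
    """Draw one horizontal stacked bar, one segment per parameter group."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(9, 2.5))
    left = 0
    for name, count in groups.items():
        # Each segment starts where the previous one ended.
        ax.barh(0, count, left=left, label=f"{name} ({count:,})")
        left += count
    ax.set_yticks([])
    ax.set_xlabel("parameters")
    ax.set_title(f"Mini-GPT parameter distribution (total {left:,})")
    ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.35), ncol=2, fontsize=8)

    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

Call it as `plot_distribution(GROUPS, ".../distribution.png")` with your own experiment directory. In practice you would build `GROUPS` by summing the entries of your Task 1 inventory rather than hard-coding the numbers.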
Task 4 — comparison to GPT-2 small (just the numbers, for scale)
GPT-2 small: \(d_\text{model} = 768, n_\text{heads} = 12, n_\text{layers} = 12, d_\text{ff} = 3072, |V| = 50257\).
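Compute the parameter count using your formula and compare it to the published figures. (Note: the original GPT-2 release quoted 117M for this model, while later materials quote 124M. Tied embeddings do not explain that gap: counting the LM head as a separate matrix would add \(|V| \cdot d_\text{model} \approx 38.6\text{M}\). State which conventions your formula assumes: tied embeddings, attention biases, and GPT-2's learned positional embeddings, which the Mini-GPT formula does not include.)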
This is a paper-only exercise — you're not building GPT-2, just confirming your formula generalises.
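For reference, here is a sketch of the count under GPT-2's own conventions, which differ from Mini-GPT's. The assumptions, none of which are stated in this lab, are a learned positional embedding over a 1024-token context and biases on all four attention projections. The function name `gpt2_style_params` is hypothetical.

```python
def gpt2_style_params(d_model: int, n_layers: int, d_ff: int,
                      vocab_size: int, n_ctx: int, tied: bool = True) -> int:
    """Parameter count with GPT-2 conventions (assumed, see lead-in)."""
    token_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model                     # learned positions (assumption)
    attention = 4 * d_model ** 2 + 4 * d_model    # Q, K, V, O weights and biases
    ffn = 2 * d_model * d_ff + d_ff + d_model     # W_1, W_2, b_1, b_2
    layer_norms = 4 * d_model                     # two LayerNorms per block
    final_ln = 2 * d_model
    lm_head = 0 if tied else vocab_size * d_model
    return (token_emb + pos_emb
            + n_layers * (attention + ffn + layer_norms)
            + final_ln + lm_head)


tied_total = gpt2_style_params(768, 12, 3072, 50257, n_ctx=1024)
untied_total = gpt2_style_params(768, 12, 3072, 50257, n_ctx=1024, tied=False)
print(f"{tied_total:,}")    # 124,439,808
print(f"{untied_total:,}")  # 163,037,184
```

Applying the Mini-GPT formula unchanged to the same config gives a slightly smaller number, because it omits the positional table and the attention biases. Working out that difference by hand is a good check on your derivation.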
Task 5 — what changes when you scale?
For three scaled-up Mini-GPT variants:
| Variant | \(d_\text{model}\) | \(n_\text{layers}\) | \(d_\text{ff}\) | \(\lvert V \rvert\) | Total params (your formula) |
|---|---|---|---|---|---|
| Mini-GPT (locked) | 64 | 2 | 256 | 64 | 103,680 |
| Mini-GPT-medium | 128 | 4 | 512 | 64 | ? |
| Mini-GPT-large | 256 | 6 | 1024 | 64 | ? |
Compute and tabulate. Observe how the FFN parameter fraction grows with depth.
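Once you have filled in the table by hand, a short loop can confirm it and report the FFN share of the total. This is a sketch using the Mini-GPT conventions from the Background section; `VARIANTS`, `count`, and `scaling_rows` are illustrative names.

```python
VARIANTS = {
    "Mini-GPT (locked)": dict(d_model=64, n_layers=2, d_ff=256, vocab_size=64),
    "Mini-GPT-medium": dict(d_model=128, n_layers=4, d_ff=512, vocab_size=64),
    "Mini-GPT-large": dict(d_model=256, n_layers=6, d_ff=1024, vocab_size=64),
}


def count(d_model: int, n_layers: int, d_ff: int, vocab_size: int) -> tuple[int, int]:
    """Return (total_params, ffn_params) for one config."""
    ffn = n_layers * (2 * d_model * d_ff + d_ff + d_model)
    rest = (vocab_size * d_model                                # tied embedding
            + n_layers * (4 * d_model ** 2 + 4 * d_model)       # attention + block LayerNorms
            + 2 * d_model)                                      # final LayerNorm
    return ffn + rest, ffn


def scaling_rows() -> list[tuple[str, int, float]]:
    """One (name, total, ffn_fraction) row per variant."""
    rows = []
    for name, cfg in VARIANTS.items():
        total, ffn = count(**cfg)
        rows.append((name, total, ffn / total))
    return rows


if __name__ == "__main__":
    for name, total, frac in scaling_rows():
        print(f"{name:<20} {total:>12,} {frac:6.1%}")
```

Because \(d_\text{ff} = 4 d_\text{model}\) in all three variants, the FFN share within a block barely moves. Any change in the overall share comes mostly from the embedding and final LayerNorm being amortised over more blocks.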
Measurements to capture
- Programmatic count: 103,680 exactly (for the locked config).
- Closed-form match: ✓ or ✗.
- Stacked bar plot saved.
- Scaling table for the three variants.
Acceptance
- scripts/phase17_param_inventory.py returns the per-component dict.
- Sum equals the closed-form prediction exactly.
- Notes contain the formula derivation in your own words.
- Stacked bar plot saved.
- Scaling table populated.
- Lab notes identify the FFN as ~67% of per-block params (33,088 / 49,728) and ~64% of total (66,176 / 103,680) for the locked config.
Pitfalls to expect
- Forgetting the final LayerNorm. Common. Your count comes up 128 short; investigate before declaring victory.
- Counting an untied LM head. If you accidentally created self.W_LM as a separate matrix in mini_gpt.py, your count comes up \(|V| \cdot d_\text{model} = 4096\) too high. Re-read theory/03-tied-embeddings-and-lm-head.md.
- Off-by-one on the bias decision. Decide once whether your Q/K/V projections have biases. GPT-2: yes; LLaMA: no; Mini-GPT default: no (cleaner formula). Stay consistent across block.py, mini_gpt.py, and this lab.
- Counting in-place buffers. RoPE precomputes a cosine/sine table; that's a buffer, not a parameter. Do not include it in the parameter count.
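Several of these pitfalls leave a recognisable fingerprint in the difference between your measured total and the closed-form prediction. The helper below is a hypothetical debugging aid, not part of the lab's required script, and it covers only discrepancies that can be derived from the config.

```python
def diagnose(measured: int, expected: int,
             d_model: int, n_layers: int, vocab_size: int) -> str:
    """Map a count mismatch to a likely cause, when the gap is recognisable."""
    diff = measured - expected
    known = {
        0: "match",
        -2 * d_model: "final LayerNorm missing",
        vocab_size * d_model: "untied LM head counted as a separate matrix",
        3 * d_model * n_layers: "biases on Q/K/V counted",
        4 * d_model * n_layers: "biases on Q/K/V and output projection counted",
    }
    return known.get(diff, f"unexplained difference of {diff:+d}")


# Locked config: 128 short points at the final LayerNorm.
print(diagnose(103_552, 103_680, d_model=64, n_layers=2, vocab_size=64))
```

A buffer such as a RoPE table has a size that depends on sequence length and head dimension, so it falls into the "unexplained" branch here. If that branch fires, check whether a buffer slipped into the count.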