
Lab 02: Parameter inventory (count every parameter)

Read theory/02-ffn-and-activations.md (§"FFN parameter count") and PHASE_17_PLAN.md §6 (at repo root). Do not consult solutions/.

Objective

Walk through your Mini-GPT layer-by-layer and count every single parameter. Match the count to a closed-form formula. Plot the distribution as a stacked bar (where does the parameter mass live?). When you finish this lab you should be able to write down the parameter count of any Mini-GPT variant from its config alone.

Background

The closed-form parameter count for Mini-GPT (tied embeddings, biases on FFN, no biases on the Q/K/V/O attention projections, LayerNorm with scale + shift):

\[ |\theta| = \underbrace{|V| \cdot d_\text{model}}_\text{tied embedding} + n_\text{layers} \Big[ \underbrace{4 d_\text{model}^2}_\text{Q,K,V,O proj} + \underbrace{2 d_\text{model} d_\text{ff} + d_\text{ff} + d_\text{model}}_\text{FFN} + \underbrace{4 d_\text{model}}_\text{2 LN's} \Big] + \underbrace{2 d_\text{model}}_\text{final LN} \]

For Mini-GPT (\(d_\text{model} = 64, n_\text{heads} = 4, n_\text{layers} = 2, d_\text{ff} = 256, |V| = 64\)):

Tied embedding: \(64 \cdot 64 = 4{,}096\)
Per block: \(4 \cdot 64^2 + 2 \cdot 64 \cdot 256 + 256 + 64 + 4 \cdot 64 = 16{,}384 + 32{,}768 + 256 + 64 + 256 = 49{,}728\)
Two blocks: \(99{,}456\)
Final LN: \(128\)
Total: \(4{,}096 + 99{,}456 + 128 = 103{,}680\) parameters.
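The arithmetic above can be checked mechanically. The following is a minimal sketch of the closed-form formula as a function; the name `closed_form_param_count` and its argument names are illustrative choices, not identifiers from the course repo.

```python
def closed_form_param_count(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Closed-form count: tied embeddings, FFN biases, bias-free attention, LayerNorm scale + shift."""
    embedding = vocab_size * d_model            # tied with the LM head, so counted once
    attention = 4 * d_model * d_model           # W_q, W_k, W_v, W_o, no biases
    ffn = 2 * d_model * d_ff + d_ff + d_model   # W_1, W_2, b_1, b_2
    layer_norms = 4 * d_model                   # two LayerNorms per block, gamma + beta each
    final_ln = 2 * d_model                      # gamma + beta of the final LayerNorm
    return embedding + n_layers * (attention + ffn + layer_norms) + final_ln


# Locked Mini-GPT config from the text above.
locked = closed_form_param_count(vocab_size=64, d_model=64, n_layers=2, d_ff=256)
print(locked)  # 103680
```

Keeping each term on its own line makes it easy to see which one you dropped if your hand count disagrees.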

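If your earlier hand count was ~57k, re-derive it. The FFN biases and LayerNorm parameters add up to only 1,280 of the 103,680, so forgetting them cannot account for a gap that large; check the weight-matrix terms and the \(n_\text{layers}\) multiplier as well.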

Tasks

Task 1: programmatic walk

In a script scripts/phase17_param_inventory.py:

def inventory(model: MiniGPT) -> dict[str, int]:
    """Return {component_name: param_count} for every parameter in the model."""

Components to enumerate (suggested keys):

E (embedding, tied)
block_0.attn.W_q, W_k, W_v, W_o
block_0.ffn.W_1, b_1, W_2, b_2
block_0.ln1.gamma, beta, ln2.gamma, beta
(same for block_1)
ln_final.gamma, beta

Sum all entries and assert that the total equals the closed-form prediction (103,680 for the locked config) exactly. This is integer counting, not a numerical computation, so no tolerance applies.
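A reference to diff your walk against can be built from the config alone. The sketch below does not inspect a real model: it constructs the shapes implied by the locked config under the suggested keys and counts their elements. The helper names `expected_shapes` and `count_from_shapes` are hypothetical, and the weight orientation (input dimension first) is an assumption that does not affect the counts.

```python
from math import prod


def expected_shapes(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> dict[str, tuple[int, ...]]:
    """Shapes implied by the config, keyed like the suggested inventory entries."""
    shapes: dict[str, tuple[int, ...]] = {"E": (vocab_size, d_model)}  # tied embedding
    for i in range(n_layers):
        p = f"block_{i}"
        for name in ("W_q", "W_k", "W_v", "W_o"):
            shapes[f"{p}.attn.{name}"] = (d_model, d_model)  # no attention biases
        shapes[f"{p}.ffn.W_1"] = (d_model, d_ff)
        shapes[f"{p}.ffn.b_1"] = (d_ff,)
        shapes[f"{p}.ffn.W_2"] = (d_ff, d_model)
        shapes[f"{p}.ffn.b_2"] = (d_model,)
        for ln in ("ln1", "ln2"):
            shapes[f"{p}.{ln}.gamma"] = (d_model,)
            shapes[f"{p}.{ln}.beta"] = (d_model,)
    shapes["ln_final.gamma"] = (d_model,)
    shapes["ln_final.beta"] = (d_model,)
    return shapes


def count_from_shapes(shapes: dict[str, tuple[int, ...]]) -> dict[str, int]:
    """Number of elements per tensor: the product of its dimensions."""
    return {name: prod(shape) for name, shape in shapes.items()}


reference = count_from_shapes(expected_shapes(64, 64, 2, 256))
assert sum(reference.values()) == 103_680  # exact integer match, no tolerance
```

If your `inventory(model)` output differs from `reference` on any key, that key names the component to investigate.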

Task 2: formula derivation in your notes

In learners/borja/phase-17/notes.md:

Derive the per-component count for one transformer block from first principles. Show your work — e.g., "Q projection: maps \(d_\text{model}\) → \(d_\text{model}\), no bias, so \(d_\text{model}^2 = 4096\)."
Sum to get the formula. Plug in. Match to your programmatic walk.
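Repeat for the variant with biases on the attention projections (GPT-2 has them; LLaMA does not). New formula. New total. (Biases on Q/K/V only add \(3 d_\text{model} n_\text{layers} = 384\) for Mini-GPT; adding an output-projection bias as well, as GPT-2 does, gives \(+4 d_\text{model} n_\text{layers} = +512\).)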

Task 3: stacked bar visualisation

Produce a stacked bar plot with one column per architectural component:

parameter group        params
─────────────────────────────────
embedding (tied)        4,096
block 0 attention      16,384
block 0 FFN            33,088
block 0 LayerNorms        256
block 1 attention      16,384
block 1 FFN            33,088
block 1 LayerNorms        256
final LayerNorm           128
─────────────────────────────────
TOTAL                 103,680

Save as experiments/<date>-phase-17-param-inventory/distribution.png. Use a horizontal bar chart for readability.
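One possible reading of this task is a single horizontal bar with one segment per parameter group. The sketch below separates the bookkeeping (cumulative offsets) from the drawing so the numbers can be checked without a display. It assumes matplotlib is installed; `stacked_segments`, `plot_distribution`, and `GROUPS` are illustrative names, not part of the course repo.

```python
def stacked_segments(groups: list[tuple[str, int]]) -> list[tuple[str, int, int]]:
    """Turn (label, params) pairs into (label, left_offset, width) for one stacked bar."""
    segments = []
    left = 0
    for label, params in groups:
        segments.append((label, left, params))
        left += params  # the next segment starts where this one ends
    return segments


# Group totals taken from the table above (locked config).
GROUPS = [
    ("embedding (tied)", 4_096),
    ("block 0 attention", 16_384),
    ("block 0 FFN", 33_088),
    ("block 0 LayerNorms", 256),
    ("block 1 attention", 16_384),
    ("block 1 FFN", 33_088),
    ("block 1 LayerNorms", 256),
    ("final LayerNorm", 128),
]


def plot_distribution(groups: list[tuple[str, int]], path: str) -> None:
    """Draw one horizontal stacked bar and save it to `path`. Requires matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 2.5))
    for label, left, width in stacked_segments(groups):
        ax.barh(0, width, left=left, label=f"{label} ({width:,})")
    ax.set_yticks([])
    ax.set_xlabel("parameters")
    ax.legend(ncol=2, fontsize="small", loc="upper center", bbox_to_anchor=(0.5, -0.35))
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```

Calling `plot_distribution(GROUPS, "distribution.png")` writes the figure; substitute the experiment path required above. The LayerNorm segments will be barely visible at this scale, which is itself a useful observation for your notes.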

Task 4: comparison to GPT-2 small (just the numbers, for scale)

GPT-2 small: \(d_\text{model} = 768, n_\text{heads} = 12, n_\text{layers} = 12, d_\text{ff} = 3072, |V| = 50257\).

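Compute the parameter count using your formula and compare it to the published figures. (Note: GPT-2 small is quoted as 117M in the original release and as 124M in OpenAI's later correction. Your formula omits GPT-2's learned position embeddings and its attention-projection biases, so expect to land slightly below 124M; state which terms your number includes.)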

This is a paper-only exercise — you're not building GPT-2, just confirming your formula generalises.
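When comparing against published numbers, it helps to keep the Mini-GPT formula separate from the terms GPT-2 has and Mini-GPT lacks. The sketch below is one way to organise that, assuming GPT-2's context length of 1024 for the learned position table; the function names are hypothetical. Run it only after you have done the computation by hand.

```python
def formula_count(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """The lab's closed-form formula (tied embeddings, bias-free attention)."""
    per_block = 4 * d_model**2 + (2 * d_model * d_ff + d_ff + d_model) + 4 * d_model
    return vocab_size * d_model + n_layers * per_block + 2 * d_model


def gpt2_extras(d_model: int, n_layers: int, n_ctx: int) -> int:
    """Terms GPT-2 has that the lab's formula does not cover."""
    attn_biases = 4 * d_model * n_layers  # biases on the Q, K, V and output projections
    pos_embedding = n_ctx * d_model       # learned position table (Mini-GPT has none)
    return attn_biases + pos_embedding


base = formula_count(vocab_size=50257, d_model=768, n_layers=12, d_ff=3072)
full = base + gpt2_extras(d_model=768, n_layers=12, n_ctx=1024)
print(f"formula only:      {base:,}")
print(f"with GPT-2 extras: {full:,}")
```

Reporting both numbers makes the convention explicit instead of leaving the reader to guess which terms were counted.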

Task 5: what changes when you scale?

For the locked config and two scaled-up Mini-GPT variants:

| Variant | \(d_\text{model}\) | \(n_\text{layers}\) | \(d_\text{ff}\) | \(\lvert V \rvert\) | Total params (your formula) |
|---|---|---|---|---|---|
| Mini-GPT (locked) | 64 | 2 | 256 | 64 | 103,680 |
| Mini-GPT-medium | 128 | 4 | 512 | 64 | ? |
| Mini-GPT-large | 256 | 6 | 1024 | 64 | ? |

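Compute and tabulate. Observe how the FFN share of the total changes as the model scales: the embedding term \(|V| \cdot d_\text{model}\) does not grow with \(n_\text{layers}\), so the per-block terms account for more and more of the total.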
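To track shares across variants, it can help to compute group totals from the config and normalise them. The following sketch shows this for the locked config only; `group_totals` and `shares` are illustrative names, and applying them to the other two rows is left to you.

```python
def group_totals(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> dict[str, int]:
    """Parameter totals per architectural group, summed over all blocks."""
    return {
        "embedding": vocab_size * d_model,
        "attention": n_layers * 4 * d_model**2,
        "ffn": n_layers * (2 * d_model * d_ff + d_ff + d_model),
        "layernorm": n_layers * 4 * d_model + 2 * d_model,  # per-block LNs plus the final LN
    }


def shares(totals: dict[str, int]) -> dict[str, float]:
    """Fraction of the total parameter count held by each group."""
    total = sum(totals.values())
    return {name: count / total for name, count in totals.items()}


locked = group_totals(vocab_size=64, d_model=64, n_layers=2, d_ff=256)
for name, share in shares(locked).items():
    print(f"{name:<10} {locked[name]:>7,}  {share:6.1%}")
```

Because the groups sum to the closed-form total, the same function doubles as a consistency check on your table entries.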

Measurements to capture

Programmatic count: 103,680 exactly (for the locked config).
Closed-form match: ✓ or ✗.
Stacked bar plot saved.
Scaling table for the three variants.

Acceptance

scripts/phase17_param_inventory.py returns the per-component dict.
Sum equals the closed-form prediction exactly.
Notes contain the formula derivation in your own words.
Stacked bar plot saved.
Scaling table populated.
Lab notes identify the FFN as ~67% of per-block params and ~64% of total (for the locked config).

Pitfalls to expect

Forgetting the final LayerNorm. Common. Your count comes up 128 short; investigate before declaring victory.
Counting an untied LM head. If you accidentally created self.W_LM as a separate matrix in mini_gpt.py, your count comes up \(|V| \cdot d_\text{model} = 4096\) too high. Re-read theory/03-tied-embeddings-and-lm-head.md.
Off-by-one on the bias decision. Decide once whether your Q/K/V projections have biases. GPT-2: yes; LLaMA: no; Mini-GPT default: no (cleaner formula). Stay consistent across block.py, mini_gpt.py, and this lab.
Counting buffers as parameters. RoPE precomputes a cosine/sine table; that's a buffer, not a parameter. Do not include it in the parameter count.

Next: 03-causality-perturbation.md