Lab 02 — Parameter inventory: count every parameter

Read theory/02-ffn-and-activations.md (§"FFN parameter count") and PHASE_17_PLAN.md §6 (at repo root). Do not consult solutions/.

Objective

Walk through your Mini-GPT layer-by-layer and count every single parameter. Match the count to a closed-form formula. Plot the distribution as a stacked bar (where does the parameter mass live?). When you finish this lab you should be able to write down the parameter count of any Mini-GPT variant from its config alone.

Background

The closed-form parameter count for Mini-GPT (tied embeddings, biases on the FFN, no biases on the Q/K/V/O projections, LayerNorm with scale and shift):

\[ |\theta| = \underbrace{|V| \cdot d_\text{model}}_\text{tied embedding} + n_\text{layers} \Big[ \underbrace{4 d_\text{model}^2}_\text{Q,K,V,O proj} + \underbrace{2 d_\text{model} d_\text{ff} + d_\text{ff} + d_\text{model}}_\text{FFN} + \underbrace{4 d_\text{model}}_\text{2 LN's} \Big] + \underbrace{2 d_\text{model}}_\text{final LN} \]

For Mini-GPT (\(d_\text{model} = 64, n_\text{heads} = 4, n_\text{layers} = 2, d_\text{ff} = 256, |V| = 64\)):

  • Tied embedding: \(64 \cdot 64 = 4{,}096\)
  • Per block: \(4 \cdot 64^2 + 2 \cdot 64 \cdot 256 + 256 + 64 + 4 \cdot 64 = 16{,}384 + 32{,}768 + 256 + 64 + 256 = 49{,}728\)
  • Two blocks: \(99{,}456\)
  • Final LN: \(128\)
  • Total: \(4{,}096 + 99{,}456 + 128 = 103{,}680\) parameters.
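
The arithmetic above can be sketched as a small function. This is a minimal sketch under the same assumptions as the formula (tied embeddings, FFN biases, bias-free attention projections, LayerNorm with scale and shift); the function and argument names are illustrative, not taken from the course repo.

```python
def closed_form_params(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Closed-form parameter count for the Mini-GPT layout described above."""
    embedding = vocab_size * d_model            # tied: shared with the LM head
    attention = 4 * d_model * d_model           # W_q, W_k, W_v, W_o, no biases
    ffn = 2 * d_model * d_ff + d_ff + d_model   # W_1, W_2, b_1, b_2
    layer_norms = 2 * 2 * d_model               # two LayerNorms, gamma and beta each
    final_ln = 2 * d_model                      # gamma and beta
    return embedding + n_layers * (attention + ffn + layer_norms) + final_ln


total = closed_form_params(vocab_size=64, d_model=64, n_layers=2, d_ff=256)
print(total)  # 103680
```

Note that \(n_\text{heads}\) is not an argument: splitting \(d_\text{model}\) across heads does not change the shapes of the projection matrices.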

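If your earlier hand count was ~57k, omitted FFN biases or LayerNorm parameters are not enough to explain it: together they account for only 1,280 of the 103,680 parameters. A gap that large means a bigger term is missing. Re-derive.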

Tasks

Task 1 — programmatic walk

In a script scripts/phase17_param_inventory.py:

def inventory(model: MiniGPT) -> dict[str, int]:
    """Return {component_name: param_count} for every parameter in the model."""

Components to enumerate (suggested keys):

  • E (embedding, tied)
  • block_0.attn.W_q, W_k, W_v, W_o
  • block_0.ffn.W_1, b_1, W_2, b_2
  • block_0.ln1.gamma, beta, ln2.gamma, beta
  • (same for block_1)
  • ln_final.gamma, beta

Sum all entries and assert that the total equals the closed-form prediction (103,680 for the locked config) exactly. This is integer counting, not a numerical computation, so no tolerance applies.
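
One way to approach the walk is sketched below. It assumes your model can yield `(name, parameter)` pairs and that each parameter exposes a `.shape` attribute (as NumPy arrays and PyTorch tensors do); `count_unique` and `FakeParam` are hypothetical names used only for illustration. The point it shows is that a tied weight reachable under two names must be counted once.

```python
from math import prod
from typing import Iterable, Protocol


class HasShape(Protocol):
    shape: tuple[int, ...]


def count_unique(named_params: Iterable[tuple[str, HasShape]]) -> dict[str, int]:
    """Return {name: element_count}, counting each underlying object once."""
    seen: dict[int, HasShape] = {}  # holding the objects keeps their ids stable
    counts: dict[str, int] = {}
    for name, param in named_params:
        if id(param) in seen:       # tied weight reached under a second name
            continue
        seen[id(param)] = param
        counts[name] = prod(param.shape)
    return counts


class FakeParam:
    """Stand-in for a real tensor; only the shape matters for counting."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape


E = FakeParam(64, 64)
demo = [("E", E), ("block_0.attn.W_q", FakeParam(64, 64)), ("lm_head", E)]
counts = count_unique(demo)
print(counts)  # {'E': 4096, 'block_0.attn.W_q': 4096}
```

The second appearance of `E` under the name `lm_head` is skipped, so the tied embedding contributes 4,096 rather than 8,192.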

Task 2 — formula derivation in your notes

In learners/borja/phase-17/notes.md:

  1. Derive the per-component count for one transformer block from first principles. Show your work, e.g., "Q projection: maps \(d_\text{model} \to d_\text{model}\), no bias, so \(d_\text{model}^2 = 4096\)."
  2. Sum to get the formula. Plug in. Match to your programmatic walk.
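  3. Repeat for the variant with biases on the attention projections (GPT-2 has them; LLaMA does not). New formula. New total. (For Mini-GPT: \(+3 d_\text{model} n_\text{layers} = +384\) with biases on Q/K/V only, or \(+4 d_\text{model} n_\text{layers} = +512\) if the output projection also gets one, as in GPT-2.)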
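
As a check on step 3, the bias variant can be written as a correction to the base formula. Each biased projection adds one vector of length \(d_\text{model}\) per block; which projections carry a bias is a choice you must fix yourself, so both cases are shown, with \(|\theta| = 103{,}680\) taken from the locked config.

\[ |\theta|_\text{Q,K,V bias} = |\theta| + 3\, d_\text{model}\, n_\text{layers} = 103{,}680 + 3 \cdot 64 \cdot 2 = 104{,}064 \]

\[ |\theta|_\text{Q,K,V,O bias} = |\theta| + 4\, d_\text{model}\, n_\text{layers} = 103{,}680 + 4 \cdot 64 \cdot 2 = 104{,}192 \]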

Task 3 — stacked bar visualisation

Produce a stacked bar plot with one segment per parameter group:

parameter group        params
─────────────────────────────────
embedding (tied)        4,096
block 0 attention      16,384
block 0 FFN            33,088
block 0 LayerNorms        256
block 1 attention      16,384
block 1 FFN            33,088
block 1 LayerNorms        256
final LayerNorm           128
─────────────────────────────────
TOTAL                 103,680

Save as experiments/<date>-phase-17-param-inventory/distribution.png. Use a horizontal bar chart for readability.
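
A possible starting point for the plot is sketched below. It assumes matplotlib is installed; `GROUPS`, `stacked_segments`, and `save_plot` are hypothetical names, and the output path is left as an argument because the dated experiment directory is yours to choose. The counts are the ones from the table above.

```python
GROUPS = [
    ("embedding (tied)", 4096),
    ("block 0 attention", 16384),
    ("block 0 FFN", 33088),
    ("block 0 LayerNorms", 256),
    ("block 1 attention", 16384),
    ("block 1 FFN", 33088),
    ("block 1 LayerNorms", 256),
    ("final LayerNorm", 128),
]


def stacked_segments(groups):
    """Return (label, left_offset, width) so segments sit end to end on one bar."""
    segments, left = [], 0
    for label, width in groups:
        segments.append((label, left, width))
        left += width
    return segments


def save_plot(groups, path):
    """Draw one horizontal stacked bar and write it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 2.5))
    for label, left, width in stacked_segments(groups):
        ax.barh(0, width, left=left, label=label)
    ax.set_yticks([])
    ax.set_xlabel("parameters")
    ax.legend(ncol=4, fontsize="small", loc="upper center", bbox_to_anchor=(0.5, -0.35))
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```

Calling `save_plot(GROUPS, "distribution.png")` writes the figure; the small LayerNorm segments will be barely visible at this scale, which is itself part of the answer to "where does the parameter mass live?".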

Task 4 — comparison to GPT-2 small (just the numbers, for scale)

GPT-2 small: \(d_\text{model} = 768, n_\text{heads} = 12, n_\text{layers} = 12, d_\text{ff} = 3072, |V| = 50257\).

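Compute the parameter count using your formula and compare it to the published figures. (Note: the figure originally published was 117M; OpenAI later corrected it to 124M. Your formula should land close to the corrected figure. Account for the remaining gap: GPT-2 also has learned position embeddings and biases on its attention projections, which the Mini-GPT formula does not count.)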

This is a paper-only exercise — you're not building GPT-2, just confirming your formula generalises.

Task 5 — what changes when you scale?

For three scaled-up Mini-GPT variants:

| Variant | \(d_\text{model}\) | \(n_\text{layers}\) | \(d_\text{ff}\) | \(\lvert V \rvert\) | Total params (your formula) |
|---|---|---|---|---|---|
| Mini-GPT (locked) | 64 | 2 | 256 | 64 | 103,680 |
| Mini-GPT-medium | 128 | 4 | 512 | 64 | ? |
| Mini-GPT-large | 256 | 6 | 1024 | 64 | ? |

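Compute and tabulate. Observe how the FFN share of the total parameter count changes from one variant to the next (width and depth both grow here, so it is not a depth-only effect).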
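
To track the FFN share across variants, a helper along these lines may be useful. It is a sketch under the same assumptions as the closed-form formula; `ffn_share` and `VARIANTS` are illustrative names, and the configs are the three rows of the table above.

```python
def ffn_share(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> tuple[float, float]:
    """Return (FFN fraction of one block, FFN fraction of the whole model)."""
    ffn = 2 * d_model * d_ff + d_ff + d_model        # W_1, W_2, b_1, b_2
    block = 4 * d_model**2 + ffn + 4 * d_model       # attention + FFN + two LayerNorms
    total = vocab_size * d_model + n_layers * block + 2 * d_model
    return ffn / block, n_layers * ffn / total


# (vocab_size, d_model, n_layers, d_ff) for each row of the scaling table
VARIANTS = {
    "Mini-GPT (locked)": (64, 64, 2, 256),
    "Mini-GPT-medium": (64, 128, 4, 512),
    "Mini-GPT-large": (64, 256, 6, 1024),
}

for name, config in VARIANTS.items():
    per_block, of_total = ffn_share(*config)
    print(f"{name:20s} FFN: {per_block:.1%} of a block, {of_total:.1%} of the model")
```

Comparing the two columns separates what happens inside a block from what happens to the embedding's share as the model grows.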

Measurements to capture

  • Programmatic count: 103,680 exactly (for the locked config).
  • Closed-form match: ✓ or ✗.
  • Stacked bar plot saved.
  • Scaling table for the three variants.

Acceptance

  • scripts/phase17_param_inventory.py returns the per-component dict.
  • Sum equals the closed-form prediction exactly.
  • Notes contain the formula derivation in your own words.
  • Stacked bar plot saved.
  • Scaling table populated.
  • Lab notes identify the FFN as ~67% of per-block params and ~64% of total (for the locked config).

Pitfalls to expect

  • Forgetting the final LayerNorm. Common. Your count comes up 128 short; investigate before declaring victory.
  • Counting an untied LM head. If you accidentally created self.W_LM as a separate matrix in mini_gpt.py, your count comes up \(|V| \cdot d_\text{model} = 4096\) too high. Re-read theory/03-tied-embeddings-and-lm-head.md.
  • Off-by-one on the bias decision. Decide once whether your Q/K/V projections have biases. GPT-2: yes; LLaMA: no; Mini-GPT default: no (cleaner formula). Stay consistent across block.py, mini_gpt.py, and this lab.
  • Counting buffers as parameters. RoPE precomputes a cosine/sine table; that is a buffer, not a learned parameter. Exclude it from the parameter count.
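
When the totals disagree, comparing the expected and actual inventories key by key usually points straight at one of the pitfalls above. The helper below is a minimal sketch; `diagnose` is a hypothetical name, and the two small dicts stand in for your full closed-form and programmatic inventories.

```python
def diagnose(expected: dict[str, int], actual: dict[str, int]) -> list[str]:
    """List the differences between two {component: param_count} inventories."""
    messages = []
    for key in sorted(expected.keys() - actual.keys()):
        messages.append(f"missing {key}: count is {expected[key]} short")
    for key in sorted(actual.keys() - expected.keys()):
        messages.append(f"unexpected {key}: count is {actual[key]} too high")
    for key in sorted(expected.keys() & actual.keys()):
        if expected[key] != actual[key]:
            messages.append(f"{key}: expected {expected[key]}, got {actual[key]}")
    return messages


expected = {"E": 4096, "ln_final.gamma": 64, "ln_final.beta": 64}
actual = {"E": 4096, "W_LM": 4096}  # final LayerNorm forgotten, LM head untied
report = diagnose(expected, actual)
for line in report:
    print(line)
```

In this toy case the report flags the two missing final-LayerNorm vectors (128 parameters in total) and the extra untied `W_LM` matrix.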

Next: 03-causality-perturbation.md