Lab 01 — Assemble Mini-GPT
Read theory/01-transformer-block.md and theory/03-tied-embeddings-and-lm-head.md. Do not consult solutions/.
Objective
Wire your Phase 13 embeddings, Phase 16 RoPE, Phase 15 MHA, and Lab 00's TransformerBlock into a single end-to-end MiniGPT class. Run a forward pass on the canonical 8-token verb-grammar example, then verify the output shape and confirm that information flows from the input to the output.
Setup
src/minimodel/mini_gpt.py is the new module. Locked config (from PHASE_17_PLAN.md §1):
config = MiniGPTConfig(
    d_model=64,
    n_heads=4,
    n_layers=2,
    d_ff=256,
    vocab_size=64,
    context_len=32,
)
These are not magic numbers — they're the locked-in numbers used for the rest of the project. Once you write mini_gpt.py, do not change these defaults without a phase: revise 17-* commit.
Tasks
Task 1 — MiniGPTConfig dataclass
from dataclasses import dataclass

@dataclass(frozen=True)
class MiniGPTConfig:
    d_model: int
    n_heads: int
    n_layers: int
    d_ff: int
    vocab_size: int
    context_len: int

    def __post_init__(self):
        # Validation: d_model % n_heads == 0, etc.
        ...
Add validation: d_model % n_heads == 0, all dimensions positive, context_len >= 1. A bad config should raise at construction, not later.
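One way the validation rules above could look is sketched below. Treat it as an illustration rather than the reference solution: raising ValueError and rejecting non-int values are assumptions, since the lab only requires that a bad config fails at construction.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MiniGPTConfig:
    d_model: int
    n_heads: int
    n_layers: int
    d_ff: int
    vocab_size: int
    context_len: int

    def __post_init__(self):
        # Every field must be a positive int; this also covers context_len >= 1.
        # (bool is a subclass of int, so it is rejected explicitly.)
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or isinstance(value, bool) or value < 1:
                raise ValueError(f"{f.name} must be a positive int, got {value!r}")
        # Heads split d_model evenly, so the division must be exact.
        if self.d_model % self.n_heads != 0:
            raise ValueError(
                f"d_model ({self.d_model}) must be divisible by n_heads ({self.n_heads})"
            )
```

Because the dataclass is frozen, the check in __post_init__ is the only place a config can be validated: once constructed, its fields cannot be reassigned.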
Task 2 — MiniGPT class
import numpy as np
from numpy.typing import NDArray

# Parameter, LayerNorm and TransformerBlock come from your earlier phases and Lab 00.

class MiniGPT:
    def __init__(self, config: MiniGPTConfig):
        self.config = config
        # Token embedding table, shape (vocab_size, d_model).
        self.E = Parameter(np.random.randn(config.vocab_size, config.d_model) * 0.02)
        self.blocks = [
            TransformerBlock(config.d_model, config.n_heads, config.d_ff)
            for _ in range(config.n_layers)
        ]
        self.ln_final = LayerNorm(config.d_model)
        # NOTE: no separate LM head; the output projection is tied to self.E.

    def __call__(self, tokens: NDArray[np.int64]) -> NDArray[np.float64]:
        """tokens: (T,) int64 → logits: (T, vocab_size) float64."""
        h = self.E[tokens]          # embed: (T, d_model)
        for block in self.blocks:
            h = block(h)
        h = self.ln_final(h)        # final LayerNorm before the LM head
        logits = h @ self.E.T       # tied projection: (T, vocab_size)
        return logits
Constraints:
- Tied embedding: the LM head reuses self.E. Never create self.W_LM as a separate parameter.
- No softmax inside forward; return logits.
- Accept tokens as an int64 array of shape (T,). Batched calls ((B, T)) are an optional extension; do the single-sequence form first.
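The tied-embedding constraint can be illustrated with plain NumPy arrays. This is a sketch only: the lab's Parameter wrapper is replaced by a bare ndarray, and the in-place subtraction is a stand-in for an optimiser step, not the project's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, T = 64, 64, 8

E = rng.normal(size=(vocab_size, d_model)) * 0.02   # the single shared matrix
h = rng.normal(size=(T, d_model))                   # stand-in for final hidden states

# Tied head: E.T is a view of E, so there is still exactly one parameter.
logits_tied = h @ E.T
tied_is_view = np.shares_memory(E, E.T)

# Untied by accident: a copy is a second, independent array.
W_LM = E.copy()
copy_is_view = np.shares_memory(E, W_LM)

# Simulate an in-place update to E; the copy does not follow it.
E -= 0.1
heads_agree = np.allclose(h @ E.T, h @ W_LM.T)

print(tied_is_view, copy_is_view, heads_agree)
```

The transpose shares memory with E while the copy does not, which is why an update to E leaves the copied head behind.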
Task 3 — run on the canonical 8-token sequence
The Phase 12 corpus gave you a tokenizer. Use it to encode:
<bos> I work , you work , he
That's 8 tokens (including <bos>). Pass them through Mini-GPT. The output should have shape (8, 64).
model = MiniGPT(config)
tokens = tokenizer.encode("<bos> I work , you work , he")
assert len(tokens) == 8
logits = model(tokens)
assert logits.shape == (8, 64)
Now sanity-check: print the top-5 predicted tokens at position 7. This is the position of he, the last input token, so its logits are the model's prediction for the token that follows.
Expected: random garbage. The model is untrained. Lab 01 is verifying mechanism, not capability. Training is Phase 18, at which point this same readout should show something like works near the top.
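The top-5 readout can be sketched with np.argsort. The logits here are random stand-ins so the snippet runs on its own; in the lab they would come from model(tokens). Mapping ids back to strings depends on your Phase 12 tokenizer's API, so the sketch prints ids only.

```python
import numpy as np


def top_k_ids(logits_row, k=5):
    """Return the k token ids with the largest logits, best first."""
    order = np.argsort(logits_row)[::-1]   # ascending sort, reversed to descending
    return order[:k]


# Stand-in for the model output; the real call would be logits = model(tokens).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 64))

top5 = top_k_ids(logits[7], k=5)
for token_id in top5:
    print(int(token_id), float(logits[7, token_id]))
```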
Task 4 — sanity: forward depends on the entire prefix
Verify that changing token 0 changes logits at position 7. Specifically:
tokens_a = tokenizer.encode("<bos> I work , you work , he")
tokens_b = tokens_a.copy()
tokens_b[0] = some_other_token_id # change <bos> to something else
logits_a = model(tokens_a)
logits_b = model(tokens_b)
# Output at position 7 must differ in some component.
assert not np.allclose(logits_a[7], logits_b[7])
This is a weak test, but it catches one common bug: forgetting to wire the embedding or to chain through all blocks.
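To see why this check separates a wired model from a broken one, here is a toy stand-in (not Mini-GPT): a causal running mean plays the role of attention, and the token ids are hypothetical. With mixing, position 7 sees token 0. Without it, position 7 depends only on its own token and the assertion from Task 4 would fail.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 64, 16
E = rng.normal(size=(vocab_size, d_model))


def toy_forward(tokens, mix=True):
    """Toy model: embed, optionally mix causally, project with the tied E."""
    h = E[tokens]                                    # (T, d_model)
    if mix:
        # Causal running mean: position t averages embeddings 0..t
        # (a crude substitute for causal attention).
        counts = np.arange(1, len(tokens) + 1)[:, None]
        h = np.cumsum(h, axis=0) / counts
    return h @ E.T                                   # (T, vocab_size)


tokens_a = np.array([0, 5, 9, 2, 7, 9, 2, 11])       # hypothetical token ids
tokens_b = tokens_a.copy()
tokens_b[0] = 1                                      # change only the first token

wired_differs = not np.allclose(toy_forward(tokens_a)[7], toy_forward(tokens_b)[7])
broken_differs = not np.allclose(
    toy_forward(tokens_a, mix=False)[7], toy_forward(tokens_b, mix=False)[7]
)
print(wired_differs, broken_differs)
```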
Task 5 — RoPE-vs-no-RoPE comparison
Run the same forward with RoPE disabled (replace RoPE with identity) and with RoPE enabled. Output should differ. Capture the L2 norm of the difference for at least one position.
This isn't a test you can fail — it just confirms RoPE is on the path. Lab 03 will go further with causality testing.
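The comparison can be sketched on a single causal attention head. Everything here is an assumption made for illustration: the interleaved-pair RoPE layout, the base of 10000, and applying the rotation to queries and keys only. Your Phase 16 implementation may use a different layout.

```python
import numpy as np


def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x (T, d) by position-dependent angles."""
    T, d = x.shape
    pos = np.arange(T)[:, None]                       # (T, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)         # (d/2,)
    angles = pos * freqs                              # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def causal_attention(q, k, v, positional=lambda x: x):
    """Single-head causal attention; positional is applied to q and k only."""
    q, k = positional(q), positional(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
T, d = 8, 16
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

out_rope = causal_attention(q, k, v, positional=rope)
out_plain = causal_attention(q, k, v)                 # identity in place of RoPE
l2_per_position = np.linalg.norm(out_rope - out_plain, axis=-1)
print(l2_per_position)
```

In this sketch the difference at position 0 is zero: under a causal mask that position attends only to itself, so its output is v[0] either way. If your model follows the same design, report the L2 norm at a later position such as 7.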
Measurements to capture
- Forward-pass wall-clock for the 8-token sequence on Borja's CPU. Should be ~10–50 ms.
- Output shape: (8, 64).
- Top-5 predictions at position 7, captured as-is (they will be random).
- L2 difference between RoPE-enabled and RoPE-disabled forwards.
Save to experiments/<date>-phase-17-mini-gpt-assembly/manifest.json.
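A possible way to collect these measurements is sketched below. The manifest key names are hypothetical (the lab does not fix a schema), taking the median over repeats is a choice made here, and fake_forward stands in for the real model so the snippet runs on its own.

```python
import json
import time

import numpy as np


def time_forward_ms(forward, tokens, repeats=20):
    """Median wall-clock of forward(tokens) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        forward(tokens)
        samples.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(samples))


# Stand-in forward so the sketch is self-contained; swap in the real model.
def fake_forward(tokens):
    return np.zeros((len(tokens), 64))


tokens = np.arange(8)
logits = fake_forward(tokens)

manifest = {                                    # key names are hypothetical
    "forward_ms": time_forward_ms(fake_forward, tokens),
    "output_shape": list(logits.shape),
    "top5_position_7": np.argsort(logits[7])[::-1][:5].tolist(),
    "rope_l2_diff_position_7": None,            # fill in from the Task 5 comparison
}
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Writing manifest_json to the experiment directory is left out because the directory name contains the run date.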
Acceptance
- src/minimodel/mini_gpt.py exists with MiniGPTConfig and MiniGPT.
- MiniGPTConfig validates inputs.
- Tied embedding: MiniGPT has exactly one parameter matrix of shape (vocab_size, d_model). (Inspect via model.E.shape.)
- Forward on 8-token canonical sequence produces shape (8, 64).
- Changing token 0 changes position-7 logits.
- No softmax inside forward (logits returned raw).
- No transformers library used. No PyTorch.
Pitfalls to expect
- Untying by accident. self.W_LM = self.E.copy() makes a different parameter. Don't do this. Use self.E.T directly in the forward.
- Missing the final LayerNorm. Pre-LN transformers add an LN after the last block, before the LM head. Skipping it means the LM head sees unnormalised residuals, which works mathematically but is the wrong architecture. Lab 02 will catch this in the parameter inventory (you'll be 128 params short).
- Off-by-one with context_len vs T. context_len is a maximum. The actual sequence length T can be anything up to it. Don't pad to context_len; only the RoPE table needs to span up to context_len.
- Returning probabilities instead of logits. The model returns logits. Softmax is the job of the loss (Phase 18) and the sampler (Phase 21). If you compute np.exp(logits) / ... inside the model, remove it.
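The "128 params short" figure follows from the locked config, assuming the final LayerNorm carries a learned scale and a learned shift, each of length d_model. A quick arithmetic check:

```python
d_model = 64
vocab_size = 64

# Final LayerNorm: one scale vector (gamma) plus one shift vector (beta).
ln_final_params = 2 * d_model

# Tied head: the embedding matrix is counted once and reused as the projection.
tied_embedding_params = vocab_size * d_model

print(ln_final_params, tied_embedding_params)
```

If your LayerNorm has no shift term, the shortfall would be d_model rather than 2 * d_model, so a mismatch with 128 is worth checking against your Lab 00 implementation.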