
Lab 01 — Assemble Mini-GPT

Read theory/01-transformer-block.md and theory/03-tied-embeddings-and-lm-head.md. Do not consult solutions/.

Objective

Wire your Phase 13 embeddings, Phase 16 RoPE, Phase 15 MHA, and Lab 00's TransformerBlock into a single end-to-end MiniGPT class. Run a forward pass on the canonical 8-token verb-grammar example, then verify the output shape and that information flows from the input to the output.

Setup

src/minimodel/mini_gpt.py is the new module. Locked config (from PHASE_17_PLAN.md §1):

config = MiniGPTConfig(
    d_model=64,
    n_heads=4,
    n_layers=2,
    d_ff=256,
    vocab_size=64,
    context_len=32,
)

These are not magic numbers; they are the values locked in for the rest of the project. Once you write mini_gpt.py, do not change them without a phase: revise 17-* commit.

Tasks

Task 1 — `MiniGPTConfig` dataclass

from dataclasses import dataclass

@dataclass(frozen=True)
class MiniGPTConfig:
    d_model: int
    n_heads: int
    n_layers: int
    d_ff: int
    vocab_size: int
    context_len: int

    def __post_init__(self):
        # Validation: d_model % n_heads == 0, etc.
        ...

Add validation: d_model % n_heads == 0, all dimensions positive, context_len >= 1. A bad config should raise at construction, not later.
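
One way to write these checks, as a minimal sketch. The use of ValueError and the wording of the messages are assumptions, not something the plan prescribes; adapt them to your own conventions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MiniGPTConfig:
    d_model: int
    n_heads: int
    n_layers: int
    d_ff: int
    vocab_size: int
    context_len: int

    def __post_init__(self):
        # Every field is a size or a count, so each must be a positive int.
        # (bool is a subclass of int in Python, so reject it explicitly.)
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError(f"{f.name} must be a positive int, got {value!r}")
        # Heads split d_model evenly: d_head = d_model // n_heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError(
                f"d_model={self.d_model} is not divisible by n_heads={self.n_heads}"
            )
```

Because the dataclass is frozen, `__post_init__` only reads fields and never assigns to them, so the checks run once at construction and the object cannot drift into an invalid state afterwards.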

Task 2 — `MiniGPT` class

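import numpy as np
from numpy.typing import NDArray

# Parameter, TransformerBlock, and LayerNorm are assumed to come from your
# earlier phases and Lab 00; import them from wherever you defined them.

class MiniGPT: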
    def __init__(self, config: MiniGPTConfig):
        self.config = config
        self.E = Parameter(np.random.randn(config.vocab_size, config.d_model) * 0.02)
        self.blocks = [
            TransformerBlock(config.d_model, config.n_heads, config.d_ff)
            for _ in range(config.n_layers)
        ]
        self.ln_final = LayerNorm(config.d_model)
        # NOTE: no separate LM head — tied with self.E.

    def __call__(self, tokens: NDArray[np.int64]) -> NDArray[np.float64]:
        """tokens: (T,) int64 → logits: (T, vocab_size) float64."""
        h = self.E[tokens]
        for block in self.blocks:
            h = block(h)
        h = self.ln_final(h)
        logits = h @ self.E.T
        return logits

Constraints:

Tied embedding: the LM head reuses self.E. Never create self.W_LM as a separate parameter.
No softmax inside forward — return logits.
Accept tokens as int64 array of shape (T,). Batched calls ((B, T)) are an optional extension; do the single-sequence form first.
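
The shape bookkeeping behind the tied-embedding constraint can be checked in isolation. The sketch below uses a plain NumPy array as a stand-in for self.E (not the project's Parameter class), skips the blocks and the final LayerNorm entirely, and uses made-up token ids.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, T = 64, 64, 8

# Stand-in for self.E: a plain array, not the project's Parameter class.
E = rng.standard_normal((vocab_size, d_model)) * 0.02
tokens = np.array([0, 5, 9, 1, 7, 9, 1, 3], dtype=np.int64)  # made-up ids

h = E[tokens]      # embedding lookup: (T, d_model)
logits = h @ E.T   # tied LM head:     (T, vocab_size)

assert h.shape == (T, d_model)
assert logits.shape == (T, vocab_size)
# With nothing between lookup and head, logits[t, v] is just the dot product
# of token t's embedding with token v's embedding.
assert np.isclose(logits[0, 5], E[tokens[0]] @ E[5])
```

The same matrix appears on both ends of the model: rows are read out by the lookup, and the transpose projects hidden states back onto the vocabulary.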

Task 3 — run on the canonical 8-token sequence

The Phase 12 corpus gave you a tokenizer. Use it to encode:

<bos> I work , you work , he

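That's 8 tokens (including <bos>). Pass them through Mini-GPT. The output should have shape (8, 64): one row per position and one column per vocabulary entry. The 64 here is vocab_size, which happens to equal d_model in the locked config.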

model = MiniGPT(config)
tokens = tokenizer.encode("<bos> I work , you work , he")
assert len(tokens) == 8
logits = model(tokens)
assert logits.shape == (8, 64)

Now sanity-check: print the top-5 predicted tokens at position 7 (the last position, which holds he; its logits score the token that would come next):

top5 = np.argsort(logits[-1])[-5:][::-1]
print([tokenizer.decode([t]) for t in top5])

Expected: random garbage. The model is untrained. Lab 01 is verifying mechanism, not capability. Training is Phase 18 — at which point this same line should output something like works near the top.
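
The argsort idiom above is easy to get backwards, so here is the same logic on a small vector where the answer is obvious. The helper name top_k_ids is hypothetical.

```python
import numpy as np


def top_k_ids(logits_row, k=5):
    """Return the ids of the k largest logits, highest first."""
    order = np.argsort(logits_row)  # indices in ascending order of logit
    return order[-k:][::-1]         # take the last k, then reverse


row = np.array([0.1, 3.0, -1.0, 2.0, 0.5])
print(top_k_ids(row, k=3))  # ids 1, 3, 4 in that order
```

Softmax is monotonic, so ranking raw logits gives the same order as ranking probabilities; there is no need to normalise before picking the top 5.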

Task 4 — sanity: forward depends on the entire prefix

Verify that changing token 0 changes logits at position 7. Specifically:

tokens_a = tokenizer.encode("<bos> I work , you work , he")
tokens_b = tokens_a.copy()
tokens_b[0] = (tokens_a[0] + 1) % config.vocab_size  # any id other than <bos>
logits_a = model(tokens_a)
logits_b = model(tokens_b)
# Output at position 7 must differ in some component.
assert not np.allclose(logits_a[7], logits_b[7])

This is a weak test, but it catches two common wiring bugs: forgetting to wire the embedding, and failing to chain h through the blocks (in which case position 7 never sees token 0).
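
If you want to cover the whole prefix rather than only token 0, the check generalises to a loop. This is a sketch, not part of the acceptance criteria: prefix_sensitivity is a hypothetical helper, and toy_model is a stand-in (a causal running mean of embeddings with a tied head) so the snippet runs without your MiniGPT.

```python
import numpy as np


def prefix_sensitivity(model, tokens, vocab_size):
    """For each prefix position i, report whether swapping tokens[i]
    changes the logits at the last position."""
    tokens = np.asarray(tokens, dtype=np.int64)
    base = model(tokens)[-1]
    changed = []
    for i in range(len(tokens) - 1):
        perturbed = tokens.copy()
        # Shift the id by one (mod vocab) so the swap is never a no-op.
        perturbed[i] = (tokens[i] + 1) % vocab_size
        changed.append(not np.allclose(base, model(perturbed)[-1]))
    return changed


# Toy stand-in for MiniGPT: each position averages the embeddings of itself
# and everything before it, then projects with the tied head.
rng = np.random.default_rng(0)
E = rng.standard_normal((64, 64))


def toy_model(tokens):
    counts = np.arange(1, len(tokens) + 1)[:, None]
    h = np.cumsum(E[tokens], axis=0) / counts
    return h @ E.T


flags = prefix_sensitivity(toy_model, [0, 5, 9, 1, 7, 9, 1, 3], vocab_size=64)
assert all(flags)  # every prefix token influences the last position
```

Passing your own model in place of toy_model should also give all True; a False at some position points at a token that never reaches the last position.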

Task 5 — RoPE-vs-no-RoPE comparison

Run the same forward pass twice: once with RoPE enabled and once with RoPE disabled (RoPE replaced by the identity). The outputs should differ. Record the L2 norm of the difference for at least one position.

This isn't a test you can fail — it just confirms RoPE is on the path. Lab 03 will go further with causality testing.
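
Computing the L2 difference for every position at once is a one-liner. The helper name per_position_l2 is hypothetical, and the arrays below are hand-made stand-ins for the two sets of logits; how you switch RoPE off depends on your own Phase 16 code.

```python
import numpy as np


def per_position_l2(logits_a, logits_b):
    """L2 norm of the logit difference at each position: (T, V) -> (T,)."""
    diff = np.asarray(logits_a) - np.asarray(logits_b)
    return np.linalg.norm(diff, axis=-1)


a = np.zeros((2, 4))
b = np.array([[3.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
print(per_position_l2(a, b))  # position 0 differs by 5.0, position 1 by 0.0
```

One caveat when choosing which position to report, assuming causal attention with RoPE applied only to queries and keys: position 0 attends only to itself, so its attention weight is 1 either way and the difference there is expected to be zero. A later position, such as 7, is a more informative choice.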

Measurements to capture

Forward-pass wall-clock for the 8-token sequence on Borja's CPU. Should be ~10–50 ms.
Output shape: (8, 64).
Top-5 predictions at position 7 — captured (will be random).
L2 difference between RoPE-enabled and RoPE-disabled forwards.

Save to experiments/<date>-phase-17-mini-gpt-assembly/manifest.json.
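
A possible way to time the forward pass and write the manifest, using only the standard library. The helper names and the shape of the measurements dictionary are assumptions; the plan does not fix a schema for manifest.json here.

```python
import json
import statistics
import time
from pathlib import Path


def time_forward_ms(forward, *args, repeats=10):
    """Median wall-clock time of forward(*args), in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        forward(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def write_manifest(out_dir, measurements):
    """Write measurements to <out_dir>/manifest.json and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(measurements, indent=2))
    return path
```

In the lab you would call time_forward_ms(model, tokens) and pass the experiment directory as out_dir. Convert NumPy values to plain Python first (for example with .tolist() or float(...)), since json.dumps does not serialise NumPy arrays or scalars.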

Acceptance

src/minimodel/mini_gpt.py exists with MiniGPTConfig and MiniGPT.
MiniGPTConfig validates inputs.
Tied embedding: MiniGPT has exactly one embedding matrix, model.E, of shape (vocab_size, d_model), used for both the token lookup and the LM head; there is no separate W_LM. (Inspect via model.E.shape.)
Forward on 8-token canonical sequence produces shape (8, 64).
Changing token 0 changes position-7 logits.
No softmax inside forward (logits returned raw).
No transformers library used. No PyTorch.

Pitfalls to expect

Untying by accident. self.W_LM = self.E.copy() makes a different parameter. Don't do this. Use self.E.T directly in the forward.
Missing the final LayerNorm. Pre-LN transformers add an LN after the last block, before the LM head. Skipping it means the LM head sees unnormalised residuals: the forward pass still runs, but it is the wrong architecture. Lab 02 will catch this in the parameter inventory (you'll be 128 params short: the 64 scale and 64 shift values of a LayerNorm over d_model=64).
Off-by-one with context_len vs T. context_len is a maximum. The actual sequence length T can be anything up to it. Don't pad to context_len; only the RoPE table needs to span up to context_len.
Returning probabilities instead of logits. The model returns logits. Softmax is the job of the loss (Phase 18) and the sampler (Phase 21). If you compute np.exp(logits) / ... inside the model, remove it.
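
The first pitfall above (untying by accident) can be reproduced in plain NumPy. This sketch uses a bare array in place of the project's Parameter class, and a single in-place write as a stand-in for an optimizer update.

```python
import numpy as np

E = np.zeros((64, 64))

tied = E.T         # a view: shares memory with E
untied = E.copy()  # a new array: independent of E from here on

E[3, 0] = 1.0      # stand-in for one update to the embedding

assert np.shares_memory(E, tied)
assert not np.shares_memory(E, untied)
assert tied[0, 3] == 1.0    # the tied head sees the update
assert untied[3, 0] == 0.0  # the copy does not
```

A copy looks identical at initialisation, which is why the bug is easy to miss in an untrained forward pass; it only shows up once the two matrices start receiving different updates.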

Next: 02-parameter-inventory.md