

Phase 17 — Tiny Transformer Block & Mini-GPT

Requires: 10 — Initialization, Normalization, Residuals · 15 — Attention from Scratch · 16 — Positional Encodings

Teaches: transformer-block · pre-ln · ffn · gelu · tied-embeddings · lm-head

Jump to any chapter from the phase reference index.


Everything gets assembled here. Embedding (13) + RoPE (16) + multi-head attention (15) + FFN + LayerNorm + residual = one transformer block. Stack two, put a tied LM head on top, and you have a Mini-GPT. Nothing is trained. The model is only assembled, its parameters are counted one by one, and the forward pass is verified against a hand-built reference. Training: Phase 18.

Anchors: LYNX_CORTEX.md §4 / PHASE 17, PHASE_17_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13 (verb-grammar scope).

Why this phase exists

Phases 13–16 built the parts: embeddings, sequence-model baselines, multi-head attention, positional encodings. Phase 17 glues them into the smallest object that is recognisably a language model: the Pre-LN transformer block, stacked twice, with a tied LM head on top. The goal is mechanism, not capability: every operation, every shape, every parameter accounted for. Training is the next phase; sampling is four phases away. The deliverable is a NumPy class whose forward pass runs end-to-end on the canonical 8-token verb-grammar sequence and whose parameter count matches a closed-form formula to the digit.

What you'll build

A MiniGPT(d_model=64, n_heads=4, n_layers=2, d_ff=256, vocab_size=64) class composed of:

  • LayerNorm (own implementation, autograd-compatible via Phase 8 tensors)
  • FFN (two linear layers with GELU)
  • TransformerBlock (Pre-LN: LN → MHA → +res → LN → FFN → +res)
  • Stacked blocks + final LN + tied LM head (\(\text{logits} = h \cdot E^\top\))
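The wiring in the list above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the phase's reference solution: it uses raw arrays instead of Phase 8 autograd tensors, omits RoPE, uses NumPy's default float64 rather than fp32, picks the tanh approximation of GELU, and invents the helper names (`layer_norm`, `causal_mha`, `pre_ln_block`, `init_block`) and the token ids, which are placeholders rather than the canonical verb-grammar sequence.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector over its feature axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    """tanh approximation of GELU (an assumed choice; the exact form uses erf)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def causal_mha(x, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention with a causal mask; x has shape (T, d_model)."""
    T, d = x.shape
    d_head = d // n_heads

    def split(t):  # (T, d) -> (n_heads, T, d_head)
        return t.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(future, -np.inf, scores)            # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    merged = (weights @ v).transpose(1, 0, 2).reshape(T, d)
    return merged @ Wo

def pre_ln_block(x, p, n_heads):
    """Pre-LN order: LN -> MHA -> +res -> LN -> FFN -> +res."""
    normed = layer_norm(x, p["ln1_g"], p["ln1_b"])
    x = x + causal_mha(normed, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)
    normed = layer_norm(x, p["ln2_g"], p["ln2_b"])
    return x + (gelu(normed @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"])

def init_block(rng, d_model, d_ff, std=0.02):
    """Simple Gaussian init; std=0.02 is an assumed value."""
    def w(*shape):
        return rng.normal(0.0, std, size=shape)
    return {
        "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
        "Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
        "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
        "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
        "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

d_model, n_heads, n_layers, d_ff, vocab_size = 64, 4, 2, 256, 64
rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.02, size=(vocab_size, d_model))   # embedding, reused as LM head
blocks = [init_block(rng, d_model, d_ff) for _ in range(n_layers)]
ln_f_g, ln_f_b = np.ones(d_model), np.zeros(d_model)

tokens = np.array([3, 17, 5, 42, 8, 1, 30, 9])  # placeholder ids, 8 tokens
h = E[tokens]                                   # (8, d_model)
for p in blocks:
    h = pre_ln_block(h, p, n_heads)
h = layer_norm(h, ln_f_g, ln_f_b)               # final LN
logits = h @ E.T                                # tied LM head: (8, vocab_size)
print(logits.shape)                             # (8, 64)
```

Each sublayer reads a normalized copy of the residual stream and adds its output back to the unnormalized stream, which is what distinguishes Pre-LN from Post-LN. The LM head introduces no new matrix because it reuses `E`.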

Total parameter count for the locked config: 103,680 params (~103k), derived in lab 02.
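As a quick sanity check on that total, the count can be reproduced in a few lines. The text above does not say which layers carry biases, so the convention here is an assumption chosen because it lands on the stated figure: bias-free attention projections, biased FFN linears, LayerNorm with scale and shift, a parameter-free positional encoding, and an LM head that shares the embedding matrix. Under that convention the closed form is \(N = V d + L\,(4d^2 + 2 d\, d_{ff} + d_{ff} + 5d) + 2d\). The function name is hypothetical; lab 02 remains the authoritative derivation.

```python
def mini_gpt_param_count(d_model=64, n_layers=2, d_ff=256, vocab_size=64):
    """Count parameters under the assumed bias convention described above."""
    embedding = vocab_size * d_model          # tied: also serves as the LM head
    attention = 4 * d_model * d_model         # Wq, Wk, Wv, Wo, assumed bias-free
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # weights + biases
    layer_norms = 2 * (2 * d_model)           # two LNs per block, gamma and beta
    per_block = attention + ffn + layer_norms
    final_ln = 2 * d_model
    return embedding + n_layers * per_block + final_ln

print(mini_gpt_param_count())  # 103680
```

Note that `n_heads` does not appear: splitting `d_model` across heads changes shapes, not the number of weights.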

Files

phase-17-mini-gpt/
├── README.md                          # this file
├── theory/
│   ├── 00-motivation.md              # why glue now, what the residual stream is
│   ├── 01-transformer-block.md       # Pre-LN block anatomy, the residual stream
│   ├── 02-ffn-and-activations.md     # why FFN exists, GELU, the 4× ratio
│   └── 03-tied-embeddings-and-lm-head.md  # tied weights, the final softmax
├── lab/
│   ├── 00-block-by-hand.md           # one block forward on a 2-token, d_model=4 toy
│   ├── 01-assemble-mini-gpt.md       # full MiniGPT forward on 8-token verb sequence
│   ├── 02-parameter-inventory.md     # count params layer-by-layer; match the formula
│   └── 03-causality-perturbation.md  # verify causal mask holds end-to-end
├── solutions/                         # populated at phase-open; do NOT read first
├── notebooks/
└── diagrams/                          # block diagram, parameter stacked-bar, shape trace
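
Lab 03 in the tree above checks that the causal mask holds end-to-end. One common way to run such a check is a perturbation test: change the input at position t and confirm that outputs at positions before t are unchanged. The sketch below is an assumption about the general technique, not the lab's code. It uses a single-head causal attention layer as a stand-in for the full MiniGPT, and the function name and sizes are illustrative.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention; x has shape (T, d)."""
    T, d = x.shape
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ (x @ Wv)

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

t = 5                              # position to perturb
x_perturbed = x.copy()
x_perturbed[t] += 1.0

base = causal_attention(x, Wq, Wk, Wv)
perturbed = causal_attention(x_perturbed, Wq, Wk, Wv)

# Positions before t must not see the change; position t itself must.
unchanged_before_t = np.allclose(base[:t], perturbed[:t])
changed_at_t = not np.allclose(base[t], perturbed[t])
print(unchanged_before_t, changed_at_t)  # True True
```

Checking that position t does change guards against a test that passes trivially, for example because the perturbation never reached the model.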

What this phase does NOT cover

  • Training. Phase 18. No loss, no optimizer, no gradient step in Phase 17.
  • Sampling / generation. Phase 21. Phase 17 produces logits; it does not pick a token.
  • KV cache. Phase 22. Every forward in Phase 17 recomputes from scratch.
  • Dropout / weight init beyond simple Gaussian. Phase 18.
  • PyTorch cross-check. Phase 25 — Phase 17 is pure NumPy + the Phase 8 autograd.
  • Mixed precision / fp16 / bf16. Phase 23+. Mini-GPT is fp32 throughout.
  • Pre-LN derivation beyond a paragraph. Pre-LN is locked. Post-LN is a footnote.

Phase-open checklist (per CLAUDE.md §1)

  1. Re-read PHASE_17_PLAN.md §§0–8.
  2. Re-read LYNX_CORTEX.md §4 / PHASE 17 (lines 472–484) and LYNX_CORTEX_ADDENDUM.md §A13.
  3. Confirm Phase 16 lab 03 resolved RoPE vs sinusoidal — Phase 17 inherits that decision.
  4. Open src/minimodel/BLUEPRINT.md §4 (transformer) and request review before any .py files.
  5. Read theory/00-motivation.md first; do not skip to lab.

Next: theory/00-motivation.md

Further reading

Optional — enrichment, not required to pass the phase.