Phase 17 — Tiny Transformer Block & Mini-GPT
Requires: 10 — Initialization, Normalization, Residuals · 15 — Attention from Scratch · 16 — Positional Encodings
Teaches: transformer-block · pre-ln · ffn · gelu · tied-embeddings · lm-head
Jump to any chapter from the phase reference index.
Everything gets assembled here. Embedding (13) + RoPE (16) + multi-head attention (15) + FFN + LayerNorm + residual = one transformer block. Stack two, add a tied LM head, and you have a Mini-GPT. It is not trained. It is only assembled, its parameters are counted one by one, and its forward pass is verified against a hand-computed reference. Training: Phase 18.
Anchors: LYNX_CORTEX.md §4 / PHASE 17, PHASE_17_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13 (verb-grammar scope).
Why this phase exists
Phases 13–16 built the parts: embeddings, sequence-model baselines, multi-head attention, positional encodings. Phase 17 glues them into the smallest object that is recognisably a language model: the Pre-LN transformer block, stacked twice, with a tied LM head on top. The goal is mechanism, not capability: every operation, every shape, every parameter accounted for. Training is the next phase; sampling is four phases away. The deliverable is a NumPy class whose forward pass runs end-to-end on the canonical 8-token verb-grammar sequence and whose parameter count matches a closed-form formula to the digit.
What you'll build
A MiniGPT(d_model=64, n_heads=4, n_layers=2, d_ff=256, vocab_size=64) class composed of:
- LayerNorm (own implementation, autograd-compatible via Phase 8 tensors)
- FFN (two linear layers with GELU)
- TransformerBlock (Pre-LN: LN → MHA → +res → LN → FFN → +res)
- Stacked blocks + final LN + tied LM head (\(\text{logits} = h \cdot E^\top\))
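The composition listed above can be sketched in plain NumPy. This is a minimal illustration, not the phase's reference implementation: it skips the Phase 8 autograd tensors and RoPE, assumes the tanh approximation of GELU, uses bias-free attention projections, and runs on placeholder token ids rather than the canonical verb-grammar sequence. All function and key names (`pre_ln_block`, `init_block`, `mini_gpt_forward`, `w_q`, and so on) are hypothetical.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize over the last axis, then apply learned gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def gelu(x):
    """tanh approximation of GELU (assumed here; the erf form also works)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # row max subtracted for stability
    return e / e.sum(axis=-1, keepdims=True)

def causal_mha(x, p, n_heads):
    """Causal multi-head self-attention. RoPE is left out to keep the sketch short."""
    t, d = x.shape
    d_head = d // n_heads

    def split(m):  # (t, d) -> (n_heads, t, d_head)
        return m.reshape(t, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, t, t)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)   # True strictly above the diagonal
    weights = softmax(np.where(future, -np.inf, scores))
    heads = weights @ v                                   # (n_heads, t, d_head)
    return heads.transpose(1, 0, 2).reshape(t, d) @ p["w_o"]

def pre_ln_block(x, p, n_heads):
    """LN -> MHA -> +res -> LN -> FFN -> +res."""
    x = x + causal_mha(layer_norm(x, p["ln1_g"], p["ln1_b"]), p, n_heads)
    h = layer_norm(x, p["ln2_g"], p["ln2_b"])
    return x + gelu(h @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def init_block(rng, d_model, d_ff, std=0.02):
    """Plain Gaussian weights; std=0.02 is an illustrative value."""
    def w(rows, cols):
        return rng.normal(0.0, std, size=(rows, cols))
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
        "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
        "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
    }

def mini_gpt_forward(tokens, embedding, blocks, final_ln, n_heads):
    h = embedding[tokens]                 # (t, d_model)
    for p in blocks:
        h = pre_ln_block(h, p, n_heads)
    h = layer_norm(h, *final_ln)
    return h @ embedding.T                # tied LM head: (t, vocab_size)

rng = np.random.default_rng(0)
d_model, n_heads, n_layers, d_ff, vocab_size = 64, 4, 2, 256, 64
embedding = rng.normal(0.0, 0.02, size=(vocab_size, d_model))
blocks = [init_block(rng, d_model, d_ff) for _ in range(n_layers)]
final_ln = (np.ones(d_model), np.zeros(d_model))
tokens = np.array([3, 17, 42, 5, 9, 60, 1, 28])  # placeholder ids, not the verb-grammar sequence
logits = mini_gpt_forward(tokens, embedding, blocks, final_ln, n_heads)
print(logits.shape)  # (8, 64)
```

The same `embedding` matrix is used for the lookup and for the output projection, which is what "tied" means here: the LM head adds no parameters of its own.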
Total parameter count for the locked config: 103,680 params (~103k), derived in lab 02.
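As a rough check on that figure, the sketch below counts parameters under one bias convention that happens to reproduce 103,680: bias-free attention projections, biased FFN layers, gain and bias in every LayerNorm, no learned positional parameters (RoPE), and an LM head that reuses the embedding matrix. That convention is an assumption made for illustration; lab 02 holds the authoritative derivation. The function name `count_params` is hypothetical, and `n_heads` does not appear because splitting into heads does not change the parameter count.

```python
def count_params(d_model=64, n_layers=2, d_ff=256, vocab_size=64):
    """Parameter inventory under the bias convention assumed in the lead-in."""
    embedding = vocab_size * d_model            # reused by the tied LM head
    attention = 4 * d_model * d_model           # W_q, W_k, W_v, W_o, no biases (assumed)
    ffn = 2 * d_model * d_ff + d_ff + d_model   # two linear layers with biases (assumed)
    block_norms = 2 * (2 * d_model)             # two LayerNorms, gain + bias each
    per_block = attention + ffn + block_norms
    final_norm = 2 * d_model                    # final LN, gain + bias
    total = embedding + n_layers * per_block + final_norm
    return {
        "embedding": embedding,
        "per_block": per_block,
        "final_norm": final_norm,
        "total": total,
    }

inventory = count_params()
print(inventory)
# {'embedding': 4096, 'per_block': 49728, 'final_norm': 128, 'total': 103680}
```

Under these assumptions the two blocks account for 99,456 of the 103,680 parameters, and the FFN is roughly two thirds of each block.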
Files
phase-17-mini-gpt/
├── README.md # this file
├── theory/
│ ├── 00-motivation.md # why glue now, what the residual stream is
│ ├── 01-transformer-block.md # Pre-LN block anatomy, the residual stream
│ ├── 02-ffn-and-activations.md # why FFN exists, GELU, the 4× ratio
│ └── 03-tied-embeddings-and-lm-head.md # tied weights, the final softmax
├── lab/
│ ├── 00-block-by-hand.md # one block forward on a 2-token, d_model=4 toy
│ ├── 01-assemble-mini-gpt.md # full MiniGPT forward on 8-token verb sequence
│ ├── 02-parameter-inventory.md # count params layer-by-layer; match the formula
│ └── 03-causality-perturbation.md # verify causal mask holds end-to-end
├── solutions/ # populated at phase-open; do NOT read first
├── notebooks/
└── diagrams/ # block diagram, parameter stacked-bar, shape trace
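The idea behind the causality perturbation lab can be sketched on a single attention layer: perturb the input at one position and confirm that no output strictly before that position moves. This is an illustrative sketch under assumptions, not the lab's actual procedure; it uses single-head attention without RoPE, and the names `causal_self_attention` and `perturbation_effect` are hypothetical.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where position i only sees positions <= i."""
    t = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def perturbation_effect(forward, x, position, rng):
    """Perturb x[position]; return (max change before position, max change from position on)."""
    x_perturbed = x.copy()
    x_perturbed[position] += rng.normal(size=x.shape[-1])
    delta = np.abs(forward(x_perturbed) - forward(x))
    before = delta[:position].max() if position > 0 else 0.0
    return before, delta[position:].max()

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def forward(inputs):
    return causal_self_attention(inputs, w_q, w_k, w_v)

leak, downstream = perturbation_effect(forward, x, position=5, rng=rng)
print(leak < 1e-9, downstream > 0.0)
```

A causal layer should show no change before the perturbed position and a visible change from that position onward. Passing the full model's forward function instead of one layer's would turn this into an end-to-end check.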
What this phase does NOT cover
- Training. Phase 18. No loss, no optimizer, no gradient step in Phase 17.
- Sampling / generation. Phase 21. Phase 17 produces logits; it does not pick a token.
- KV cache. Phase 22. Every forward in Phase 17 recomputes from scratch.
- Dropout / weight init beyond simple Gaussian. Phase 18.
- PyTorch cross-check. Phase 25 — Phase 17 is pure NumPy + the Phase 8 autograd.
- Mixed precision / fp16 / bf16. Phase 23+. Mini-GPT is fp32 throughout.
- Pre-LN derivation beyond a paragraph. Pre-LN is locked. Post-LN is a footnote.
Phase-open checklist (per CLAUDE.md §1)
- Re-read PHASE_17_PLAN.md §§0–8.
- Re-read LYNX_CORTEX.md §4 / PHASE 17 (lines 472–484) and §A13.
- Confirm Phase 16 lab 03 resolved RoPE vs sinusoidal; Phase 17 inherits that decision.
- Open src/minimodel/BLUEPRINT.md §4 (transformer) and request review before any .py files.
- Read theory/00-motivation.md first; do not skip to lab.
Next: theory/00-motivation.md
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Language Models are Unsupervised Multitask Learners (GPT-2) — Radford et al. · 2019. The decoder-only architecture you assemble.
- 💻 nanoGPT — Karpathy · 2022. A minimal, readable GPT to compare against.