Phase 36: Frontier Architectures
Requires: 35 (Distributed Training & Inference)
Teaches: mixture-of-experts · mamba · state-space-models · speculative-decoding
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
A tour of modern architectures: MoE, MLA, RWKV/Mamba, speculative decoding, "reasoning models". The central question of the whole phase is not "how does it work?" but "is it useful for this task?". Spoiler: for the grammar tutor, almost never.
Goal
Survey the four families of frontier architectures (MoE, MLA, state-space models, speculative decoding + reasoning) deeply enough that Borja can read any current paper or codebase, write the FLOP and memory math on a napkin, and — for each technique — judge whether it would help the grammar tutor.
The judgement is the load-bearing part. Each technique solves a specific bottleneck. The grammar tutor has none of those bottlenecks. Phase 36 turns that mismatch into the lesson: smaller scope is sometimes the right answer, and recognizing that is harder than copying a flashy architecture.
The phase is deliberately concept-heavy and implementation-light (§4 spec). One tiny MoE experiment on local CPU, one pencil-and-paper MLA derivation, one Mamba reading walkthrough, one speculative-decoding survey. Zero cloud cost. Zero production code.
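The napkin math mentioned in the goal can be sketched in a few lines. This is a minimal sketch, assuming the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token (attention-score FLOPs are ignored). The function names and every parameter count below are hypothetical placeholders, not the grammar tutor's real dimensions.

```python
def forward_flops_per_token(active_params: int) -> int:
    """Rule of thumb: ~2 FLOPs (one multiply, one add) per active parameter."""
    return 2 * active_params


def moe_param_counts(shared_params: int, expert_params: int,
                     num_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts for a model whose
    experts each hold `expert_params` and whose router picks `top_k` of them."""
    total = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
total, active = moe_param_counts(shared_params=1_000_000, expert_params=500_000,
                                 num_experts=2, top_k=1)
dense_flops = forward_flops_per_token(total)   # dense model of the same total size
moe_flops = forward_flops_per_token(active)    # MoE pays only for the routed expert

print(total, active)           # 2000000 1500000
print(dense_flops, moe_flops)  # 4000000 3000000
```

The gap between `total` and `active` is the whole point of MoE: parameters grow with the number of experts while per-token compute grows only with `top_k`.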
Read order
- theory/00-motivation.md: why "is this the right tool for my task" is the only question that matters in arch surveys.
- theory/01-moe.md: Mixture of Experts: routing, load balancing, expert parallelism. Bottleneck addressed: parameter count growing faster than compute.
- theory/02-mla.md: Multi-head Latent Attention (DeepSeek): low-rank latent cache. Bottleneck addressed: KV-cache memory at long context.
- theory/03-state-space-models.md: RWKV, Mamba, S4. Selective scan. Hybrids (Jamba). Bottleneck addressed: attention's quadratic time and growing KV cache at very long context.
- theory/04-speculative-and-reasoning.md: Speculative decoding family (vanilla / Medusa / EAGLE / Lookahead) + "reasoning models" / test-time-compute scaling. Bottleneck addressed: decode latency.
- lab/00-moe-on-grammar-tutor.md: train a 2-expert MoE variant locally. Confirm: doesn't help.
- lab/01-mla-math-exercise.md: derive MLA's KV-cache reduction. Confirm: irrelevant at our scale.
- lab/02-mamba-walkthrough.md: annotated reading of mamba-minimal's selective_scan.
- lab/03-speculative-survey.md: one-page survey + recommendation for the grammar tutor.
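The routing and load-balancing ideas named for theory/01-moe.md can be illustrated with a toy top-k router. This is a minimal sketch in plain Python, not the lab's implementation: the function names and the logits are hypothetical, and renormalizing the selected gate weights is one common choice among several.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    denom = sum(exps)
    return [e / denom for e in exps]


def route_top_k(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their weights.

    Returns (expert_index, gate_weight) pairs; the gate weights sum to 1.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def expert_load(batch_logits: list[list[float]], k: int) -> list[int]:
    """Count how many tokens each expert receives. A skewed count is the
    imbalance that load-balancing losses try to discourage."""
    load = [0] * len(batch_logits[0])
    for logits in batch_logits:
        for expert, _ in route_top_k(logits, k):
            load[expert] += 1
    return load


# Hypothetical router logits for 4 tokens and 2 experts.
batch = [[2.0, 0.0], [1.5, 0.5], [0.1, 0.3], [3.0, -1.0]]
print(expert_load(batch, k=1))  # [3, 1]
```

Even in this four-token example one expert receives three quarters of the traffic, which is the kind of skew a 2-expert experiment should log alongside its loss curve.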
solutions/ is empty during pre-write — populated at phase open after Borja's Phase 17 MiniGPT and Phase 18 training loop are in.
Definition of Done
See PHASE_36_PLAN.md §6. Briefly:
- 2-expert MoE runs locally and converges; honest negative-result note committed.
- MLA KV-cache math derived for the grammar tutor's dimensions.
- Mamba selective-scan walkthrough committed (~1 page + line citations).
- Speculative decoding survey committed.
- Architecture decision-tree diagram committed under diagrams/.
- /quiz 36 ≥ 70%.
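The MLA KV-cache item in the list above comes down to comparing two byte counts. The sketch below assumes a standard multi-head cache (one key and one value vector per head, per layer, per token) against an MLA-style cache (one compressed latent plus a small positional key per layer, per token). All dimensions are hypothetical stand-ins, since the grammar tutor's real sizes are not given here, and the function names are illustrative.

```python
def mha_kv_cache_bytes(layers: int, heads: int, head_dim: int,
                       seq_len: int, bytes_per_value: int = 2) -> int:
    """Standard attention: a key and a value vector per head, layer, and token."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value


def latent_kv_cache_bytes(layers: int, latent_dim: int, rope_dim: int,
                          seq_len: int, bytes_per_value: int = 2) -> int:
    """MLA-style cache: one shared latent plus a positional key per layer and token."""
    return layers * (latent_dim + rope_dim) * seq_len * bytes_per_value


# Hypothetical tiny-model dimensions with 2-byte (fp16) values.
mha = mha_kv_cache_bytes(layers=4, heads=4, head_dim=32, seq_len=256)
latent = latent_kv_cache_bytes(layers=4, latent_dim=32, rope_dim=16, seq_len=256)

print(mha // 1024, latent // 1024)  # 512 96  (KiB)
```

Under these assumed dimensions the latent cache is several times smaller, but both figures are well under a mebibyte, which is why the reduction does not matter at this scale.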
What this phase intentionally does NOT cover
- Implementing Mamba or MLA in PyTorch. Read-only. The implementations require GPU-kernel context (Phase 24) and aren't pedagogical at our scale.
- Training a real MoE. Production MoEs run to tens or hundreds of billions of parameters; by comparison, ours is calculator-sized. The 2-expert local experiment is a stub.
- Multi-modal architectures (vision encoders, audio encoders, fusion). §4 mentions them for completeness; they are out of scope here.
- RLHF / DPO / "reasoning RL". Phase 28 already mentioned these as concept-only; not re-introduced.
- MoE serving infrastructure (expert parallelism at scale, all-to-all comm patterns, dropless MoE). Phase 35 territory if Borja revisits.
- Speculative decoding implementation. Survey only. Implementing is a fun side-project but distracts from the phase's purpose.
- 3D parallelism for MoE training. Phase 35 territory.
Phase 36's scope is vocabulary, math, and judgement on frontier architectures, applied to the microscopic grammar tutor. Nothing more.
Further reading
Optional: enrichment, not required to pass the phase.
- 📄 Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023). A leading non-attention sequence architecture.
- 📄 Switch Transformers: Scaling to Trillion Parameter Models (Fedus, Zoph, Shazeer, 2021). Sparse mixture-of-experts at scale.
- 📄 Fast Inference from Transformers via Speculative Decoding (Leviathan, Kalman, Matias, 2022). Draft-and-verify decoding for lower latency.