

Phase 36 — Frontier Architectures

Requires: 35 — Distributed Training & Inference

Teaches: mixture-of-experts · mamba · state-space-models · speculative-decoding

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

A tour of modern architectures: MoE, MLA, RWKV/Mamba, speculative decoding, "reasoning models". The central question of the whole phase is not "how does it work?" but "is it useful for this?". Spoiler: for the grammar tutor, almost never.


Goal

Survey the four families of frontier architectures (MoE, MLA, state-space models, speculative decoding + reasoning) deeply enough that Borja can read any current paper or codebase, write the FLOP and memory math on a napkin, and — for each technique — judge whether it would help the grammar tutor.

The judgement is the load-bearing part. Each technique solves a specific bottleneck. The grammar tutor has none of those bottlenecks. Phase 36 turns that mismatch into the lesson: smaller scope is sometimes the right answer, and recognizing that is harder than copying a flashy architecture.

The phase is deliberately concept-heavy and implementation-light (§4 spec): one tiny MoE experiment on local CPU, one pencil-and-paper MLA derivation, one Mamba reading walkthrough, one speculative-decoding survey. Zero cloud cost. Zero production code.
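
To make the "tiny MoE experiment" concrete, here is a minimal pure-Python sketch of top-1 routing over a handful of experts, plus a Switch-Transformer-style load-balancing term. It is an illustration only, not the lab's solution: the function names (`route_top1`, `load_balance_loss`) are hypothetical, and the choice of that particular auxiliary loss is an assumption about what the lab might use.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token_logits):
    """Top-1 routing (hypothetical helper).

    token_logits: one list of router logits per token, each of length n_experts.
    Returns (assignments, probs): the chosen expert index per token and the
    full router probability distribution per token.
    """
    probs = [softmax(row) for row in token_logits]
    assignments = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return assignments, probs

def load_balance_loss(assignments, probs, n_experts):
    """Switch-style auxiliary term: n_experts * sum_i f_i * p_i.

    f_i is the fraction of tokens dispatched to expert i,
    p_i is the mean router probability assigned to expert i.
    Perfectly uniform routing gives a value of about 1.0.
    """
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    p = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Two tokens, two experts: one balanced batch, one collapsed batch.
balanced = route_top1([[1.0, 0.0], [0.0, 1.0]])
collapsed = route_top1([[5.0, 0.0], [5.0, 0.0]])
print(load_balance_loss(*balanced, n_experts=2))   # close to 1.0
print(load_balance_loss(*collapsed, n_experts=2))  # close to 2.0, the worst case here
```

With only two experts and a small model, the router collapsing onto one expert is the likely failure to watch for, which is why the balance term is worth logging even in a stub experiment.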

Read order

  1. theory/00-motivation.md — why "is this the right tool for my task" is the only question that matters in arch surveys.
  2. theory/01-moe.md — Mixture of Experts: routing, load balancing, expert parallelism. Bottleneck addressed: per-token compute that grows in step with parameter count in dense models.
  3. theory/02-mla.md — Multi-head Latent Attention (DeepSeek): low-rank latent cache. Bottleneck addressed: KV-cache memory at long context.
  4. theory/03-state-space-models.md — RWKV, Mamba, S4. Selective scan. Hybrids (Jamba). Bottleneck addressed: attention's quadratic time and growing KV cache at very long context.
  5. theory/04-speculative-and-reasoning.md — Speculative decoding family (vanilla / Medusa / EAGLE / Lookahead) + "reasoning models" / test-time-compute scaling. Bottleneck addressed: decode latency.
  6. lab/00-moe-on-grammar-tutor.md — train a 2-expert MoE variant locally. Confirm: doesn't help.
  7. lab/01-mla-math-exercise.md — derive MLA's KV-cache reduction. Confirm: irrelevant at our scale.
  8. lab/02-mamba-walkthrough.md — annotated reading of mamba-minimal's selective_scan.
  9. lab/03-speculative-survey.md — one-page survey + recommendation for the grammar tutor.
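
The KV-cache bottleneck named in the list above comes down to napkin arithmetic, sketched below. Standard multi-head attention caches one key and one value vector per head, per layer, per token. An MLA-style cache instead stores one shared latent vector plus a small decoupled RoPE key per layer, per token. Every dimension in the example call is a hypothetical placeholder, not the grammar tutor's real configuration, and the function names are made up for illustration.

```python
def mha_kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache for standard multi-head attention.

    Two tensors (K and V), each n_heads * head_dim wide,
    stored per token and per layer.
    """
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

def mla_kv_cache_bytes(n_layers, latent_dim, rope_dim, seq_len, bytes_per_elem=2):
    """KV cache for an MLA-style layer.

    One compressed latent of width latent_dim plus one decoupled
    RoPE key of width rope_dim, stored per token and per layer.
    """
    return n_layers * (latent_dim + rope_dim) * seq_len * bytes_per_elem

# Hypothetical tiny model: 4 layers, 4 heads of width 32, 256-token context, fp32.
mha = mha_kv_cache_bytes(n_layers=4, n_heads=4, head_dim=32, seq_len=256, bytes_per_elem=4)
mla = mla_kv_cache_bytes(n_layers=4, latent_dim=32, rope_dim=16, seq_len=256, bytes_per_elem=4)
print(mha, mla, mha - mla)  # 1048576 196608 851968
```

At these made-up dimensions the saving is under 1 MiB, which is the shape of the "irrelevant at our scale" conclusion. Both terms grow linearly with layers and sequence length, so the same formulas produce gigabyte-scale differences for large models at long context.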

solutions/ is empty during pre-write — populated at phase open after Borja's Phase 17 MiniGPT and Phase 18 training loop are in.

Definition of Done

See PHASE_36_PLAN.md §6. Briefly:

  • 2-expert MoE runs locally and converges; honest negative-result note committed.
  • MLA KV-cache math derived for the grammar tutor's dimensions.
  • Mamba selective-scan walkthrough committed (~1 page + line citations).
  • Speculative decoding survey committed.
  • Architecture decision-tree diagram committed under diagrams/.
  • /quiz 36 ≥ 70%.
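
As a starting point for the speculative decoding survey item above, the usual back-of-envelope estimate can be sketched in a few lines. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and it follows the standard analysis of vanilla draft-and-verify decoding. The function names and the example numbers are hypothetical.

```python
def expected_tokens_per_target_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model forward pass.

    Assumes i.i.d. acceptance with probability accept_rate for each of
    draft_len drafted tokens; the target always contributes one token.
    Closed form: (1 - a**(g + 1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def expected_speedup(accept_rate, draft_len, cost_ratio):
    """Rough wall-clock speedup over plain autoregressive decoding.

    cost_ratio is the draft model's per-token cost relative to one
    target-model pass (assumed constant).
    """
    tokens = expected_tokens_per_target_pass(accept_rate, draft_len)
    return tokens / (draft_len * cost_ratio + 1.0)

# Hypothetical numbers: half the drafts accepted, 3 drafted tokens,
# draft model 10x cheaper than the target.
print(expected_tokens_per_target_pass(0.5, 3))  # 1.875
print(expected_speedup(0.5, 3, 0.1))
```

The estimate makes the judgement question explicit: the speedup depends on a draft model that is both much cheaper than the target and frequently right, and a target model that is already tiny leaves little room for either.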

What this phase intentionally does NOT cover

  • Implementing Mamba or MLA in PyTorch. Read-only. The implementations require GPU-kernel context (Phase 24) and aren't pedagogical at our scale.
  • Training a real MoE. A "real" MoE is 100B+ parameters; we're a calculator. The 2-expert local experiment is a stub.
  • Multi-modal architectures (vision encoders, audio encoders, fusion). §4 mentions them for completeness; out of scope here.
  • RLHF / DPO / "reasoning RL". Phase 28 already mentioned these as concept-only; not re-introduced.
  • MoE serving infrastructure (expert parallelism at scale, all-to-all comm patterns, dropless MoE). Phase 35 territory if Borja revisits.
  • Speculative decoding implementation. Survey only. Implementing is a fun side-project but distracts from the phase's purpose.
  • 3D parallelism for MoE training. Phase 35 territory.

Phase 36's scope is vocabulary, math, and judgement on frontier architectures, applied to the microscopic grammar tutor. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.