00 - Motivation: techniques exist to solve bottlenecks, not as fashion
The right question when reading a paper on a new architecture is not "what does it do?" but "what bottleneck does it solve, and do I have it?". The entire curriculum is designed so that Borja develops that reflex: when a brilliant technique appears, identify the bottleneck, check whether he has it, decide. For the grammar tutor, the answer is almost always "no".
You finished Phase 35 with vocabulary for distributed training and inference. You can read Megatron and FSDP source. You know when TP, PP, FSDP, and DDP apply.
This phase widens the survey to frontier architectures: MoE, MLA, state-space and other recurrent models (Mamba, RWKV, and hybrids such as Jamba), speculative decoding variants, and "reasoning models." These are the architectures and techniques the field has rallied around in 2023–2026 to push past dense-transformer ceilings.
The naive position on this material is: "learn them all; pick the newest." That position is wrong. Every one of these techniques was designed to relieve a specific bottleneck. Apply one to a task that lacks that bottleneck and the result is net-negative: more code, more failure modes, slower training, no gain.
Phase 36 trains the opposite reflex.
The four bottlenecks and the four families
| Technique | Bottleneck addressed | Mechanism |
|---|---|---|
| MoE (Mixture of Experts) | Parameter count growing faster than compute budget | Sparse routing: top-\(k\) of \(E\) FFN experts per token. Total params ∝ \(E\); FLOPs/token ∝ \(k\). |
| MLA (Multi-head Latent Attention) | KV-cache memory at long context | K and V are cached as a shared low-rank latent and reconstructed on demand. |
| Mamba / RWKV (state-space and linear-recurrent models) | Attention's quadratic time and growing KV cache at very long context | Replace attention with a recurrence. Time linear in sequence length; fixed-size state per layer instead of a growing cache. |
| Speculative decoding | Decode latency (memory-bound, low arithmetic intensity) | A small draft model proposes \(k\) tokens; the target model verifies them in one forward pass and keeps the longest accepted prefix. Roughly 2–5× tokens per target forward pass. |
These four bottlenecks are real and significant at frontier model scale (10B+ parameters, multi-thousand-token context, multi-user inference). The grammar tutor — at ~500k parameters, ~32-token max context, single-user CPU inference — has none of them.
That is the central observation Borja will reach by the end of this phase, on each technique independently. No frontier architecture would meaningfully improve the grammar tutor. Dense, single-layer transformer attention on a 600-form vocabulary is already the right tool.
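The scaling claims in the table can be checked with back-of-envelope arithmetic. The sketch below is illustrative only. The frontier-scale configuration (32 layers, 32 heads of dimension 128, 8192-token context, fp16), the MoE layer sizes, and the tutor's width (128, fp32) are assumed values chosen for the example, not figures from this curriculum; only the tutor's single layer and ~32-token context come from the text above. The function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Bytes needed to cache K and V for one sequence in standard attention."""
    # Factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """(total, active-per-token) FFN weights for one MoE layer.

    Assumes a plain two-matrix FFN per expert; biases and the router are ignored.
    """
    per_expert = 2 * d_model * d_ff  # up-projection + down-projection
    return n_experts * per_expert, top_k * per_expert


# Assumed frontier-scale config (hypothetical, for illustration only).
frontier = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                          seq_len=8192, bytes_per_elem=2)
# Tutor: single layer and 32-token context from the text; the width is assumed.
tutor = kv_cache_bytes(n_layers=1, n_kv_heads=1, head_dim=128,
                       seq_len=32, bytes_per_elem=4)

# Assumed MoE layer: 8 experts, 2 active per token.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)

print(f"frontier KV cache: {frontier / 2**30:.1f} GiB per sequence")
print(f"tutor KV cache:    {tutor / 2**10:.0f} KiB per sequence")
print(f"MoE layer holds {total / active:.0f}x the parameters it uses per token")
```

Under these assumptions the frontier cache is about five orders of magnitude larger than the tutor's, which is the gap the verdicts in this phase rest on.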
Why this isn't deflationary
The conclusion "you don't need fancy architectures here" sounds anti-intellectual: "don't learn things you won't use." That's the wrong reading. The phase produces three pieces of value:
- You can read any current paper or codebase without flinching. Frontier-model context is in every newer paper Borja will pick up. Vocabulary matters even when the techniques don't apply.
- You learn to judge techniques against bottlenecks, not against marketing. This skill generalizes. Six months from now, "speculative decoding via X" lands on Hacker News; Borja's first question is "what's the bottleneck X helps with, and does my workload have it?" — not "should I add X?".
- You see the limits of frontier techniques. MoE has its own failure modes (load imbalance, training instability). MLA saves cache memory but adds projection work to reconstruct K and V, a trade that is not always net-positive. Mamba is competitive only when the selective mechanism works. None of this is in marketing material. The reading exercises in this phase surface the honest tradeoffs.
How the phase is structured
Each of the four families gets:
- A theory file with the math (parameter count, comm cost, memory math).
- A diagram showing how the technique modifies the standard transformer block.
- A short reading exercise on a reference implementation.
- A grammar-tutor-specific judgement: would this help? If no, why not?
The labs are deliberately short. There is no full implementation of MoE training, no full Mamba reproduction. The cost-vs-learning math doesn't support it at our scale. The 2-expert MoE lab (lab 00) is intentionally a stub — enough to see the routing softmax fire, the load-balance loss work, the two experts diverge. Not enough to claim "I've implemented Mixtral."
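As a rough picture of what the stub exercises, here is a minimal sketch of top-1 routing over two experts with an auxiliary load-balance term. It is not the lab's code. The loss shown is the Switch-Transformer-style formulation (number of experts times the sum over experts of dispatch fraction times mean gate probability), which is an assumption about what "load-balance loss" means here, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_top1(router_logits):
    """Top-1 routing: per-token gate probabilities and the chosen expert index."""
    probs = [softmax(row) for row in router_logits]
    chosen = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, chosen


def load_balance_loss(probs, chosen, n_experts):
    """Auxiliary loss E * sum_i f_i * P_i (assumed Switch-style formulation).

    f_i: fraction of tokens dispatched to expert i.
    P_i: mean router probability assigned to expert i.
    The value is 1.0 under uniform routing and grows as routing collapses.
    """
    n_tokens = len(chosen)
    f = [chosen.count(i) / n_tokens for i in range(n_experts)]
    p = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, two experts: tokens alternate between experts.
balanced_logits = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]
# Four tokens that all prefer expert 0.
collapsed_logits = [[2.0, 0.0]] * 4

balanced = load_balance_loss(*route_top1(balanced_logits), n_experts=2)
collapsed = load_balance_loss(*route_top1(collapsed_logits), n_experts=2)
print(f"balanced routing loss:  {balanced:.3f}")
print(f"collapsed routing loss: {collapsed:.3f}")
```

The collapsed case scores higher than the balanced one, which is the signal the auxiliary term feeds back to the router during training.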
The grammar-tutor test
For every technique you read about in this phase, ask:
1. What's its bottleneck? Write it in one sentence.
2. Does the grammar tutor have that bottleneck? Write the FLOP / memory / latency number for the tutor. Compare.
3. What would the technique cost me? New code, new failure modes, new tuning surface.
4. Verdict: apply / defer / never.
If you can't answer (1), you don't understand the technique yet. Re-read the theory file. If you can't answer (2), you haven't done the math for the tutor — go back to Phase 17's parameter count and Phase 22's KV-cache math.
For the grammar tutor, your verdicts will almost always be never. That's the lesson.
How this connects to the wider curriculum
- Phase 17 built MiniGPT-grammar. Phase 36 considers alternative architectures for that same model and concludes the original choice was right.
- Phase 33 built the inference server with continuous batching. Phase 36 asks "would speculative decoding help my server?" and you'll conclude "no — single-user, low-concurrency, low per-token compute makes the draft model overhead net-negative."
- Phase 35 introduced expert parallelism as a vocabulary item; this phase pays it off by showing what MoE is (so expert parallelism has a referent).
- Phase 38 does cost-vs-quality tradeoffs at the capacity-planning layer; Phase 36's "would this help?" judgements feed into those planning decisions.
- Phase 39 capstone: you'll be tempted to add a flashy architecture to "prove" you learned. Phase 36 inoculates you against that temptation.
So Phase 36 is the phase for resisting fashion in architecture choice. Every other phase builds; this one judges.
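The speculative-decoding verdict above can be made concrete with the simple acceptance model commonly used in the speculative decoding literature, which treats each drafted token as accepted independently with a fixed probability. The sketch below is an illustration under that assumption. The acceptance rate and cost ratios are made-up inputs, not measurements from the tutor or the Phase 33 server, and the function names are hypothetical.

```python
def expected_tokens_per_pass(accept_rate, k):
    """Expected tokens produced per target-model forward pass.

    accept_rate: probability a drafted token is accepted (assumed i.i.d.).
    k: number of tokens the draft model proposes per pass.
    """
    if accept_rate == 1.0:
        return k + 1  # every draft accepted, plus the target's own token
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


def expected_speedup(accept_rate, k, cost_ratio):
    """Expected wall-clock speedup over plain autoregressive decoding.

    cost_ratio: time of one draft step divided by time of one target step.
    Each pass costs k draft steps plus one target step.
    """
    return expected_tokens_per_pass(accept_rate, k) / (k * cost_ratio + 1)


# Large target, cheap draft: a draft step costs 5% of a target step (assumed).
big_model = expected_speedup(accept_rate=0.8, k=4, cost_ratio=0.05)
# Tiny target: fixed per-step overhead dominates, so the draft is no cheaper (assumed).
tiny_model = expected_speedup(accept_rate=0.8, k=4, cost_ratio=1.0)

print(f"cheap draft:  {big_model:.2f}x")
print(f"costly draft: {tiny_model:.2f}x")
```

When a draft step costs about as much as a target step, the expected speedup drops below 1 even at a high acceptance rate, which is the net-negative case described for the single-user server.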
One-paragraph recap
Every frontier architecture in this phase was designed to relieve a specific bottleneck — parameter count (MoE), KV-cache memory (MLA), quadratic attention (Mamba), decode latency (speculative decoding). The grammar tutor has none of these bottlenecks. Phase 36's mission is to teach the reflex of mapping technique → bottleneck → "do I have it?" → verdict. Concept-heavy, light implementation. For the grammar tutor, the honest verdict on every family is "doesn't help here" — and learning to say that is harder than learning to copy the technique.
Next: theory/01-moe.md.