Frontier & disruptive concepts
The 42-phase curriculum builds the durable stack — the ideas that will still be true in ten years. This page tracks the moving frontier: techniques that are reshaping the field right now. Each entry is a one-paragraph orientation, not a derivation, with a pointer to where it connects to the curriculum.
How to use this page
Read it once for the map, then return to it when a phase mentions a frontier idea. None of this is required to finish the curriculum — it is the "what's next" beyond Phase 36 and Phase 40.
Architectures beyond attention
- State-space models (Mamba / S4). Sequence models whose cost grows linearly with sequence length instead of quadratically: they carry a fixed-size recurrent state forward (updated by a selective scan in Mamba) instead of computing all-pairs attention. They challenge the transformer on long context. Connects to Phase 14 (recurrence) and Phase 36.
- Mixture-of-Experts (MoE). Replace one big feed-forward block with many "expert" blocks and a router that activates only a few per token — more parameters, roughly constant compute per token. The dominant way to scale capacity cheaply. Connects to Phase 17 and Phase 36.
- Linear & hybrid attention. Kernelized or low-rank approximations of attention (and hybrids that interleave attention with SSM layers) trade a little quality for sub-quadratic cost.
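To make the Mixture-of-Experts entry above concrete, here is a minimal sketch of top-k routing for a single token, in plain Python. Everything named here (`moe_forward`, `router_weights`, the toy experts that just scale their input) is hypothetical and chosen for illustration. Real implementations batch tokens, use learned feed-forward experts, and usually add a load-balancing loss.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token vector through its top_k experts.

    router_weights holds one row per expert; the router logit for an
    expert is the dot product of its row with the token.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the top_k highest-scoring experts, best first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # only top_k experts ever run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Toy setup (assumption): four "experts" that multiply the input by 1, 2, 3, 4.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
toy_router = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

out, used = moe_forward([1.0, 0.0], toy_router, toy_experts, top_k=2)
print(used, out)
```

The model holds four experts' worth of parameters, but each token pays for only two expert evaluations. That is the "more parameters, roughly constant compute per token" trade in miniature.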
Faster, cheaper attention & decoding
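- FlashAttention-2 / 3. Successive rewrites of exact attention that tile the computation so each block of scores and its running softmax statistics stay in fast on-chip SRAM, never writing the full attention matrix to GPU main memory, while overlapping work with the GPU's tensor cores. The result is large speedups with the same math. Connects to Phase 27.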
- Grouped-query, multi-query & latent attention (GQA / MQA / MLA). Share key/value heads across query heads to shrink the KV cache, which is usually the dominant memory cost of decoding at long context. MLA (multi-head latent attention) compresses the cache further by storing a low-dimensional latent instead of full keys and values. Connects to Phase 22 and Phase 27.
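- Speculative & self-speculative decoding (Medusa, EAGLE). A small draft model proposes several tokens and the big model verifies them in one pass, so there are fewer sequential steps; with the standard accept/reject rule the output distribution is unchanged. Self-speculative variants such as Medusa and EAGLE replace the separate draft model with lightweight extra heads on the big model itself. Connects to Phase 21 and Phase 36.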
- KV-cache quantization & compression. Store the KV cache in INT8/INT4 or evict low-attention tokens to fit longer context in the same memory.
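The KV-cache savings from GQA and MQA described above follow from simple arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 32 query heads, head dimension 128, FP16 cache, 4096-token context). None of these numbers come from a specific model, and `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for one decoding session.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 1024 ** 3

# Hypothetical shape: only the number of KV heads changes between variants.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(n_layers=32, n_kv_heads=kv_heads,
                          head_dim=128, seq_len=4096)
    print(f"{name}: {size / GIB:.4f} GiB")
# prints:
# MHA: 2.0000 GiB
# GQA: 0.5000 GiB
# MQA: 0.0625 GiB
```

Setting `bytes_per_elem=1` gives a rough view of what INT8 KV-cache quantization buys on top of head sharing, ignoring the small overhead of the stored scales.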
Context & position
- RoPE scaling (YaRN, NTK-aware, position interpolation). Cheap tricks that stretch a model trained at 4k tokens to 32k–128k by reshaping the rotary frequencies. Connects to Phase 16.
- Ring & context parallelism. Shard a single very long sequence across GPUs so attention can span millions of tokens. Connects to Phase 35.
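As a rough illustration of the RoPE scaling entry above, the sketch below shows only the simplest variant, position interpolation: positions are multiplied by `train_len / target_len` so that a longer sequence maps back into the range of rotation angles seen during training. NTK-aware scaling and YaRN instead adjust the frequencies non-uniformly and are not shown. The function name `rope_angles` and the lengths used are assumptions for the example.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at one position.

    Standard RoPE uses angle_i = position * base**(-2i / head_dim).
    With scale < 1, positions are compressed (position interpolation).
    """
    scaled_position = position * scale
    return [scaled_position * base ** (-2.0 * i / head_dim)
            for i in range(head_dim // 2)]

train_len, target_len = 4096, 16384   # hypothetical context lengths
scale = train_len / target_len        # 0.25

# The last position of the stretched context lands inside the trained range.
last = rope_angles(target_len - 1, head_dim=64, scale=scale)
limit = rope_angles(train_len, head_dim=64)
print(all(a < b for a, b in zip(last, limit)))
```

The cost of this trick is resolution: neighboring tokens now differ by a quarter of the original rotation step, which is one reason the more refined schemes treat high and low frequencies differently.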
Training & precision
- FP8 and microscaling (MXFP) training. Recent accelerators can train in 8-bit floats with per-block scales, offering up to roughly double the peak matrix-multiply throughput of BF16. Connects to Phase 02 and Phase 26.
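- Muon / Shampoo (matrix-aware optimizers). Optimizers that precondition gradients using the matrix structure of each weight (Shampoo approximates second-order information; Muon orthogonalizes the update), and that are reported to converge faster than Adam in several large-model settings. Connects to Phase 04.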
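The "per-block scales" idea behind microscaling can be sketched without any special hardware. The code below gives each block of 32 values its own power-of-two scale, then rounds the scaled values to a small integer grid. That integer grid is a stand-in chosen for simplicity, not a real FP8 element format, and `quantize_blocks` / `dequantize_blocks` are hypothetical names.

```python
import math

def quantize_blocks(values, block_size=32, qmax=127):
    """Quantize a flat list block by block, one power-of-two scale per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Smallest power of two that keeps max_abs / scale within qmax.
        exponent = math.ceil(math.log2(max_abs / qmax)) if max_abs > 0 else 0
        scale = 2.0 ** exponent
        # Round to the integer grid; clamp guards against edge-case overflow.
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in block]
        blocks.append((scale, codes))
    return blocks

def dequantize_blocks(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]

# One block of tiny values followed by one block of huge values.
data = [0.001 * i for i in range(32)] + [100.0 * i for i in range(32)]
restored = dequantize_blocks(quantize_blocks(data))
small_block_error = max(abs(a - b) for a, b in zip(data[:32], restored[:32]))
print(small_block_error)
```

With a single scale for the whole tensor, the large block would force a coarse grid and the small values would all round to zero. Per-block scales let each block keep resolution appropriate to its own magnitude.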
Alignment & reasoning
- DPO and its successors (IPO, KTO, ORPO). Preference alignment without a separate reward model or PPO loop — a classification-style loss directly on preferred/rejected pairs. Connects to Extension X3.
- Test-time compute / reasoning models. Models trained to spend more tokens "thinking" (long chains of thought, search, self-verification) before answering — trading inference cost for accuracy on hard problems. Connects to Phase 32.
- RLAIF & Constitutional AI. Replace human preference labels with feedback from another model guided by a written constitution. Connects to Extension X3.
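The "classification-style loss" in the DPO entry above can be written down for a single preference pair. This is a minimal sketch of the original DPO objective only (not IPO, KTO, or ORPO), assuming the four inputs are summed log-probabilities of the full responses under the trained policy and under a frozen reference model. The function name and the example numbers are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * (chosen_margin - rejected_margin))),
    where each margin is how far the policy has moved from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2), a coin flip.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has raised the chosen response and lowered the rejected one.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

There is no reward model and no sampling loop here: the loss is a logistic regression on the difference of log-probability shifts, which is why it can be trained like an ordinary supervised objective.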
How this maps to the curriculum
Everything above is an optimization of, or successor to, a primitive you build by hand in the core phases. That is the point: once you have derived attention, the KV cache, sampling, and quantization from scratch, each frontier entry reads as a known quantity with one new idea — not as magic.