Frontier & disruptive concepts

The 42-phase curriculum builds the durable stack — the ideas that will still be true in ten years. This page tracks the moving frontier: techniques that are reshaping the field right now. Each entry is a one-paragraph orientation, not a derivation, with a pointer to where it connects to the curriculum.

How to use this page

Read it once for the map, then return to it when a phase mentions a frontier idea. None of this is required to finish the curriculum — it is the "what's next" beyond Phase 36 and Phase 40.

Architectures beyond attention

  • State-space models (Mamba / S4). Sequence models whose cost grows linearly with length instead of quadratically, because they carry a fixed-size recurrent state (updated by a selective scan in Mamba) instead of computing all-pairs attention. They challenge the transformer on long context. Connects to Phase 14 (recurrence) and Phase 36.
  • Mixture-of-Experts (MoE). Replace one big feed-forward block with many "expert" blocks and a router that activates only a few per token — more parameters, roughly constant compute per token. The dominant way to scale capacity cheaply. Connects to Phase 17 and Phase 36.
  • Linear & hybrid attention. Kernelized or low-rank approximations of attention (and hybrids that interleave attention with SSM layers) trade a little quality for sub-quadratic cost.
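
The MoE routing step described above can be sketched in a few lines. This is a minimal illustration for a single token, not any particular model's router: the four scalar "experts", the router scores, and the name `moe_forward` are all hypothetical, and renormalizing the gate weights over the chosen experts is one common convention among several.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k=2):
    """Route one token through its top_k experts and mix their outputs."""
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    norm = sum(probs[i] for i in chosen)
    output = sum((probs[i] / norm) * experts[i](x) for i in chosen)
    return output, chosen

# Four toy experts: each scalar function stands in for a feed-forward block.
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

y, chosen = moe_forward(3.0, router_logits, experts, top_k=2)
print(chosen, y)  # experts 1 and 3 run; experts 0 and 2 cost nothing for this token
```

Only `top_k` of the expert functions are ever called, which is why adding more experts grows the parameter count without growing per-token compute much.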

Faster, cheaper attention & decoding

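  • FlashAttention-2 / 3. Successive rewrites of exact attention that compute the softmax tile by tile in fast on-chip SRAM, so the full attention matrix is never written to GPU main memory, and that overlap work with the GPU's tensor cores. Large speedups with mathematically identical output. Connects to Phase 27.
  • Grouped-query & multi-query attention (GQA / MQA / MLA). Share key/value heads across query heads to shrink the KV cache, typically the biggest lever on decode memory at long context. MLA (multi-head latent attention) takes a different route: it caches a low-rank latent from which keys and values are reconstructed, compressing the cache further. Connects to Phase 22 and Phase 27.
  • Speculative & self-speculative decoding (Medusa, EAGLE). A small draft model proposes several tokens; the big model verifies them in one pass, giving fewer sequential steps and, with a rejection-sampling acceptance rule, the same output distribution. Self-speculative variants such as Medusa and EAGLE replace the separate draft model with lightweight prediction heads attached to the big model. Connects to Phase 21 and Phase 36.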
  • KV-cache quantization & compression. Store the KV cache in INT8/INT4 or evict low-attention tokens to fit longer context in the same memory.
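
To see why sharing key/value heads and quantizing the cache matter, it helps to run the arithmetic. The sketch below assumes a hypothetical decoder with 32 layers, 32 query heads, a head dimension of 128, and a 4,096-token context; these numbers are illustrative and not taken from any specific model. The function name `kv_cache_bytes` is likewise made up for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache: one key and one value vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model shape (assumption for illustration only).
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)                         # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)                          # 4 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)                          # all query heads share one KV head
gqa_int8 = kv_cache_bytes(n_kv_heads=8, bytes_per_elem=1, **cfg)   # GQA plus an INT8 cache

MIB = 2 ** 20
print(mha // MIB, gqa // MIB, mqa // MIB, gqa_int8 // MIB)  # 2048 512 64 256
```

The cache scales linearly in every factor, so halving the bytes per element or the number of KV heads each halves the memory, and the savings compound.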

Context & position

  • RoPE scaling (YaRN, NTK-aware, position interpolation). Cheap tricks that stretch a model trained at 4k tokens to 32k–128k by reshaping the rotary frequencies. Connects to Phase 16.
  • Ring & context parallelism. Shard a single very long sequence across GPUs so attention can span millions of tokens. Connects to Phase 35.
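
The RoPE scaling idea above can be made concrete with the rotation angles themselves. This sketch assumes the standard rotary formulation, where dimension pair `i` rotates by `position * base ** (-2 * i / head_dim)`, and a hypothetical stretch from 4,096 to 32,768 tokens. Position interpolation divides positions by the stretch factor; the NTK-aware variant shown here uses the commonly quoted rule `base * scale ** (head_dim / (head_dim - 2))`. YaRN adds further per-frequency adjustments that are not shown. The function names are invented for this example.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 applies position interpolation."""
    pos = position / scale
    return [pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware scaling: enlarge the base instead of shrinking the positions."""
    return base * scale ** (head_dim / (head_dim - 2))

head_dim = 64
trained_len, target_len = 4096, 32768   # hypothetical context lengths
scale = target_len / trained_len        # 8.0

# Position interpolation: the last target position lands on angles seen in training.
interpolated = rope_angles(target_len, head_dim, scale=scale)
seen_in_training = rope_angles(trained_len, head_dim)

# NTK-aware: the highest frequency is untouched, the lowest is slowed by exactly `scale`.
new_base = ntk_scaled_base(10000.0, scale, head_dim)
ntk = rope_angles(1, head_dim, base=new_base)
plain = rope_angles(1, head_dim)
print(ntk[0] / plain[0], plain[-1] / ntk[-1])
```

Position interpolation compresses every frequency equally, while the NTK-aware rule leaves fine-grained (high-frequency) position information intact and stretches only the slow dimensions.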

Training & precision

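  • FP8 and microscaling (MXFP) training. Recent accelerators run matrix multiplies in 8-bit floats, using per-tensor or per-block scales to preserve dynamic range (microscaling shares one scale across a small block of values). Peak matmul throughput is roughly double that of BF16; end-to-end training gains are usually smaller. Connects to Phase 02 and Phase 26.
  • Muon / Shampoo (matrix-aware optimizers). Optimizers that treat each weight as a matrix rather than a flat vector: Shampoo preconditions gradients with per-dimension second-moment statistics, and Muon approximately orthogonalizes the momentum update. Both are reported to converge in fewer steps than Adam in some large-model runs. Connects to Phase 04.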
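
The benefit of per-block scales can be shown with a small simulation. This is a sketch under loud assumptions: a signed integer grid stands in for the real FP8 element format, and the block size of 32 with power-of-two scales is chosen to resemble microscaling formats rather than to reproduce any specification. The helper names are hypothetical.

```python
import math

def fake_quantize(values, block_size, levels=127):
    """Round values to a signed integer grid with one power-of-two scale per block,
    then map back to floats so the rounding error can be inspected."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        # Smallest power-of-two scale that keeps the block's largest value on the grid.
        scale = 2.0 ** math.ceil(math.log2(amax / levels))
        for v in block:
            q = max(-levels, min(levels, round(v / scale)))
            out.append(q * scale)
    return out

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# One block of tiny values next to one block of large values (made-up data).
small = [0.001 * (i + 1) for i in range(32)]
large = [100.0 - i for i in range(32)]
data = small + large

err_tensor = max_abs_error(data, fake_quantize(data, block_size=len(data)))
err_block = max_abs_error(data, fake_quantize(data, block_size=32))
print(err_tensor, err_block)
```

With a single scale for the whole tensor, the large values set the grid spacing and the small block rounds to zero; with one scale per block, each block gets a grid matched to its own range.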

Alignment & reasoning

  • DPO and its successors (IPO, KTO, ORPO). Preference alignment without a separate reward model or PPO loop: a classification-style loss applied directly to preferred/rejected pairs, usually regularized against a frozen reference model. Connects to Extension X3.
  • Test-time compute / reasoning models. Models trained to spend more tokens "thinking" (long chains of thought, search, self-verification) before answering — trading inference cost for accuracy on hard problems. Connects to Phase 32.
  • RLAIF & Constitutional AI. Replace human preference labels with feedback from another model guided by a written constitution. Connects to Extension X3.
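
The "classification-style loss" in the DPO entry can be written out for a single preference pair. This sketch assumes the standard DPO objective, a negative log-sigmoid of the scaled difference between policy and reference log-probability ratios. The inputs are summed sequence log-probabilities, the numbers are invented, and `beta=0.1` is only an illustrative default.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each response."""
    chosen_logratio = policy_chosen - ref_chosen        # how much the policy upweights the preferred response
    rejected_logratio = policy_rejected - ref_rejected  # same for the rejected response
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# Policy identical to the reference: the loss is log(2), as for a coin flip.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved toward the preferred response and away from the rejected one.
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(at_start, after_update)
```

The loss falls only when the policy widens the gap between preferred and rejected responses relative to the reference, which is what ties the update to the frozen model without a separate reward network.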

How this maps to the curriculum

Everything above is an optimization of, or successor to, a primitive you build by hand in the core phases. That is the point: once you have derived attention, the KV cache, sampling, and quantization from scratch, each frontier entry reads as a known quantity with one new idea — not as magic.