04 — Speculative decoding and "reasoning" models
Two techniques that change how the model is used at inference time rather than its architecture: speculative decoding (a small model proposes, the large one verifies; decode speeds up 2–5×) and reasoning models (more test-time compute, via chains of thought trained with RL). Each relieves its own specific bottleneck; neither is relevant to the grammar tutor.
The first three theory files covered architectural changes — MoE, MLA, Mamba. This file covers two non-architectural techniques that often appear in the same "frontier" conversations:
- Speculative decoding — a decoding trick. Same model. Multiplies tokens-per-forward-pass by 2–5× at the cost of complexity.
- Reasoning models — a training and inference-time-compute trick. Different model behavior (long chain-of-thought before answering). Multiplies quality on reasoning tasks at the cost of latency and tokens-spent.
Both are recently popular. Both have specific bottlenecks they relieve. Neither helps the grammar tutor. We'll cover them quickly.
Speculative decoding
The idea
Decode is memory-bound (Phase 21, Phase 22). Each token requires loading all the weights from HBM into the SMs. The arithmetic intensity is low; the GPU's tensor cores idle most of the time.
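To make the memory-bound claim concrete, here is a back-of-envelope sketch. The model size and hardware figures are illustrative assumptions, not measurements from this course or the specifications of any particular GPU, and the function name `decode_roofline` is hypothetical.

```python
def decode_roofline(n_params, bytes_per_param, batch_size,
                    mem_bandwidth_bytes_s, peak_flops):
    """Rough per-step bounds for one decode step of a dense model.

    Assumes every weight is read once per step and contributes one
    multiply-add (2 FLOPs) per sequence in the batch. KV-cache traffic
    and attention FLOPs are ignored to keep the arithmetic simple.
    """
    bytes_moved = n_params * bytes_per_param
    flops = 2 * n_params * batch_size
    return {
        "intensity_flops_per_byte": flops / bytes_moved,
        "memory_time_s": bytes_moved / mem_bandwidth_bytes_s,
        "compute_time_s": flops / peak_flops,
    }


# Assumed numbers: 7e9 parameters in fp16, 1 TB/s bandwidth, 100 TFLOP/s peak.
single = decode_roofline(7e9, 2, 1, 1e12, 100e12)
print(single)
```

Under these assumptions the arithmetic intensity at batch size 1 is about 1 FLOP per byte, and the time to stream the weights is roughly 100 times the time to do the math. Scoring several positions in one pass raises the FLOPs but leaves the weight traffic unchanged, which is the slack that speculative decoding exploits.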
If we could get multiple tokens per forward pass, the same memory traffic would yield more useful work. Speculative decoding does exactly this:
- A draft model (much smaller, faster) proposes the next \(k\) tokens autoregressively: \(\hat{x}_{t+1}, \ldots, \hat{x}_{t+k}\).
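- The target (real) model runs ONE forward pass on the prefix plus the proposed sequence, producing distributions for positions \(t+1, \ldots, t+k+1\) in parallel.
- We verify in order: for each \(i\), compare the target's distribution at position \(t+i\) to the draft's at the same position. Accept \(\hat{x}_{t+i}\) with probability \(\min(1, p_{\text{target}}(\hat{x}_{t+i}) / p_{\text{draft}}(\hat{x}_{t+i}))\). Stop at the first rejection and resample that position from the normalized residual \(\max(0, p_{\text{target}} - p_{\text{draft}})\). If all \(k\) proposals are accepted, sample one extra token from the target's distribution at position \(t+k+1\).
- The emitted tokens are distributed exactly as if the target had sampled them on its own (mathematically, this is a modified rejection-sampling scheme, not importance sampling and not an approximation). Quality is identical to standalone target inference; the cost of each target forward pass is amortized over several accepted tokens.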
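The verification step can be sketched as follows. This is a minimal illustration on toy probability lists, not an implementation tied to any inference library; the function names are hypothetical. It assumes the standard rule of resampling from the normalized residual after a rejection and drawing one bonus token from the target when every proposal is accepted.

```python
import random


def residual_distribution(p_target, p_draft):
    """Normalized max(0, p_target - p_draft), used to resample after a rejection."""
    raw = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    total = sum(raw)
    if total == 0.0:
        # Identical distributions: a rejection cannot occur, but stay safe.
        return list(p_target)
    return [r / total for r in raw]


def sample(dist, rng):
    """Draw one token id from a probability list."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """Return the tokens emitted by one speculative step.

    draft_tokens: k proposed token ids.
    draft_dists:  the k distributions the draft sampled them from.
    target_dists: k + 1 distributions from the target's single forward pass.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p = target_dists[i][tok]
        q = draft_dists[i][tok]  # > 0 because tok was sampled from the draft
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)  # accepted: keep the draft's token
        else:
            # First rejection: resample this position, discard the rest.
            fix = residual_distribution(target_dists[i], draft_dists[i])
            emitted.append(sample(fix, rng))
            return emitted
    # Every proposal accepted: the target's extra distribution gives a bonus token.
    emitted.append(sample(target_dists[len(draft_tokens)], rng))
    return emitted
```

Each call emits between 1 and \(k+1\) tokens, so a speculative step never produces fewer tokens than a plain target step. In a real model the target's distribution at each position depends on the tokens before it, which is why everything after the first rejection must be thrown away.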
The expected number of tokens per target forward pass depends on how often the target accepts the draft's proposals. Reported speedups are typically 2–5×: higher when the draft matches the target closely (a small task domain, like grammar), lower on adversarial or out-of-distribution prompts.
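A rough way to reason about these numbers, under the simplifying assumption that each drafted token is accepted independently with the same probability `alpha` (the assumption used in the analysis by Leviathan et al.), is sketched below. The parameter `draft_cost_ratio`, the cost of one draft step relative to one target step, is a name chosen here for illustration.

```python
def expected_tokens_per_target_pass(alpha, k):
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Equals 1 + alpha + alpha^2 + ... + alpha^k under the i.i.d. acceptance
    assumption, i.e. (1 - alpha^(k+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, draft_cost_ratio):
    """Expected wall-clock speedup over plain decoding.

    One speculative step costs k draft passes plus one target pass,
    measured in units of a single target pass.
    """
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_target_pass(alpha, k) / step_cost


print(expected_tokens_per_target_pass(0.8, 4))
print(expected_speedup(0.8, 4, 0.05))
print(expected_speedup(0.8, 4, 0.8))
```

With an 80% acceptance rate and four drafted tokens, a draft that costs 5% of the target gives a speedup of roughly 2.8×, while a draft that costs 80% of the target gives a value below 1, meaning speculative decoding would be slower than decoding normally.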
The family
- Vanilla speculative decoding (Leviathan et al., 2023): draft + target, as above.
- Medusa (Cai et al., 2024): instead of a separate draft, attach multiple "draft heads" to the target, so that different output heads predict \(t+1, t+2, t+3\). The same forward pass produces \(k\) proposals. This avoids hosting a second model, although the extra heads still have to be trained.
- EAGLE (Li et al., 2024): better draft architecture, learns from the target's hidden states. Higher acceptance rate.
- Lookahead decoding (Fu et al., 2024): no separate draft. The target collects candidate \(n\)-grams from its own parallel (Jacobi-style) decoding steps and then verifies them. Self-speculative.
All achieve the same goal — more tokens per forward — with different complexity/quality tradeoffs.
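As a toy illustration of the Medusa-style idea of several heads reading one hidden state, consider the sketch below. It is a deliberate simplification under stated assumptions: the heads here are plain linear maps over a tiny hand-written hidden vector, whereas the published method uses small trained feed-forward heads and verifies a tree of top candidates. All names are hypothetical.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def multi_head_proposals(hidden, lm_head, extra_heads):
    """One hidden state in, one next token plus len(extra_heads) drafts out.

    lm_head predicts position t+1 as usual; extra_heads[i] guesses
    position t+2+i from the same hidden state, so no extra forward
    pass through the transformer body is needed to draft.
    """
    next_token = argmax(matvec(lm_head, hidden))
    drafts = [argmax(matvec(head, hidden)) for head in extra_heads]
    return next_token, drafts


# Toy setup: hidden size 2, vocabulary size 3.
hidden = [1.0, 0.0]
lm_head = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
extra_heads = [
    [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]],
]
print(multi_head_proposals(hidden, lm_head, extra_heads))
```

The drafted tokens still have to be checked by the target on the next forward pass; the heads only change where the proposals come from.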
Would speculative decoding help the grammar tutor?
Test:
- Bottleneck: decode latency on a memory-bound, single-user, low-batch inference workload.
- Does the grammar tutor have it? Only nominally. The grammar tutor is single-user and small-batch, but per-token decode time on local CPU is already ~3 ms. The corpus, and therefore the model, is so tiny that any useful "draft" model would be barely smaller than the target.
- What would speculative decoding cost? A separate draft model (e.g., the Phase 14 n-gram baseline) plus the verification logic. Maintenance burden. New failure modes (acceptance-rate drift).
- Verdict: Defer. Not "never"; it's plausible. But the gain at our scale (3 ms → 1 ms per token) is marginal relative to the implementation complexity.
A counterfactual
If the grammar tutor served 10,000 concurrent users via the Phase 33 server with continuous batching, each user's queries were small (short outputs), and the GPU were still memory-bound, then yes, speculative decoding would help. The n-gram baseline from Phase 14 would be a plausible draft for short, narrow-task generations and could yield 3–4× tokens/sec with no change in output quality, though not for free: the maintenance costs listed above still apply.
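A minimal sketch of what an n-gram draft could look like is shown below. It is a toy bigram model written for illustration; the actual interface of the Phase 14 baseline is not shown in this file, so every name here is hypothetical. Because the draft is greedy, it pairs naturally with greedy verification: keep proposals for as long as they equal the target's own argmax at each position.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count which token follows which in a whitespace-tokenized corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def propose(counts, last_token, k):
    """Greedy draft: follow the most frequent continuation up to k times."""
    proposal = []
    for _ in range(k):
        followers = counts.get(last_token)
        if not followers:
            break  # unseen context: stop drafting early
        last_token = followers.most_common(1)[0][0]
        proposal.append(last_token)
    return proposal


corpus = ["he went to school", "she went to work", "he went to school early"]
model = train_bigram(corpus)
print(propose(model, "he", 3))  # ['went', 'to', 'school']
```

A table lookup like this costs almost nothing compared with a transformer forward pass, which is what makes an n-gram draft attractive when the task is narrow enough for its guesses to be accepted often.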
But that's a Phase 38 capacity-planning consideration, not a frontier-architecture concern.
Reasoning models
The idea
Standard LLMs generate tokens autoregressively. The amount of compute spent per answer is roughly proportional to the length of the answer.
Reasoning models (e.g., OpenAI o1, DeepSeek R1, Claude with extended thinking) generate a long chain of internal thinking tokens before producing the final answer. The thinking is much longer than the answer; the model trades latency for quality. Critically, the model is trained (via RL or similar) to produce useful thinking, not just any tokens.
The "test-time compute scaling" claim: on hard reasoning tasks (math, code, multi-step planning), doubling the thinking-token budget can improve performance more than doubling the model size would. The size of the effect depends on the task and the model, but the consequence is the same: compute spent at inference is now a quality knob, not just a latency cost.
Why this isn't an architecture change
The architecture is still a dense transformer. The change is in training data and reward (RL on reasoning traces) and inference policy (allow long internal generation before answer). You don't "add reasoning" by changing the model; you train it that way.
Would reasoning help the grammar tutor?
Test:
- Bottleneck: complex multi-step problems where one-shot generation underperforms.
- Does the grammar tutor have it? No. Conjugation correction is effectively one-step. "He goed to school" → "He went to school" requires the model to: identify the verb (one lookup), identify the tense (one lookup), find the irregular past form (one lookup). A well-trained dense model gets this in one shot. Chain-of-thought adds nothing.
- What would adding reasoning cost? RL training data (we have a deterministic corpus, not graded reasoning traces). New training pipeline. Higher inference cost (more tokens per answer).
- Verdict: Never. The grammar tutor's task is precisely the kind that is not a reasoning task. Adding chain-of-thought would slow it down with no quality gain.
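The inference-cost side of this verdict is simple arithmetic. The sketch below uses the ~3 ms per token figure quoted earlier for local CPU decode; the token counts are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def answer_latency_ms(answer_tokens, thinking_tokens=0, ms_per_token=3.0):
    """Latency of one reply when every generated token costs the same.

    Thinking tokens are generated autoregressively just like answer
    tokens, so they add to latency even though the user never sees them.
    """
    return (answer_tokens + thinking_tokens) * ms_per_token


direct = answer_latency_ms(answer_tokens=8)
with_thinking = answer_latency_ms(answer_tokens=8, thinking_tokens=2000)
print(direct, with_thinking)  # 24.0 6024.0
```

Under these assumptions a short correction goes from about 24 ms to about 6 seconds, a slowdown of more than two orders of magnitude for a task where one-shot generation already works.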
The honest caveat
A grammar-tutor agent (Phase 32) that handles ambiguous cases could benefit from some additional reasoning. Deciding whether "fewer" or "less" is correct, for example, requires considering context. But that's a few tokens of "let me check the context...", not the multi-thousand-token chains-of-thought that frontier reasoning models produce. The right framework is the simple constrained-decoding flow from Phase 30 + Phase 32, not o1-style RL-trained reasoning.
Putting it together: the architecture decision tree
Phase 36 ends with a decision tree (committed as a mermaid diagram in diagrams/):
Start: am I dissatisfied with my current model's behavior?
├── Yes, it's too big to fit on my GPUs
│   ├── Inference: TP / FSDP / MLA (for KV cache)
│   └── Training: ZeRO-3 / FSDP / TP
├── Yes, training is too slow at fixed capacity
│   ├── Add more GPUs: DDP
│   └── Add more compute per parameter: dense scale-up
├── Yes, my model lacks capacity at fixed compute budget
│   └── MoE (if you can afford the routing complexity)
├── Yes, my context is so long that attention explodes
│   ├── O(10k tokens): RoPE + chunking (Phase 16)
│   ├── O(100k tokens): MLA + GQA
│   └── O(1M+ tokens): Mamba (hybrid with attention layers)
├── Yes, my decode latency is too high
│   └── Speculative decoding (if you have a good draft model)
├── Yes, my model fails on multi-step reasoning
│   └── Train with chain-of-thought RL (reasoning models)
└── No, my model is fine
    └── Don't change anything. <-- GRAMMAR TUTOR LIVES HERE
The tree's load-bearing leaf for the grammar tutor is the last one. Phase 36 is the phase where Borja learns to confidently reach that leaf.
What this phase does NOT cover
- Implementing speculative decoding. Survey only. Phase 30 (structured generation) could host an experiment; Phase 36 doesn't.
- RL training for reasoning models. Far out of scope — RLHF/RLAIF was already pushed out of Phase 28 as concept-only.
- Comparing reasoning models (o1 vs R1 vs DeepThink). That is vendor benchmarking, not curriculum material.
- Self-consistency, tree-of-thought, and other test-time-compute variants. Mentioned in passing; not derived.
- Multi-modal architectures (vision-language, audio). Out of scope; §4 explicitly defers.
Next: lab/00-moe-on-grammar-tutor.md.