Lab 03 — Speculative decoding family survey + grammar-tutor verdict

Goal: read one paper or blog post for each member of the speculative-decoding family (vanilla, Medusa, EAGLE, Lookahead). Write a one-page comparison + verdict for the grammar tutor.

Estimated time: 2–3 hours.

Prereq: theory/04-speculative-and-reasoning.md read. Phase 21 (sampling) understood — speculative decoding modifies sampling, so the baseline must be solid.


What you produce

A directory experiments/36-speculative-survey/ containing:

  • summary.md — one-page comparison matrix of vanilla / Medusa / EAGLE / Lookahead.
  • decision-tree.md — small mermaid flowchart: "if your bottleneck is X, use Y."
  • grammar-tutor-verdict.md — short report: which (if any) of these would help the grammar tutor at Phase 33 serving scale.

TODOs

Block A — read the references

Skim, not deep-read. Each of the four references should take ~30 min:

  • Vanilla speculative decoding — Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023). Focus on Algorithm 1 (the accept/reject rule).
  • Medusa — Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" (2024). Focus on the multi-head architecture.
  • EAGLE / EAGLE-2 — Li et al. (2024). Focus on how it uses the target model's hidden states to inform the draft.
  • Lookahead decoding — Fu et al., "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding" (2024). Focus on the n-gram pool mechanism.

Capture the authors and date for each citation in summary.md, plus the commit SHA for any code repository you cite.
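To anchor your reading of Algorithm 1, here is a minimal sketch of the accept/reject rule for a single drafted token. It is illustration only, not a deliverable: the function name `speculative_step` and the list-based distributions are assumptions made for this handout, not code from the paper. A drafted token x is accepted with probability min(1, p_T(x) / p_D(x)); on rejection, a replacement is sampled from the normalized residual max(0, p_T - p_D).

```python
import random


def speculative_step(p_target, p_draft, draft_token, rng=random):
    """Accept or reject one drafted token (hypothetical helper).

    p_target, p_draft: probability lists over the same vocabulary.
    draft_token: index that was sampled from p_draft.
    Returns (token, accepted).
    """
    # draft_token was sampled from p_draft, so p_draft[draft_token] > 0.
    ratio = p_target[draft_token] / p_draft[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True

    # Rejected: resample from the normalized residual max(0, p_T - p_D).
    # A rejection implies p_T(x) < p_D(x) for the drafted x, so the two
    # distributions differ and the residual has positive mass.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False
```

The residual resampling is what keeps the output distributed exactly as the target model would have sampled it, which is why the technique is described as lossless. Check that claim against the paper's correctness argument before you cite it in summary.md.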

Block B — the comparison matrix

In summary.md, fill in this table:

| Family | Drafted by | Verified by | Typical speedup | Implementation complexity | Best when... |
|---|---|---|---|---|---|
| Vanilla | Separate small draft model | Target model | 2-5× | Medium | The draft model's distribution is close to the target's |
| Medusa | Multiple draft heads on the target | Target itself | 2-3× | Medium-high (heads must be trained) | Compute headroom is limited and there is no separate draft model |
| EAGLE | Draft informed by the target's hidden states | Target | 3-4× | High | Maximum speedup is the goal |
| Lookahead | N-gram pool built from the model's own decoding history | Target | 1.5-2× | Low | There is no separate draft model and an easy retrofit is wanted |

(These are typical published speedups, not acceptance rates; verify each against the references.)

Then write 2-3 sentences per family explaining the mechanism in plain English (not "their algorithm X does Y" — actual mechanism).

Block C — the decision tree

decision-tree.md — a mermaid flowchart for picking a speculative-decoding strategy:

flowchart TD
    start[I want faster decode]
    start --> q1{Do I have a smaller model<br/>close to the target?}
    q1 -->|yes| vanilla[Vanilla speculative decoding]
    q1 -->|no| q2{Can I afford to retrain<br/>or fine-tune the target?}
    q2 -->|yes| q3{Medusa or EAGLE?}
    q3 -->|low overhead| medusa[Medusa]
    q3 -->|max gain| eagle[EAGLE]
    q2 -->|no| lookahead[Lookahead decoding]

Then add a "for the grammar tutor" branch:

flowchart TD
    gt[Grammar tutor decoding]
    gt --> q1{Is decode the bottleneck?}
    q1 -->|no, batch=1 CPU is fast enough| nochange[No change needed]
    q1 -->|yes, hypothetically| q2{What's the draft model?}
    q2 -->|n-gram from Phase 14| vanilla_gt[Vanilla spec. dec. with n-gram draft]
    vanilla_gt -.but.-> noop[Cost / benefit at single-user low-batch:<br/>likely net-negative.]

Block D — the grammar-tutor verdict

grammar-tutor-verdict.md (~200 words):

  • Is decode latency a bottleneck for the grammar tutor in single-user mode? (Answer: no. At 3 ms per token on CPU and ~12 output tokens for a typical correction, decode takes about 36 ms in total, below the threshold at which a human perceives delay, commonly cited as roughly 100 ms.)
  • Could any speculative decoding variant help at the Phase 33 serving scale (continuous batching, multi-user)? (Answer: marginally. The target model is so small that batching accounts for most of the gain; speculative decoding would compound with it but isn't necessary.)
  • Of the four, which is the easiest to retrofit on the existing grammar tutor with no retraining? (Answer: Lookahead. No draft model, no retraining.)
  • What would the honest recommendation be? (Answer: don't add speculative decoding. The grammar tutor doesn't need it. If decode latency ever becomes a real problem, Lookahead first; vanilla with n-gram draft as a backup.)
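The latency argument in the first question is simple arithmetic, sketched below so you can rerun it with your own measurements. The 3 ms per token and 12 tokens come from this lab; the 100 ms perception threshold and the 2× speedup are assumptions chosen for illustration, and the helper name is hypothetical.

```python
MS_PER_TOKEN = 3.0               # from this lab: CPU decode, single user
OUTPUT_TOKENS = 12               # from this lab: typical correction length
PERCEPTION_THRESHOLD_MS = 100.0  # assumption: rough perceptible-delay limit
ASSUMED_SPEEDUP = 2.0            # assumption: optimistic speculative gain


def decode_latency_ms(ms_per_token, tokens):
    """Total decode time for one response, ignoring prefill."""
    return ms_per_token * tokens


baseline = decode_latency_ms(MS_PER_TOKEN, OUTPUT_TOKENS)
with_spec = baseline / ASSUMED_SPEEDUP
saved = baseline - with_spec

print(f"baseline: {baseline:.0f} ms")
print(f"with speculative decoding: {with_spec:.0f} ms (saves {saved:.0f} ms)")
print(f"already under threshold: {baseline < PERCEPTION_THRESHOLD_MS}")
```

Even with the optimistic speedup, the saving is a few tens of milliseconds on a response that was already under the assumed threshold, which is the quantitative core of the "defer" verdict.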

End with the meta-lesson: "speculative decoding is a real technique with real wins at appropriate scale. The grammar tutor is below that scale. Defer."

Constraints

  • No implementation. Pure reading + analysis. The point is to internalize the family, not to build it.
  • Cite explicitly. Each claim in summary.md cites a section or table of one of the four papers / blogs.
  • Honest verdicts. If you find yourself wanting to add speculative decoding to "look modern", catch the impulse. The verdict for the grammar tutor at the current scale is "defer".

Stop conditions

You're done when:

  1. experiments/36-speculative-survey/{summary.md, decision-tree.md, grammar-tutor-verdict.md} all exist.
  2. The comparison matrix in summary.md is filled in with sourced numbers.
  3. The decision-tree mermaid renders.
  4. The grammar-tutor verdict answers all four questions.
  5. You can recite, from memory, "speculative decoding works because decode is memory-bound; getting multiple tokens per forward pass improves arithmetic intensity."
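If the sentence in item 5 feels like a slogan, the back-of-the-envelope model below may help. It is a rough sketch under stated assumptions: every weight is read from memory once per forward pass, each parameter costs about 2 FLOPs per token processed, and KV-cache and activation traffic are ignored. The function name and the default of 2 bytes per parameter are hypothetical.

```python
def arithmetic_intensity(n_params, tokens_per_forward, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights are
    read from memory exactly once per pass (KV cache ignored).
    """
    flops = 2 * n_params * tokens_per_forward
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


# Plain decode: one token per pass.
plain = arithmetic_intensity(n_params=1_000_000_000, tokens_per_forward=1)
# Verifying 4 drafted tokens plus the next one in a single pass.
speculative = arithmetic_intensity(n_params=1_000_000_000, tokens_per_forward=5)

print(plain, speculative)
```

Under these assumptions the weight traffic is the same in both cases while the useful work grows with the number of tokens verified per pass, so intensity scales linearly. Real gains are smaller because only accepted tokens count.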

Hint of last resort

If you find the papers' notation impenetrable (each uses different symbols for the same things), use a unifying notation in your summary.md. Define \(p_T\) (target probability), \(p_D\) (draft probability), \(k\) (proposed length per round), \(\alpha\) (acceptance rate). Convert each paper's notation into yours.
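Once everything is expressed in that notation, a small calculator makes the papers easier to compare. The sketch below uses the expected-tokens formula from Leviathan et al., (1 - α^(k+1)) / (1 - α), which assumes each drafted token is accepted independently with probability α. The symbol c (cost of one draft step relative to one target step) is not part of the notation above and is introduced here as an assumption; check both formulas against the paper before citing them.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens produced per target forward pass.

    alpha: acceptance rate, assumed i.i.d. per drafted token.
    k: number of tokens proposed per round.
    """
    if alpha >= 1.0:
        # Every draft accepted, plus one token from the target itself.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, c):
    """Expected wall-clock speedup over plain decoding.

    c: cost of one draft step relative to one target step (assumption:
    not in the lab's notation; c near 0 means drafting is almost free).
    """
    return expected_tokens_per_round(alpha, k) / (k * c + 1.0)


print(expected_tokens_per_round(0.8, 4))  # about 3.36
print(expected_speedup(0.8, 4, 0.1))      # about 2.40
```

Note how quickly the gain decays when α drops: with α = 0 the round yields exactly one token, so any drafting cost is pure overhead.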

If you can't decide between EAGLE and Medusa for the "max-gain" branch: skim the EAGLE-2 paper's comparison table — they directly benchmark against Medusa with the same target model.

When to consult solutions/

After committing. Solution lives in solutions/03-speculative-ref.md — written at phase open. The reference includes the published-number ranges for the comparison matrix and the "defer" verdict for the grammar tutor with the same reasoning.


Next phase: docs/phase-37-security-safety/.