Phase 21: Inference Internals & Sampling
Requires: Phase 20 (Evaluation Harness)
Teaches: sampling · temperature · top-k · top-p · decode-cost
Jump to any chapter from the phase reference index.
After training the Mini-GPT, how do we get text out of it? This phase is a narrow window onto decoding: given a trained model and a prompt, how do we choose each token? Argmax, temperature, top-k, nucleus. It also covers the basic cost accounting (one forward pass per token) that motivates the KV cache in Phase 22.
Anchors: LYNX_CORTEX.md §4 / PHASE 21, PHASE_21_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13.
Goal
Phases 18–20 trained and evaluated the Mini-GPT on the §A13 verb corpus. Phase 21 covers the inference path: given the trained model and a prompt like "Tomorrow she", how do we draw output tokens?
By the end of the phase Borja can: (a) implement greedy, temperature, top-k, and nucleus (top-p) sampling from scratch in src/minimodel/sampling.py; (b) explain the mathematical effect of temperature on the softmax distribution; (c) benchmark sampling diversity on verb-completion prompts; (d) reason about the decode cost (one forward pass per generated token), which sets up Phase 22's KV cache.
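For item (b), it may help to have the standard temperature-scaled softmax written out. The notation here is an assumption, not taken from the phase materials: z_i is the logit for token i and V is the vocabulary size.

```latex
p_i(\tau) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{V} \exp(z_j / \tau)}, \qquad \tau > 0

% Limits (assuming a unique largest logit):
%   \tau \to 0^{+}  : p(\tau) \to one-hot on \arg\max_i z_i   (greedy decoding)
%   \tau = 1        : the plain softmax of the logits
%   \tau \to \infty : p_i(\tau) \to 1/V                       (uniform)
```

Dividing by τ rescales the gaps between logits: τ < 1 widens them and sharpens the distribution, τ > 1 shrinks them and flattens it.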
What you'll build
A self-contained sampling module:

    def sample(model, prompt: list[int], *, max_tokens: int, seed: int,
               strategy: SamplingStrategy) -> list[int]:
        """Returns the generated tokens (excluding the prompt)."""
Here SamplingStrategy is one of Greedy, Temperature(τ), TopK(k, τ), TopP(p, τ), or a composition of them. Each strategy is a pure function of the logits, so all randomness comes from the seed passed to sample.
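One way the signature above could be filled in is sketched below, in pure Python. Several things are assumptions for illustration and not the phase's actual design: the model is treated as a callable that maps the token prefix to next-token logits, each strategy is modelled as a function from logits to a probability vector, and the names `softmax`, `greedy`, `temperature`, `top_k`, `top_p`, the `eos_id` parameter, and `toy_model` are hypothetical.

```python
import math
import random
from typing import Callable, Optional, Sequence

Logits = Sequence[float]
Probs = list[float]
# Assumed strategy shape: a pure function from next-token logits to probabilities.
SamplingStrategy = Callable[[Logits], Probs]


def softmax(logits: Logits, tau: float = 1.0) -> Probs:
    """Numerically stable softmax of logits / tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy() -> SamplingStrategy:
    def apply(logits: Logits) -> Probs:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    return apply


def temperature(tau: float) -> SamplingStrategy:
    return lambda logits: softmax(logits, tau)


def top_k(k: int, tau: float = 1.0) -> SamplingStrategy:
    def apply(logits: Logits) -> Probs:
        probs = softmax(logits, tau)
        # Keep the k most probable tokens, then renormalize.
        kept = set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])
        mass = sum(probs[i] for i in kept)
        return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
    return apply


def top_p(p: float, tau: float = 1.0) -> SamplingStrategy:
    def apply(logits: Logits) -> Probs:
        probs = softmax(logits, tau)
        # Keep the smallest high-probability prefix whose mass reaches p.
        kept: set[int] = set()
        mass = 0.0
        for i in sorted(range(len(probs)), key=lambda i: -probs[i]):
            kept.add(i)
            mass += probs[i]
            if mass >= p:
                break
        return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]
    return apply


def sample(model: Callable[[list[int]], Logits], prompt: list[int], *,
           max_tokens: int, seed: int, strategy: SamplingStrategy,
           eos_id: Optional[int] = None) -> list[int]:
    """Returns the generated tokens (excluding the prompt)."""
    rng = random.Random(seed)          # all randomness flows from the seed
    tokens = list(prompt)
    generated: list[int] = []
    for _ in range(max_tokens):
        logits = model(tokens)         # full forward over the prefix, no cache
        probs = strategy(logits)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if token == eos_id:            # EOS only terminates; it is not emitted
            break
        generated.append(token)
        tokens.append(token)
    return generated


def toy_model(tokens: list[int]) -> Logits:
    """Hypothetical stand-in for the trained model: fixed logits."""
    return [0.1, 2.0, -1.0]


print(sample(toy_model, [0], max_tokens=3, seed=0, strategy=greedy()))  # [1, 1, 1]
```

In this sketch composition falls out of the shape of the strategies: `top_k(k, tau)` and `top_p(p, tau)` apply temperature first and truncation second, and greedy is expressed as a one-hot distribution so the decode loop needs no special case.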
What this phase does NOT cover
- KV cache. Phase 22. Phase 21 recomputes the full forward at every step (slow but pedagogically clean).
- Beam search. Better in production for translation tasks; overkill for our small corpus.
- Speculative decoding. Phase 36.
- Streaming output / chunked decode. Phase 33's serving layer.
- Stop tokens via training (<eos> learned). Phase 12/18 dealt with the EOS token; Phase 21 just uses it as a termination signal.
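The cost of skipping the KV cache can be made concrete with a little arithmetic. The sketch below counts how many token positions pass through the model during generation. It assumes the usual accounting: without a cache, each step re-runs the whole prefix, while a cache would process the prompt once and then one new token per step. The function names are hypothetical, and the count is token positions, not FLOPs.

```python
def positions_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when every step re-runs the full prefix."""
    # Step t (0-indexed) forwards over the prompt plus the t tokens generated so far.
    return sum(prompt_len + t for t in range(new_tokens))


def positions_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed if earlier keys/values were reused (assumed model)."""
    if new_tokens == 0:
        return 0
    # One pass over the prompt, then a single new position per remaining step.
    return prompt_len + (new_tokens - 1)


print(positions_without_cache(4, 16))  # 184
print(positions_with_cache(4, 16))     # 19
```

The uncached total is T·P + T(T−1)/2 for a prompt of length P and T generated tokens, so it grows quadratically in T, whereas the cached count grows linearly.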
Read order
1. theory/00-motivation.md: why sampling matters, the decode loop shape.
2. theory/01-temperature.md: temperature math and limits.
3. theory/02-top-k-and-top-p.md: truncation strategies.
4. theory/03-cost-model.md: decode-time arithmetic intensity; motivation for KV cache.
5. lab/00-greedy.md: implement greedy decode.
6. lab/01-temperature-sweep.md: entropy as a function of τ.
7. lab/02-top-k-and-top-p.md: truncation samplers.
8. lab/03-diversity-vs-accuracy.md: sampling diversity on verb completions.
solutions/ populated at phase open.
Definition of Done
See PHASE_21_PLAN.md §6 (at repo root). Briefly:
- All 4 sampler functions implemented in src/minimodel/sampling.py; mypy --strict clean.
- Property tests: top-p=1.0 ≡ vanilla; greedy ≡ limit as τ→0; entropy is monotone in τ.
- Diversity-vs-correctness curve saved.
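The three properties listed above can be checked on a single logit vector, as in the sketch below. It is self-contained and not tied to the real src/minimodel/sampling.py: the helpers `softmax`, `entropy`, and `top_p_filter`, the example logits, and the tolerances are all assumptions made for illustration.

```python
import math


def softmax(logits: list[float], tau: float = 1.0) -> list[float]:
    """Numerically stable softmax of logits / tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_p_filter(probs: list[float], p: float) -> list[float]:
    """Keep the smallest high-probability set with mass >= p, then renormalize."""
    kept: set[int] = set()
    mass = 0.0
    for i in sorted(range(len(probs)), key=lambda i: -probs[i]):
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


LOGITS = [2.0, 1.0, 0.5, -1.0]  # arbitrary example logits with a unique maximum


def test_top_p_one_is_vanilla() -> None:
    vanilla = softmax(LOGITS)
    filtered = top_p_filter(vanilla, 1.0)
    assert all(abs(a - b) < 1e-12 for a, b in zip(vanilla, filtered))


def test_greedy_is_low_temperature_limit() -> None:
    cold = softmax(LOGITS, tau=1e-3)
    argmax = max(range(len(LOGITS)), key=lambda i: LOGITS[i])
    assert cold.index(max(cold)) == argmax
    assert cold[argmax] > 0.999  # almost all mass sits on the argmax token


def test_entropy_increases_with_temperature() -> None:
    taus = [0.1, 0.5, 1.0, 2.0, 5.0]
    ents = [entropy(softmax(LOGITS, tau)) for tau in taus]
    assert all(a < b for a, b in zip(ents, ents[1:]))
```

The functions use plain asserts, so they can be called directly or collected by a test runner. A fuller property test would presumably draw many random logit vectors instead of fixing one.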
Next: theory/00-motivation.md
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 The Curious Case of Neural Text Degeneration, Holtzman et al., 2019. The paper that introduced nucleus (top-p) sampling.