Phase 21 — Inference Internals & Sampling

Requires: 20 (Evaluation Harness)
Teaches: sampling · temperature · top-k · top-p · decode-cost

After training the Mini-GPT, how do we get text out of it? This phase is a window into decoding: given a trained model and a prompt, how do we choose each token? Argmax, temperature, top-k, nucleus. It also covers the basic cost arithmetic (one forward pass per token) that motivates the KV cache in Phase 22.

Anchors: LYNX_CORTEX.md §4 / PHASE 21, PHASE_21_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13.

Goal

Phase 18–20 trained and evaluated the Mini-GPT on the §A13 verb corpus. Phase 21 covers the inference path: given the trained model and a prompt like "Tomorrow she", how do we draw output tokens?

By phase end Borja can: (a) implement greedy, temperature, top-k, and nucleus (top-p) sampling from scratch in src/minimodel/sampling.py; (b) explain the mathematical effect of temperature on the softmax distribution; (c) benchmark sampling diversity on verb-completion prompts; (d) reason about the decode cost (one forward pass per generated token), which sets up Phase 22's KV cache.
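For item (b), the standard temperature-scaled softmax can be written out as below. This is a sketch in notation the phase text does not itself fix: $z_i$ stands for the logit of token $i$ and $V$ for the vocabulary size, both assumed here for illustration.

```latex
p_i(\tau) = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{V} \exp(z_j / \tau)},
\qquad
\frac{p_i(\tau)}{p_j(\tau)} = \exp\!\left(\frac{z_i - z_j}{\tau}\right)

\lim_{\tau \to 0^{+}} p_i(\tau) =
\begin{cases}
1 & \text{if } i = \arg\max_j z_j \text{ (assuming a unique maximum)} \\
0 & \text{otherwise}
\end{cases}
\qquad
\lim_{\tau \to \infty} p_i(\tau) = \frac{1}{V}
```

The ratio form shows the mechanism: dividing by $\tau < 1$ stretches the gaps between logits and sharpens the distribution, while $\tau > 1$ shrinks them and flattens it.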

What you'll build

A self-contained sampling module:

def sample(model, prompt: list[int], *, max_tokens: int, seed: int,
           strategy: SamplingStrategy) -> list[int]:
    """Returns the generated tokens (excluding the prompt)."""

Where SamplingStrategy is one of Greedy, Temperature(τ), TopK(k, τ), TopP(p, τ), or a composition. Each strategy is a pure function of the logits.
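One way to read "a pure function of the logits" is sketched below using only the standard library. The function names (`softmax`, `greedy`, `top_k_filter`, `top_p_filter`, `sample_token`) are hypothetical and are not the module's actual API. The sketch also assumes that truncation is expressed by masking logits to negative infinity, so that filters compose before a single sampling step.

```python
import math
import random


def softmax(logits: list[float], tau: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    if tau <= 0:
        raise ValueError("tau must be positive; use greedy() for the tau -> 0 limit")
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # exp(-inf) is 0.0, so masked tokens drop out
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits: list[float]) -> int:
    """Index of the largest logit (the first index wins ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


def top_k_filter(logits: list[float], k: int) -> list[float]:
    """Keep the k largest logits and mask the rest to -inf."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    keep = set(order[:k])
    return [z if i in keep else -math.inf for i, z in enumerate(logits)]


def top_p_filter(logits: list[float], p: float) -> list[float]:
    """Keep the smallest high-probability prefix whose mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=probs.__getitem__, reverse=True)
    keep: set[int] = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [z if i in keep else -math.inf for i, z in enumerate(logits)]


def sample_token(logits: list[float], rng: random.Random, tau: float = 1.0) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Composition: truncate first, then sample with temperature (one possible order).
rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]
token = sample_token(top_k_filter(logits, 2), rng, tau=0.7)
```

Passing the `random.Random` instance explicitly keeps each function deterministic for a given seed, which matches the `seed` argument in the signature above. Whether temperature is applied before or after truncation is a design choice that this sketch does not settle.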

What this phase does NOT cover

  • KV cache. Phase 22. Phase 21 recomputes the full forward at every step (slow but pedagogically clean).
  • Beam search. Often preferred in production for translation-style tasks; overkill for our small corpus.
  • Speculative decoding. Phase 36.
  • Streaming output / chunked decode. Phase 33's serving layer.
  • Stop tokens via training (<eos> learned). Phase 12/18 dealt with the EOS token; Phase 21 just uses it as a termination signal.
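The cache-free decode loop described in the first bullet can be sketched as follows. The `Model` callable, `toy_model`, and the `eos_id` parameter are hypothetical stand-ins for the real Mini-GPT interface, and the sketch assumes the EOS token is included in the returned tokens, which the phase text does not specify.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model interface: full token sequence in, next-token logits out.
Model = Callable[[Sequence[int]], list[float]]


def greedy_decode(
    model: Model,
    prompt: list[int],
    *,
    max_tokens: int,
    eos_id: Optional[int] = None,
) -> tuple[list[int], int]:
    """Return (generated tokens, total positions pushed through the model)."""
    tokens = list(prompt)
    positions_processed = 0
    for _ in range(max_tokens):
        # No KV cache: every step re-runs the model over the whole sequence.
        logits = model(tokens)
        positions_processed += len(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break  # EOS is only a termination signal here
    return tokens[len(prompt):], positions_processed


def toy_model(tokens: Sequence[int]) -> list[float]:
    """Stand-in model over a 5-token vocabulary: prefers (last token + 1) mod 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated, cost = greedy_decode(toy_model, [0, 1], max_tokens=4)
print(generated, cost)  # [2, 3, 4, 0] 14
```

With a prompt of length P and N generated tokens, the loop processes N·P + N(N−1)/2 positions in total (here 4·2 + 6 = 14), so the work grows quadratically in N when nothing is cached.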

Read order

  1. theory/00-motivation.md — why sampling matters, the decode loop shape.
  2. theory/01-temperature.md — temperature math and limits.
  3. theory/02-top-k-and-top-p.md — truncation strategies.
  4. theory/03-cost-model.md — decode-time arithmetic intensity; motivation for KV cache.
  5. lab/00-greedy.md — implement greedy decode.
  6. lab/01-temperature-sweep.md — entropy as a function of τ.
  7. lab/02-top-k-and-top-p.md — truncation samplers.
  8. lab/03-diversity-vs-accuracy.md — sampling diversity on verb completions.

The solutions/ directory is populated when the phase opens.

Definition of Done

See PHASE_21_PLAN.md §6 (at repo root). Briefly:

  • All 4 sampler functions implemented in src/minimodel/sampling.py; mypy --strict clean.
  • Property tests: top-p = 1.0 ≡ untruncated (vanilla) sampling; greedy ≡ the limit as τ→0; entropy is monotonically non-decreasing in τ.
  • Diversity-vs-correctness curve saved.
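Two of the property tests listed above can be checked numerically in a few lines. This is a minimal sketch on one hand-picked logit vector, not the phase's actual test suite; the `softmax` and `entropy` helpers are defined here only for illustration.

```python
import math


def softmax(logits: list[float], tau: float) -> list[float]:
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.5, -1.0]
taus = [0.1, 0.5, 1.0, 2.0, 10.0]
entropies = [entropy(softmax(logits, t)) for t in taus]

# Entropy should not decrease as temperature rises.
assert all(a <= b + 1e-12 for a, b in zip(entropies, entropies[1:]))

# It is bounded above by the uniform distribution's entropy, log(V).
assert entropies[-1] <= math.log(len(logits)) + 1e-12

# As tau -> 0 the mass concentrates on the argmax, which is the greedy choice.
argmax = max(range(len(logits)), key=logits.__getitem__)
assert softmax(logits, 0.01)[argmax] > 0.999
```

A property-based framework could generalize this to random logit vectors; the fixed example only shows the shape of the assertions.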

Next: theory/00-motivation.md

Further reading

Optional — enrichment, not required to pass the phase.