
Phase 21: Inference Internals & Sampling

Requires: 20 (Evaluation Harness)

Teaches: sampling · temperature · top-k · top-p · decode-cost

Jump to any chapter from the phase reference index.

Chapter map

After training the Mini-GPT, how do we get text out of it? This phase is the decoding slice: given a trained model and a prompt, how do we choose each token? Argmax, temperature, top-k, nucleus. It also covers the basic cost arithmetic (one forward pass per token) that motivates the KV cache of Phase 22.

Anchors: LYNX_CORTEX.md §4 / PHASE 21, PHASE_21_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13.

Goal

Phases 18–20 trained and evaluated the Mini-GPT on the §A13 verb corpus. Phase 21 covers the inference path: given the trained model and a prompt like "Tomorrow she", how do we draw output tokens?

By phase end Borja can: (a) implement greedy, temperature, top-k, and nucleus (top-p) sampling from scratch in src/minimodel/sampling.py; (b) explain the mathematical effect of temperature on the softmax distribution; (c) benchmark sampling diversity on verb-completion prompts; (d) reason about the decode cost (one forward pass per generated token), which sets up Phase 22's KV cache.
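For goal (b), the effect of temperature can be written down directly. The following is a standard sketch, assuming logits $z_1, \dots, z_V$ over a vocabulary of size $V$ (symbols introduced here for illustration, not taken from the phase plan) and $\tau > 0$:

$$
p_i(\tau) = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{V} \exp(z_j/\tau)},
\qquad
\frac{p_i(\tau)}{p_j(\tau)} = \exp\!\left(\frac{z_i - z_j}{\tau}\right)
$$

$$
\lim_{\tau \to 0^{+}} p_i(\tau) =
\begin{cases}
1 & \text{if } i = \arg\max_j z_j \\
0 & \text{otherwise}
\end{cases}
\quad (\text{assuming a unique maximum}),
\qquad
\lim_{\tau \to \infty} p_i(\tau) = \frac{1}{V}
$$

At $\tau = 1$ this is the ordinary softmax. For $z_i > z_j$, choosing $\tau < 1$ enlarges the ratio $p_i/p_j$ (a sharper distribution), while $\tau > 1$ pulls it toward 1 (a flatter one).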

What you'll build

A self-contained sampling module:

def sample(model, prompt: list[int], *, max_tokens: int, seed: int,
           strategy: SamplingStrategy) -> list[int]:
    """Returns the generated tokens (excluding the prompt)."""

Here SamplingStrategy is one of Greedy, Temperature(τ), TopK(k, τ), TopP(p, τ), or a composition of these. Each strategy is a pure function of the logits; the randomness of a draw comes from the seed argument of sample.
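One way to read "pure function of the logits" is that a strategy maps a logit vector to a probability vector, and a separate seeded draw picks the token. The sketch below follows that reading using only the standard library. The names `Strategy`, `softmax`, `draw`, and the lowercase constructor functions are hypothetical and are not taken from src/minimodel/sampling.py.

```python
import math
import random
from typing import Callable, Sequence

# Hypothetical alias: a strategy turns raw logits into a probability vector.
Strategy = Callable[[Sequence[float]], list[float]]


def softmax(logits: Sequence[float], tau: float = 1.0) -> list[float]:
    """Softmax of logits / tau, shifted by the max for numerical stability."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy() -> Strategy:
    """All probability mass on the arg-max token (first index wins ties)."""
    def strategy(logits: Sequence[float]) -> list[float]:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    return strategy


def temperature(tau: float) -> Strategy:
    """Plain softmax at temperature tau, no truncation."""
    return lambda logits: softmax(logits, tau)


def top_k(k: int, tau: float = 1.0) -> Strategy:
    """Keep the k largest logits, mask the rest to -inf, then softmax."""
    if k < 1:
        raise ValueError("k must be at least 1")
    def strategy(logits: Sequence[float]) -> list[float]:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        keep = set(order[:k])
        masked = [z if i in keep else -math.inf for i, z in enumerate(logits)]
        return softmax(masked, tau)
    return strategy


def top_p(p: float, tau: float = 1.0) -> Strategy:
    """Keep the smallest high-probability prefix whose mass reaches p, then renormalize."""
    def strategy(logits: Sequence[float]) -> list[float]:
        probs = softmax(logits, tau)
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept: list[int] = []
        cumulative = 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= p:
                break
        total = sum(probs[i] for i in kept)
        out = [0.0] * len(probs)
        for i in kept:
            out[i] = probs[i] / total
        return out
    return strategy


def draw(probs: Sequence[float], rng: random.Random) -> int:
    """Sample one token id from probs using the caller's seeded RNG."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Usage: the strategy is deterministic; only draw() consumes randomness.
logits = [2.0, 1.0, 0.0, -1.0]
token = draw(top_p(0.9)(logits), random.Random(0))
```

Keeping the draw separate from the strategy means two runs with the same seed and the same logits pick the same token, which is what makes the samplers easy to test.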

What this phase does NOT cover

KV cache. Covered in Phase 22. Phase 21 recomputes the full forward pass at every step (slow but pedagogically clean).
Beam search. Better in production for translation tasks; overkill for our small corpus.
Speculative decoding. Phase 36.
Streaming output / chunked decode. Phase 33's serving layer.
Stop tokens via training (<eos> learned). Phase 12/18 dealt with the EOS token; Phase 21 just uses it as a termination signal.
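The cost of recomputing the full forward pass at every step, as noted in the KV cache item above, can be made concrete by counting token positions. The sketch below is a minimal greedy decode loop under stated assumptions: `forward` is a hypothetical stand-in that maps a token prefix to next-token logits, `toy_forward` is a fake model over a 5-token vocabulary, and stopping before appending EOS is an illustrative choice rather than the phase's specified behaviour.

```python
from typing import Callable, Optional, Sequence


def naive_greedy_decode(
    forward: Callable[[Sequence[int]], Sequence[float]],
    prompt: Sequence[int],
    max_tokens: int,
    eos_id: Optional[int] = None,
) -> tuple[list[int], int]:
    """Greedy decode with no cache; also counts positions fed to the model."""
    tokens = list(prompt)
    generated: list[int] = []
    positions = 0
    for _ in range(max_tokens):
        logits = forward(tokens)      # full forward over the whole prefix
        positions += len(tokens)      # every earlier position is recomputed
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == eos_id:         # EOS used purely as a termination signal
            break
        generated.append(next_id)
        tokens.append(next_id)
    return generated, positions


def toy_forward(tokens: Sequence[int]) -> list[float]:
    """Stand-in model over 5 tokens: favours (last token + 1) mod 5."""
    target = (tokens[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


generated, positions = naive_greedy_decode(toy_forward, prompt=[0, 1], max_tokens=4)
# generated == [2, 3, 4, 0]; positions == 2 + 3 + 4 + 5 == 14
```

For a prompt of length P and T generated tokens with no early stop, this loop processes T·P + T(T−1)/2 positions, so the work grows quadratically in T. Here P = 2 and T = 4 give 8 + 6 = 14.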

Read order

theory/00-motivation.md — why sampling matters, the decode loop shape.
theory/01-temperature.md — temperature math and limits.
theory/02-top-k-and-top-p.md — truncation strategies.
theory/03-cost-model.md — decode-time arithmetic intensity; motivation for KV cache.
lab/00-greedy.md — implement greedy decode.
lab/01-temperature-sweep.md — entropy as a function of τ.
lab/02-top-k-and-top-p.md — truncation samplers.
lab/03-diversity-vs-accuracy.md — sampling diversity on verb completions.

solutions/ is populated when the phase opens.

Definition of Done

See PHASE_21_PLAN.md §6 (at repo root). Briefly:

All 4 sampler functions implemented in src/minimodel/sampling.py; mypy --strict clean.
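Property tests: top-p with p = 1.0 is equivalent to vanilla (untruncated) sampling; greedy is the limit of temperature sampling as τ → 0; the entropy of the sampling distribution is monotonically non-decreasing in τ.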
Diversity-vs-correctness curve saved.
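The three property tests listed above can be sketched as plain checks on a fixed logit vector. This is an illustrative, standard-library-only version: the helper names (`softmax`, `entropy`, `nucleus`, and the three check functions) are hypothetical, the greedy check assumes a unique largest logit, and the real tests in PHASE_21_PLAN.md §6 may be structured differently.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float], tau: float = 1.0) -> list[float]:
    """Softmax of logits / tau, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def nucleus(probs: Sequence[float], p: float) -> list[float]:
    """Keep the smallest high-probability prefix with mass >= p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out


def top_p_one_is_vanilla(logits: Sequence[float]) -> bool:
    """top-p with p = 1.0 should leave the softmax distribution unchanged."""
    probs = softmax(logits)
    return all(abs(a - b) < 1e-12 for a, b in zip(nucleus(probs, 1.0), probs))


def greedy_is_cold_limit(logits: Sequence[float], tau: float = 1e-3) -> bool:
    """At very small tau almost all mass sits on the arg-max token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return softmax(logits, tau)[best] > 1.0 - 1e-9


def entropy_is_monotone(logits: Sequence[float], taus: Sequence[float]) -> bool:
    """Entropy should not decrease as tau increases."""
    values = [entropy(softmax(logits, t)) for t in sorted(taus)]
    return all(a <= b + 1e-12 for a, b in zip(values, values[1:]))


example_logits = [2.0, 1.0, 0.5, -1.0]
all_hold = (
    top_p_one_is_vanilla(example_logits)
    and greedy_is_cold_limit(example_logits)
    and entropy_is_monotone(example_logits, [0.25, 0.5, 1.0, 2.0, 4.0])
)
```

A property-testing setup would run the same checks over many generated logit vectors instead of one fixed example.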

Next: theory/00-motivation.md