00 — The decode loop and why sampling matters

The model emits logits, not text. The function that turns them into a concrete token is the sampler. That function is where it is decided whether the model sounds correct, creative, repetitive, or broken.

The decode loop in one screen

After Phase 18 trained a Mini-GPT, generating text from it is literally this:

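import numpy as np

# EOS_TOKEN_ID is assumed to be defined by the tokenizer set up in earlier phases.

def generate(model, prompt: list[int], *, max_tokens: int, sampler) -> list[int]:
    tokens = list(prompt)                                 # work with a mutable copy
    for _ in range(max_tokens):
        logits = model(np.array(tokens))                  # (T, V)
        next_logits = logits[-1]                          # (V,): only the last position matters
        next_token = int(sampler(next_logits))            # plain int, even if the sampler returns a NumPy scalar
        tokens.append(next_token)
        if next_token == EOS_TOKEN_ID:                    # stop once end-of-sequence is produced
            break
    return tokens[len(prompt):]                           # only the generated suffix (includes EOS if sampled)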

Five operations: forward, slice the last position, sample, append, check stop. Repeat. That's it.

What changes between greedy and nucleus and "GPT-4-style sampling" is just the sampler function. The decode loop itself is shared.

What "sampler" can mean

Three families:

  1. Deterministic. Greedy (argmax) always returns the same token for the same logits. No randomness.
  2. Stochastic with the model's full distribution. Temperature-scaled softmax + sampling. The model's entire probability vector is consulted.
  3. Stochastic with truncation. Top-k or top-p first zero out unlikely tokens, renormalise, then sample from the remaining mass. Common in production because sampling from the full distribution occasionally draws a token from the low-probability tail, and a single such token can derail everything generated after it.

Each family makes a different diversity vs correctness trade-off, which lab 03 will measure.
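
As a rough illustration of the three families, here is a minimal NumPy sketch. The function names, the `rng` argument, and the example logits are assumptions made for this sketch, not the course's reference implementation; top-p is left out because file 02 covers truncation in detail.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def greedy(logits: np.ndarray) -> int:
    """Family 1: deterministic argmax."""
    return int(np.argmax(logits))

def temperature_sample(logits: np.ndarray, rng, temperature: float = 1.0) -> int:
    """Family 2: sample from the full temperature-scaled distribution."""
    probs = softmax(logits / temperature)
    return int(rng.choice(len(probs), p=probs))

def top_k_sample(logits: np.ndarray, rng, k: int = 3, temperature: float = 1.0) -> int:
    """Family 3: keep the k highest logits, mask the rest, then sample."""
    cutoff = np.sort(logits)[-k]                        # k-th largest logit
    # Ties at the cutoff are all kept, so slightly more than k tokens can survive.
    masked = np.where(logits >= cutoff, logits, -np.inf)
    probs = softmax(masked / temperature)               # masked entries get probability 0
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])          # illustrative values only
draws = [top_k_sample(logits, rng, k=2) for _ in range(200)]
print(greedy(logits), sorted(set(draws)))
```

To plug one of the stochastic samplers into the decode loop, bind the extra arguments first, for example `sampler=lambda l: top_k_sample(l, rng, k=5)`.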

Why we care about sampling on the verb corpus

For "Tomorrow she" as a prompt, the trained Mini-GPT should output a token that is:

  1. Syntactically valid: a verb form, not a punctuation token.
  2. Tense-correct: "will go", "is going to go", or a similar future-tense form.
  3. Person-correct: third-person singular, not first-person.

The model's logits encode probabilities for all of these. The sampler chooses which one we observe. With greedy, we get the single most likely continuation, every time. With stochastic sampling we get variety (sometimes "will go," sometimes "is going to walk"), and raising the temperature above 1 flattens the distribution and increases that variety. With the temperature too high, the distribution approaches uniform and we get garbage (random tokens).
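
To make the effect of temperature concrete, the sketch below rescales one set of logits at several temperatures and prints the resulting probabilities. The candidate tokens and logit values are hypothetical, chosen for illustration rather than read off the trained Mini-GPT; the math behind the scaling is covered in the next file.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - np.max(logits)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for four candidate next tokens (illustrative values).
candidates = ["will", "is", "goes", "."]
logits = np.array([3.0, 2.0, 0.0, -2.0])

for temperature in (0.5, 1.0, 2.0, 10.0):
    probs = softmax(logits / temperature)
    row = ", ".join(f"{c!r}: {p:.2f}" for c, p in zip(candidates, probs))
    print(f"T={temperature:>4}: {row}")
```

Low temperatures concentrate mass on the top candidate, so sampling behaves almost like greedy. As the temperature grows, the four probabilities move toward 0.25 each, which is where the "random tokens" failure mode comes from.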

Tuning the sampler is part of what makes Phase 32's tutor agent feel competent vs robotic.

A subtle point about probability vs logits

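The model outputs logits (unnormalised log-probabilities). To get probabilities you apply softmax. Several sampler steps can operate on the logits directly without computing softmax: argmax doesn't care about the normalisation, top-k just ranks, and temperature scaling is only a division of the logits. Explicit probabilities are needed for nucleus (top-p), which thresholds cumulative probability mass, and for the final random draw when it is done by indexing into a probability vector.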
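
A quick check of the ranking claims, as a sketch on random stand-in logits (the values, `k`, and the 0.9 threshold are arbitrary choices for illustration): argmax and the top-k set come out identical whether computed on logits or on probabilities, while the nucleus cutoff has to be computed from cumulative probability mass.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=64)                 # V = 64, random stand-in logits
probs = softmax(logits)

# Softmax preserves ordering, so argmax agrees on logits and probabilities.
same_argmax = int(np.argmax(logits)) == int(np.argmax(probs))

# For the same reason, the top-k set is identical either way.
k = 5
top_k_from_logits = set(np.argsort(logits)[-k:].tolist())
top_k_from_probs = set(np.argsort(probs)[-k:].tolist())
same_top_k = top_k_from_logits == top_k_from_probs

# Nucleus (top-p) needs real probabilities: cumulative mass in descending order.
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])
nucleus_size = int(np.searchsorted(cumulative, 0.9) + 1)   # smallest prefix with mass >= 0.9

print(same_argmax, same_top_k, nucleus_size)
```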

Numerical hygiene: if you compute softmax then sample, use the log-sum-exp trick (Phase 05). If you can sample without ever materialising the full probability vector (e.g., the Gumbel-max trick), even better. For our \(V = 64\) this doesn't matter; for \(V = 50000\) it would.
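
Both ideas fit in a few lines. The sketch below is an illustration under assumed names (`log_softmax`, `gumbel_max_sample`) and deliberately extreme logits: it computes log-probabilities with the log-sum-exp shift, then draws samples with the Gumbel-max trick, which adds independent standard Gumbel noise to each logit and takes the argmax. The empirical frequencies should land close to the softmax probabilities.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Log-probabilities via the log-sum-exp trick (shift by the max first)."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def gumbel_max_sample(logits: np.ndarray, rng, temperature: float = 1.0) -> int:
    """Draw from softmax(logits / temperature) without computing probabilities."""
    gumbel_noise = rng.gumbel(size=logits.shape)        # standard Gumbel(0, 1) noise
    return int(np.argmax(logits / temperature + gumbel_noise))

rng = np.random.default_rng(0)
logits = np.array([1000.0, 999.0, 995.0])               # a naive exp(logits) would overflow here

probs = np.exp(log_softmax(logits))                     # finite thanks to the shift
draws = np.array([gumbel_max_sample(logits, rng) for _ in range(20_000)])
empirical = np.bincount(draws, minlength=len(logits)) / len(draws)

print(np.round(probs, 3), np.round(empirical, 3))
```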

The cost: one forward pass per token

Each iteration of the loop calls model(tokens) — a full forward through the Mini-GPT on a sequence of length \(T_\text{current}\). As \(T_\text{current}\) grows, each step gets more expensive:

  • Token 0: forward on length \(L\) (the prompt).
  • Token 1: forward on length \(L + 1\).
  • Token 2: forward on length \(L + 2\).
  • ...
  • Token \(T - 1\), the last of \(T\) generated tokens: forward on length \(L + T - 1\).

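Total cost: \(\sum_{t=0}^{T-1} O((L + t) \cdot d^2) = O\big((TL + T(T-1)/2) \cdot d^2\big)\), which is bounded by \(O(T(L + T) d^2)\). For \(L = 8, T = 20, d = 64\) the sequence lengths sum to \(8 \cdot 20 + 190 = 350\), so the total is about \(350 \cdot 4096 \approx 1.4\) million "operations", still fast on a CPU.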
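
Evaluating the sum term by term is a useful sanity check. The helper below is a sketch of this section's simplified cost model (sequence length times \(d^2\) per step); the function name is made up for the example, and the "operations" are the same abstract units as above, not measured FLOPs.

```python
def naive_decode_cost(prompt_len: int, new_tokens: int, d_model: int) -> int:
    """Sum of (sequence length * d^2) over every decode step, with no cache."""
    return sum((prompt_len + t) * d_model**2 for t in range(new_tokens))

L, T, d = 8, 20, 64
total = naive_decode_cost(L, T, d)
closed_form = (T * L + T * (T - 1) // 2) * d**2   # same sum in closed form
last_step = (L + T - 1) * d**2                    # cost of the final, longest step

print(total, closed_form, last_step)              # 1433600 1433600 110592
```

The final step alone accounts for roughly 110,000 of those operations; the rest is the cost of every earlier, shorter forward pass.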

But notice: token 0's forward sees [prompt]; token 1's forward sees [prompt, gen_0]. The first \(L\) positions don't change between calls. Most of the forward computation is redundantly recomputed every step.

This is why Phase 22 introduces the KV cache: cache the attention keys/values from earlier positions so we only compute the new last position on each step. Phase 21 explicitly does not use a cache so you feel the cost.
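
To get a feel for how much work is redundant, the sketch below counts how many token positions pass through the model with and without a cache. It is an idealised count under a hypothetical helper name, not Phase 22's implementation: with a cache, each new position still attends over all cached keys/values, so the per-step attention cost continues to grow with sequence length.

```python
def positions_processed(prompt_len: int, new_tokens: int, *, cached: bool) -> int:
    """Count how many token positions pass through the model during decoding."""
    if cached:
        # One pass over the prompt, then one new position for each later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step t reprocesses all prompt_len + t positions.
    return sum(prompt_len + t for t in range(new_tokens))

L, T = 8, 20
without_cache = positions_processed(L, T, cached=False)
with_cache = positions_processed(L, T, cached=True)

print(without_cache, with_cache)   # 350 27
```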

What this file does NOT cover

  • The math of temperature. Next file.
  • Truncation strategies. File 02.
  • Cache implementation. Phase 22.

Next: 01-temperature.md