02 — Top-k and top-p (nucleus) truncation¶
Temperature scales the distribution; truncation clips it. Top-k keeps the k tokens with the highest logits. Top-p keeps the smallest set whose cumulative mass reaches at least p. Top-k is fixed; top-p adapts. Both combine with temperature.
Top-k¶
Given logits \(z \in \mathbb{R}^V\) and \(k \in \{1, \ldots, V\}\):
- Find the \(k\) largest logits; call their indices \(\mathcal{S}_k\).
- Mask every logit not in \(\mathcal{S}_k\) by setting it to \(-\infty\), so that it receives zero probability.
- Apply softmax to the result.
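The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `top_k_probs` and the example logits are made up for this sketch, and ties at the \(k\)-th position are broken arbitrarily by `np.argpartition`.

```python
import numpy as np

def top_k_probs(logits, k):
    """Return the softmax distribution restricted to the k largest logits."""
    logits = np.asarray(logits, dtype=float)
    # Indices of the k largest logits (argpartition avoids a full sort).
    keep = np.argpartition(logits, -k)[-k:]
    masked = np.full_like(logits, -np.inf)   # a -inf logit becomes probability 0
    masked[keep] = logits[keep]
    exp = np.exp(masked - masked.max())      # numerically stable softmax
    return exp / exp.sum()

# Hypothetical logits over a 4-token vocabulary.
probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
print(probs)  # only the first two entries are non-zero
```

Masking with \(-\infty\) before the softmax, rather than zeroing probabilities afterwards, means the renormalisation happens automatically.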
Easy to reason about: "the model picks among its top \(k\) guesses." Easy to set: \(k = 40\) or \(k = 50\) are common defaults in the literature.
Failure mode: \(k\) is fixed regardless of the shape of the distribution. When the distribution is flat (the model is uncertain), top-\(k\) still keeps exactly \(k\) tokens and discards others that are almost as likely. This wastes the "uncertainty" signal: a flat distribution should produce more diverse outputs, not the same \(k\) tokens that happen to lead. Conversely, when the distribution is peaked, the \(k\) survivors can include tokens with very low probability.
Top-p (nucleus)¶
Holtzman et al. 2020 ("The Curious Case of Neural Text Degeneration") introduced nucleus sampling to address top-k's flat-distribution problem.
Given logits \(z\) and \(p \in (0, 1]\):
- Compute \(q = \text{softmax}(z)\).
- Sort tokens by descending probability: \(q_{i_1} \ge q_{i_2} \ge \ldots \ge q_{i_V}\).
- Find the smallest \(K^*\) such that \(\sum_{r=1}^{K^*} q_{i_r} \ge p\).
- Zero out tokens \(i_{K^* + 1}, \ldots, i_V\); renormalise the remaining.
The set \(\{i_1, \ldots, i_{K^*}\}\) is the nucleus. Its size adapts:
- Peaked distribution (e.g., \(q_{i_1} = 0.9\)): \(K^*\) might be 1 or 2. Top-\(p\) behaves like greedy or near-greedy.
- Flat distribution (e.g., uniform): \(K^*\) might be hundreds. Top-\(p\) samples from a wide pool.
Common values: \(p = 0.9\) or \(p = 0.95\); intermediate values such as \(p = 0.92\) also appear. A larger \(p\) admits more of the tail and therefore more diversity.
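To make the adaptive nucleus size concrete, here is a small sketch that computes \(K^*\) for a peaked and a uniform distribution. The helper `nucleus_size` and both example distributions are illustrative assumptions; the vocabulary size of 64 is chosen only to keep the numbers small.

```python
import numpy as np

def nucleus_size(probs, p):
    """Smallest K* whose top-K* probabilities sum to at least p."""
    sorted_probs = np.sort(np.asarray(probs, dtype=float))[::-1]  # descending
    cumsum = np.cumsum(sorted_probs)
    # searchsorted returns the first 0-based position where cumsum >= p.
    return min(int(np.searchsorted(cumsum, p)) + 1, len(sorted_probs))

peaked = np.array([0.9] + [0.1 / 63] * 63)   # one dominant token, 64 tokens total
uniform = np.full(64, 1 / 64)                # maximally flat, 64 tokens total

print(nucleus_size(peaked, 0.9))    # 1
print(nucleus_size(uniform, 0.9))   # 58
```

The same \(p\) yields a nucleus of one token in the peaked case and 58 of 64 tokens in the uniform case, which is the adaptivity that a fixed \(k\) cannot provide.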
Top-k vs top-p, side by side¶
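| Logits over 5 tokens | Top-k (k=2) | Top-p (p=0.9) |
|---|---|---|
| [5, 4, 0, 0, 0] (peaked) | {5, 4} | {5, 4} (5 alone holds only about 0.72 of the mass; adding 4 brings the total to about 0.99) |
| [8, 4, 0, 0, 0] (sharply peaked) | {8, 4} | {8} (8 alone holds about 0.98 of the mass) |
| [2, 1.9, 1.8, 1.7, 0] (flat) | {2, 1.9} | {2, 1.9, 1.8, 1.7} (the first three reach only about 0.76; the fourth brings the total to about 0.96) |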
Top-p is the better default when the model's confidence varies. Top-k is cheaper to compute (no sort of the full distribution, just a partial top-k).
For our small vocabulary (\(V = 64\)), the sort cost is negligible. Use top-p.
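The kept sets for small logit vectors are easy to check numerically. The sketch below uses hypothetical helpers `top_k_set` and `top_p_set` that return token indices (most probable first), and runs them on the vectors [5, 4, 0, 0, 0] and [2, 1.9, 1.8, 1.7, 0]. The softmax of the first is roughly [0.72, 0.27, 0.005, 0.005, 0.005], so the top token alone does not reach \(p = 0.9\).

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def top_k_set(logits, k):
    """Indices of the k largest logits, most probable first."""
    order = np.argsort(-np.asarray(logits, dtype=float), kind="stable")
    return order[:k].tolist()

def top_p_set(logits, p):
    """Indices of the nucleus: shortest sorted prefix with mass >= p."""
    q = softmax(logits)
    order = np.argsort(-q, kind="stable")
    cumsum = np.cumsum(q[order])
    k_star = min(int(np.searchsorted(cumsum, p)) + 1, len(q))
    return order[:k_star].tolist()

for logits in ([5, 4, 0, 0, 0], [2, 1.9, 1.8, 1.7, 0]):
    print(np.round(softmax(logits), 3), top_k_set(logits, 2), top_p_set(logits, 0.9))
```

For the first vector both rules keep indices [0, 1]; for the second, top-k keeps [0, 1] while top-p keeps [0, 1, 2, 3].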
Combining temperature with truncation¶
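You can apply temperature first (to all logits), then truncate. The order matters for top-p but not for top-k. Dividing by \(\tau > 0\) preserves the ranking of the logits, so top-k keeps the same set of tokens whether temperature is applied before or after the cut, and the final distribution is the same. For top-p the ranking is also preserved, but the cumulative mass shifts: a lower temperature concentrates mass on the leading tokens and shrinks the nucleus, while a higher temperature spreads the mass and enlarges it.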
Common recipe: Temperature(τ=0.7) followed by TopP(p=0.9). Apply temperature, then truncate, then sample.
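```python
import numpy as np

def temperature_then_topp(logits, tau, p, rng):
    """Scale logits by 1/tau, keep the top-p nucleus, sample one token index.

    rng is assumed to be a numpy.random.Generator, e.g. np.random.default_rng(0).
    """
    scaled = np.asarray(logits, dtype=float) / tau
    exp = np.exp(scaled - scaled.max())             # numerically stable softmax
    probs = exp / exp.sum()
    sorted_idx = np.argsort(-probs, kind="stable")  # most probable token first
    cumsum = np.cumsum(probs[sorted_idx])
    # Smallest K* such that the top-K* cumulative mass is >= p.
    cutoff = min(int(np.searchsorted(cumsum, p)) + 1, len(probs))
    truncated = np.zeros_like(probs)
    truncated[sorted_idx[:cutoff]] = probs[sorted_idx[:cutoff]]
    truncated /= truncated.sum()                    # renormalise
    return rng.choice(len(probs), p=truncated)
```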
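A quick numerical check of how temperature interacts with each truncation rule may help. The logits below are made up for illustration, and `nucleus_size` is a hypothetical helper. The sketch prints, for several temperatures, the indices a top-2 cut would keep and the size of the \(p = 0.9\) nucleus.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def nucleus_size(probs, p):
    """Smallest K* whose top-K* probabilities sum to at least p."""
    cumsum = np.cumsum(np.sort(probs)[::-1])
    return min(int(np.searchsorted(cumsum, p)) + 1, len(probs))

logits = np.array([2.0, 1.5, 1.0, 0.5, 0.0, -0.5])  # hypothetical example logits

for tau in (0.5, 1.0, 2.0):
    scaled = logits / tau
    top2 = np.argsort(-scaled, kind="stable")[:2].tolist()
    print(tau, top2, nucleus_size(softmax(scaled), 0.9))
# 0.5 [0, 1] 3
# 1.0 [0, 1] 4
# 2.0 [0, 1] 5
```

The top-2 set is the same at every temperature because scaling by a positive constant never reorders the logits, while the nucleus grows from 3 to 5 tokens as the temperature rises.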
Property: top-p = 1.0 is a no-op¶
For \(p = 1.0\), the nucleus is the full vocabulary; the distribution is unchanged. This is a useful sanity test: TopP(p=1.0) composed with anything should produce identical output to the un-truncated version.
Lab 02 will test this. If your top-p implementation produces different results at \(p = 1.0\), the most likely cause is an off-by-one in the cumulative-sum or threshold check.
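One way such a sanity test could look is sketched below. It is not the Lab 02 test itself: `top_p_filter` is a hypothetical stand-in for your own implementation, and the comparison uses `np.allclose` rather than exact equality because renormalising can change the last few floating-point bits.

```python
import numpy as np

def top_p_filter(probs, p):
    """Zero out everything outside the nucleus and renormalise."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")
    cumsum = np.cumsum(probs[order])
    # Clamp to the vocabulary size in case rounding leaves cumsum[-1] just below p.
    k_star = min(int(np.searchsorted(cumsum, p)) + 1, len(probs))
    out = np.zeros_like(probs)
    out[order[:k_star]] = probs[order[:k_star]]
    return out / out.sum()

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(64))   # random distribution over 64 tokens

assert np.allclose(top_p_filter(probs, 1.0), probs)      # p = 1.0 changes nothing
assert np.count_nonzero(top_p_filter(probs, 0.5)) < 64   # p < 1 actually truncates
```

The clamp on `k_star` guards against the case where floating-point rounding leaves the final cumulative sum slightly below 1.0.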
Property: top-k = 1 ≈ greedy¶
If k = 1, only the argmax token survives, and sampling from a single-token distribution is deterministic. So TopK(k=1) + sample() is greedy decoding.
(Subtle: if there are ties in the logits at the argmax, top-k=1 with a stable sort picks one; argmax typically picks the first index. They can disagree on tie-breaking. Rare in practice.)
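The equivalence can be checked empirically with a short sketch. The function `sample_top_k` and the random logits are assumptions made for this example. Here a stable sort on the negated logits resolves ties to the lowest index, which matches `np.argmax`; an implementation based on a different selection routine may break ties differently.

```python
import numpy as np

def sample_top_k(logits, k, rng):
    """Sample one token index from the top-k truncated softmax."""
    logits = np.asarray(logits, dtype=float)
    keep = np.argsort(-logits, kind="stable")[:k]   # stable: lowest index wins ties
    masked = np.full_like(logits, -np.inf)
    masked[keep] = logits[keep]
    exp = np.exp(masked - masked.max())
    return int(rng.choice(len(logits), p=exp / exp.sum()))

rng = np.random.default_rng(0)
logits = rng.normal(size=64)                        # hypothetical logits, 64 tokens
draws = {sample_top_k(logits, 1, rng) for _ in range(100)}
assert draws == {int(np.argmax(logits))}            # every draw is the argmax
```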
A pitfall: top-p with peaked distributions¶
If the model is very confident (\(q_{i_1} = 0.99\)), then \(K^* = 1\) even for \(p = 0.95\). Top-p collapses to greedy whenever the top token's probability is at least \(p\).
This is correct behaviour. The trade-off is: top-p is "diverse-when-uncertain, greedy-when-confident." Top-k is "always-k-options," which can be wasteful or wrong depending on the situation.
For our trained Mini-GPT, on prompts like "Tomorrow she", the model's distribution over the next token is often peaked toward will or is. So top-p=0.9 sampling might always pick will on that prompt — and you'd see no diversity. That's the model being confident, not the sampler being broken.
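The effect is easy to reproduce with synthetic numbers. In the sketch below both distributions are made up (they are not Mini-GPT outputs) and `sample_top_p` is a hypothetical helper. With a 0.99-confident distribution every one of 1000 draws returns the same token, while a flatter distribution yields several distinct tokens under the same \(p\).

```python
import numpy as np

def sample_top_p(probs, p, rng):
    """Sample one token index from the top-p nucleus of probs."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")
    k_star = min(int(np.searchsorted(np.cumsum(probs[order]), p)) + 1, len(probs))
    nucleus = np.zeros_like(probs)
    nucleus[order[:k_star]] = probs[order[:k_star]]
    return int(rng.choice(len(probs), p=nucleus / nucleus.sum()))

rng = np.random.default_rng(0)
confident = np.array([0.99, 0.004, 0.003, 0.002, 0.001])  # made-up peaked distribution
hedged = np.array([0.50, 0.30, 0.10, 0.06, 0.04])         # made-up flatter distribution

print({sample_top_p(confident, 0.95, rng) for _ in range(1000)})  # {0}
print(sorted({sample_top_p(hedged, 0.95, rng) for _ in range(1000)}))
```

If you see a single repeated token in your own samples, inspecting the top probability against \(p\) is a quick way to tell model confidence apart from a sampler bug.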
What this file does NOT cover¶
- Typical sampling (Meister et al. 2023). A more sophisticated truncation. Out of scope.
- Min-p sampling (Nguyen et al. 2024). Yet another variant. Out of scope.
- Beam search. Heuristic search over \(k\) beams; better for translation, not for open generation.
Next: 03-cost-model.md