English · Español

04 — Top-p, by hand, on a 10-token vocab¶

🇪🇸 Top-p (nucleus sampling) suena exótico. No lo es: es "ordena por probabilidad descendente, suma hasta p, descarta el resto, renormaliza". Lo derivamos a mano sobre 10 tokens, mostramos el corte exacto, y vemos por qué p=0.95 es el sweet spot práctico para el tutor de gramática §A13.

This page is the numerical companion to theory/02-top-k-and-top-p.md. We work through a complete top-p example on a synthetic 10-token logit vector so the cut-off arithmetic is fully visible.

Setup¶

The mini-GPT at some step produces these raw logits for 10 candidate next-tokens (a synthetic but realistic distribution after a moderate prompt like She has wri___):

Index	Token	Logit
0	`tten`	4.20
1	`tes`	1.80
2	`ting`	1.50
3	`te`	0.60
4	`t`	0.30
5	`tt`	-0.50
6	`s`	-1.20
7	`nt`	-2.10
8	`den`	-3.40
9	`er`	-4.00

(In a real run, the vocab is ~512 tokens; we're focusing on the top 10 for visibility.)

Step 1 — Softmax¶

Subtract the max for numerical stability (\(4.20\)):

Index	Logit − max	exp(·)
0	0.00	1.0000
1	-2.40	0.0907
2	-2.70	0.0672
3	-3.60	0.0273
4	-3.90	0.0202
5	-4.70	0.0091
6	-5.40	0.0045
7	-6.30	0.0018
8	-7.60	0.0005
9	-8.20	0.0003

Sum: \(Z = 1.0000 + 0.0907 + 0.0672 + 0.0273 + 0.0202 + 0.0091 + 0.0045 + 0.0018 + 0.0005 + 0.0003 = 1.2216\)

Normalize \(p_i = \exp(\cdot)_i / Z\):

Index	Token	\(p_i\)
0	`tten`	0.8186
1	`tes`	0.0742
2	`ting`	0.0550
3	`te`	0.0223
4	`t`	0.0165
5	`tt`	0.0075
6	`s`	0.0037
7	`nt`	0.0015
8	`den`	0.0004
9	`er`	0.0003

Sum check: \(\approx 1.0000\) ✓

Step 2 — Sort descending (already sorted in our example)¶

Step 3 — Cumulative sum¶

Rank	Token	\(p\)	Cumulative
1	`tten`	0.8186	0.8186
2	`tes`	0.0742	0.8928
3	`ting`	0.0550	0.9478
4	`te`	0.0223	0.9701
5	`t`	0.0165	0.9866
6	`tt`	0.0075	0.9941
7	`s`	0.0037	0.9978
8	`nt`	0.0015	0.9993
9	`den`	0.0004	0.9997
10	`er`	0.0003	1.0000

Step 4 — Apply the top-p cutoff¶

The rule: keep the smallest set of top-ranked tokens whose cumulative probability is at least \(p\). We then renormalize over the kept set.

For different values of \(p\):

\(p = 1.0\) (no truncation)¶

Cumulative reaches 1.0 only at rank 10. All 10 tokens kept. No truncation — equivalent to plain sampling from the full softmax. Diversity is maximal; tail tokens can sneak in.

\(p = 0.95\) (the §A13 recommended default)¶

Cumulative crosses 0.95 between rank 3 (0.9478) and rank 4 (0.9701). The smallest set with cumulative \(\geq 0.95\) is {rank 1, 2, 3} with cumulative 0.9478 — wait. 0.9478 < 0.95. So we need rank 4 too. The set is {rank 1, 2, 3, 4} with cumulative 0.9701.

Renormalize over the 4 kept tokens:

Token	\(p\)	\(p / 0.9701\)
`tten`	0.8186	0.8438
`tes`	0.0742	0.0765
`ting`	0.0550	0.0567
`te`	0.0223	0.0230

The 6 tail tokens (t, tt, s, nt, den, er) are zeroed out. The probability mass that was in those 6 is redistributed proportionally to the top 4.

\(p = 0.5\) (aggressive truncation)¶

Cumulative crosses 0.5 already at rank 1 (0.8186). So we keep only the single top token tten. After renormalization, \(p_\text{tten} = 1.0\) — this is identical to greedy decoding for this prompt.

For prompts where the distribution is flatter (e.g., open-ended generation), \(p = 0.5\) might keep 3-5 tokens; for sharply-peaked distributions like this one (the model is very confident in tten), it collapses to greedy.

Cutoff math, formal¶

The kept set is:

\[S_p = \{i_1, i_2, ..., i_k\} \quad \text{where} \quad \sum_{j=1}^{k-1} p_{i_j} < p \leq \sum_{j=1}^{k} p_{i_j}\]

Then \(\tilde p_i = p_i / \sum_{j \in S_p} p_j\) for \(i \in S_p\), else \(\tilde p_i = 0\). Sample from \(\tilde p\).

Why this matters for the §A13 grammar tutor¶

The Phase 32 capstone is a grammar tutor that proposes corrections. Tuning generation:

p = 1.0 is bad. It lets the tail (low-probability conjugations like er from this distribution) sneak into outputs. The tutor would occasionally suggest wildly wrong forms.
p = 0.5 is bad. It collapses to greedy on confident distributions, removing the tutor's ability to consider multiple plausible completions (useful for sentences with truly ambiguous conjugations, e.g., when both regular and irregular past forms exist for a verb).
p = 0.95 is the sweet spot. It cuts the long-tail garbage but preserves the top-\(k\) legitimate candidates where \(k\) varies adaptively with the entropy of the distribution. For confident steps, \(k = 1\) or \(2\); for ambiguous steps, \(k\) can be larger.

The lab in this phase (02-top-k-and-top-p.md) explores the sweep; the break in this phase (break/00-...) shows what p = 1.0 and p = 0.5 do wrong on real prompts.

Comparison with top-k¶

Top-k = "keep the top \(k\) tokens regardless of probability mass". Issue: \(k = 50\) on a sharp distribution wastes 49 slots; \(k = 5\) on a flat distribution drops too much mass. Top-p is adaptive — it adjusts the kept-set size based on the distribution's actual entropy.

Both can be combined: "top-50 then top-p 0.95". This is the OpenAI default for older models. At §A13 scale (vocab 512), combining is unnecessary — top-p alone is fine.

Numerical implementation gotcha¶

Computing cumulative probabilities in fp16 is risky for long-tail distributions: the tail tokens have \(p \sim 10^{-5}\), which is at the edge of fp16 precision. Always do top-p in fp32 or higher, even if the model runs in fp16. The 1-2 ms overhead is irrelevant; the correctness gain is real.

Citation¶

Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020. https://arxiv.org/abs/1904.09751 — introduces nucleus (top-p) sampling, motivates it from the "long tail of unnatural completions" problem. Sections 3.1 (the math) and 4 (empirical comparison with top-k and temperature) are the source for the framing here.

One-paragraph recap¶

Top-p sampling: softmax → sort descending → cumulate → find the smallest prefix with cumulative \(\geq p\) → renormalize over that prefix → sample. On a synthetic 10-token distribution from a "She has wri___" continuation, \(p = 0.95\) keeps the top 4 tokens (cumulative 0.9701, the smallest prefix exceeding 0.95). \(p = 1.0\) keeps all 10, including garbage. \(p = 0.5\) collapses to greedy on this confident distribution. For the §A13 grammar tutor, \(p = 0.95\) is the recommended default — adaptive to per-step entropy without admitting tail tokens.

Cross-refs: theory/02-top-k-and-top-p.md (formulas in full), lab/02-top-k-and-top-p.md (sweep), Phase 32 — the grammar tutor uses these settings.