
Break: top-p with p=1.0 vs p=0.95 vs p=0.5 on the §A13 grammar tutor

This is not a single break: it is three configurations, each "broken" in a different way. The exercise is: same seed, same model, same prompts, three values of p. Diversity collapses or overflows depending on p, and we observe the boundary between useful creativity and useless noise.

Symptom Borja will see

Three generation runs with identical seed, model, and prompts (10 prompts from data/eval/probes.jsonl, each asking for a continuation). Only top_p differs:

Run A: top_p = 1.0 (no truncation).
Run B: top_p = 0.95 (the recommended setting).
Run C: top_p = 0.5 (aggressive truncation).

For prompt She has wri___ (canonical answer: written):

Run	Sample 1	Sample 2	Sample 3
A (p=1.0)	`written`	`writes`	`werder` (!)
B (p=0.95)	`written`	`written`	`writes`
C (p=0.5)	`written`	`written`	`written`

For prompt I want to ___ tomorrow (canonical: any infinitive verb from the set; multiple valid):

Run	Sample 1	Sample 2	Sample 3
A (p=1.0)	`play`	`study`	`qzz` (!)
B (p=0.95)	`play`	`work`	`study`
C (p=0.5)	`play`	`play`	`play`

The pattern: A admits garbage tokens; C admits no diversity; B threads the needle.

Then evaluate CCR (conjugation-correctness rate) over all 60 val probes:

Run	CCR	Diversity (unique outputs / prompt)
A	76%	8.4 / 10
B	88%	3.7 / 10
C	91%	1.0 / 10

A has high diversity, but its garbage outputs drag CCR down. C collapses completely to greedy behavior (1.0 unique output per prompt). B is the sweet spot: its CCR is slightly lower than C's, but the diversity unlocks the useful "the tutor can propose multiple plausible corrections" behavior that the grammar-tutor agent (Phase 32) needs.
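The two columns in the table above can be computed along the following lines. This is only a sketch: the `(prompt, output)` pair layout, the function names, and the `accepted` grader are assumptions made for illustration, not the actual `just phase-21-score` implementation. The toy data are the three samples per prompt shown in the tables for Runs A and C, so the values it yields describe those six samples only, not the 60-probe figures.

```python
from collections import defaultdict

def diversity_per_prompt(samples):
    """Mean number of distinct outputs per prompt.

    `samples` is a list of (prompt, output) pairs (assumed layout).
    """
    outputs = defaultdict(set)
    for prompt, output in samples:
        outputs[prompt].add(output)
    return sum(len(s) for s in outputs.values()) / len(outputs)

def ccr(samples, is_correct):
    """Fraction of samples that the grader `is_correct(prompt, output)` accepts."""
    hits = sum(1 for prompt, output in samples if is_correct(prompt, output))
    return hits / len(samples)

# Hypothetical grader: a fixed set of accepted answers per prompt.
accepted = {
    "She has wri___": {"written"},
    "I want to ___ tomorrow": {"play", "study", "work"},
}

def is_correct(prompt, output):
    return output in accepted[prompt]

# Toy samples copied from the Run A and Run C rows of the tables.
run_a = [
    ("She has wri___", "written"),
    ("She has wri___", "writes"),
    ("She has wri___", "werder"),
    ("I want to ___ tomorrow", "play"),
    ("I want to ___ tomorrow", "study"),
    ("I want to ___ tomorrow", "qzz"),
]
run_c = [
    ("She has wri___", "written"),
    ("She has wri___", "written"),
    ("She has wri___", "written"),
    ("I want to ___ tomorrow", "play"),
    ("I want to ___ tomorrow", "play"),
    ("I want to ___ tomorrow", "play"),
]

for name, run in [("A", run_a), ("C", run_c)]:
    print(name, ccr(run, is_correct), diversity_per_prompt(run))
```

Keeping the grader as a separate callable means the same two functions can be reused if the real scorer judges conjugation correctness differently.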

The break, mechanically

In experiments/21-break-top-p/config.yaml:

runs:
  - name: A
    top_p: 1.0
  - name: B
    top_p: 0.95
  - name: C
    top_p: 0.5

Or directly in code, in src/miniinfer/sample.py:

def top_p_filter(logits, p):
    if p >= 1.0:
        return logits   # no filter
    # ... normal top-p ...

The "break" is configuration, not code. Each p value is broken in a different way:

\(p = 1.0\) is broken because it does no filtering.
\(p = 0.5\) is broken because it filters too aggressively.
\(p = 0.95\) is correct.
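For readers who want to see what the elided `# ... normal top-p ...` step might involve, here is a minimal NumPy sketch. It is not the body of `src/miniinfer/sample.py`, which the excerpt does not show. The use of NumPy, the `-inf` masking, and the convention of keeping the token that crosses the threshold are all assumptions.

```python
import numpy as np

def top_p_filter(logits, p):
    """Mask every logit outside the smallest top set whose probability mass reaches p."""
    logits = np.asarray(logits, dtype=np.float64)
    if p >= 1.0:
        return logits  # no filter, same early exit as in the excerpt above
    # Softmax, shifted by the max logit for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Index of the first token whose cumulative mass is >= p; keep it as well.
    cutoff = int(np.searchsorted(cumulative, p, side="left")) + 1
    keep = order[:cutoff]
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return filtered

# Usage: four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
example_logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
kept_at_07 = np.isfinite(top_p_filter(example_logits, 0.7))
print(kept_at_07)
```

Masked logits become `-inf`, so a softmax applied afterwards assigns them zero probability and renormalizes the survivors.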

Why this teaches the concept

The grammar tutor (Phase 32) is not just a "predict the next token" model. It must:

1. Identify whether a sentence has a grammar error.
2. If yes, propose a correction.
3. Be able to propose multiple plausible alternatives when the correction is ambiguous (e.g., "I work yesterday" could be corrected to "I worked yesterday" or "I was working yesterday").

For (3), the tutor needs diversity. Greedy (or \(p = 0.5\) on a confident distribution) gives only one suggestion, and the tutor cannot offer alternatives. But for (1) and (2), the tutor needs correctness — the suggestion must be a legitimate English form, not a random BPE token.

The §A13 distribution makes this concrete. For She has wri___:

The top tokens by probability are: tten, tes, ting, te, t, tt, ...
tten completes to written (correct past participle).
tes completes to writes (a valid third-person singular present form, but wrong for this prompt, because has requires a past participle).
ting completes to writing (present participle — wrong here, but valid English).
te and below are mostly noise.

\(p = 0.95\) keeps the top 4 (cumulative 0.9701). The tutor can propose written (correct), writes (close but wrong), writing (close but wrong) — and the learner learns from seeing the alternatives.

\(p = 1.0\) admits werder (a garbage BPE chunk) into the candidate set. The tutor sometimes suggests it. Now the user sees gibberish and loses trust.

\(p = 0.5\) keeps only tten. The tutor proposes only written. Correct, but pedagogically poorer.
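The three cases above can be reproduced on a toy distribution. The individual probabilities below are hypothetical: they were chosen only so that tten alone exceeds 0.5 and the top 4 sum to the 0.9701 quoted above. The real values live in the §A13 distribution and are not given here. The `<rest>` entry is a stand-in that lumps the remaining tail, including garbage chunks, into one bucket.

```python
import numpy as np

# Hypothetical probabilities for the continuation of "She has wri___".
tokens = ["tten", "tes", "ting", "te", "t", "tt", "<rest>"]
probs = np.array([0.62, 0.20, 0.11, 0.0401, 0.015, 0.008, 0.0069])

def nucleus(tokens, probs, p):
    """Return the tokens that survive top-p truncation, most probable first."""
    if p >= 1.0:
        return list(tokens)  # no truncation: the whole vocabulary survives
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the first one that brings the mass to >= p.
    cutoff = int(np.searchsorted(cumulative, p, side="left")) + 1
    return [tokens[i] for i in order[:cutoff]]

for p in (0.5, 0.95, 1.0):
    print(p, nucleus(tokens, probs, p))
```

With these numbers, p = 0.5 keeps one token, p = 0.95 keeps four, and p = 1.0 keeps everything, including the tail bucket.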

Diagnostic ladder Borja should walk

First check: the diversity column. Run A is 8.4, run C is 1.0. Diversity should be a non-decreasing function of \(p\).
Second check: sample-level inspection. Read Run A's third samples. The garbage tokens (werder, qzz) are obvious. They're tail-token completions that \(p = 1.0\) admitted.
Third check: CCR. Run A's CCR is 76%, lower than the model's CCR under greedy decoding (~91%). The garbage outputs are dragging it down.
Diagnosis: \(p = 1.0\) admits tail tokens that the model assigns small probability to (so the model "knows they're unlikely") but that the sampler picks once in a while. The fix is to cut the tail.
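The first check and the diagnosis can both be expressed in a few lines. This is a sketch with hypothetical helper names. The tail mass of 0.0299 is simply 1 - 0.9701, the mass outside the top 4 tokens for She has wri___, and the sample count of 10 is an illustrative assumption.

```python
def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

def prob_at_least_one_tail_sample(tail_mass, n_samples):
    """Chance that n independent draws include at least one token from the tail."""
    return 1.0 - (1.0 - tail_mass) ** n_samples

# Diversity from the results table, ordered by increasing p: C (0.5), B (0.95), A (1.0).
diversity_by_p = [1.0, 3.7, 8.4]
print(is_non_decreasing(diversity_by_p))

# Even a small tail mass shows up once the sampler draws repeatedly.
print(prob_at_least_one_tail_sample(0.0299, 10))
```

The second function is why "the model knows they're unlikely" does not protect the output: a small per-draw probability compounds across samples and prompts.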

Reproducer

just phase-21-generate top_p=1.0  > experiments/21-A/
just phase-21-generate top_p=0.95 > experiments/21-B/
just phase-21-generate top_p=0.5  > experiments/21-C/

# Score
just phase-21-score experiments/21-A experiments/21-B experiments/21-C

Hint cascade

(Mild) "The three runs produce very different diversity numbers. What's the relationship between p and diversity?"
(Medium) "Run A occasionally produces tokens that look like garbage. Why might that happen even though the model assigns low probability to those tokens?"
(Direct) "Top-p with p = 1.0 does no filtering. The tail of the distribution gets sampled occasionally. What's the fix?"

Fix

Use \(p = 0.95\) for generation in the §A13 grammar tutor. Confirm that CCR climbs back to ~88% and diversity settles at about 3.7 unique outputs per prompt (the productive middle).

For tasks where diversity should be lower (e.g., a confidence-bounded auto-complete), use \(p = 0.7\) or even \(p = 0.5\). For tasks where diversity should be higher (e.g., creative writing, story continuation), use \(p = 0.97\). The §A13 grammar tutor is in the middle.

What this break is NOT

Not a model bug — the model's softmax is identical across runs.
Not a tokenizer bug: the tokenizer and vocabulary are identical across runs.
Not a probability-math bug — top-p code is correct in all runs.

It's a configuration bug that demonstrates the sampling hyperparameter's behavioral effect. The pedagogical move: change one knob across three runs and let the dashboard show you what the knob does. Phase 21 is the cleanest place in the curriculum to do this — the model is fixed, only the sampler changes.

Cross-refs

theory/04-top-p-worked-example.md — the math.
theory/02-top-k-and-top-p.md — top-k variant.
Phase 32 — production usage of these settings.