Lab 03 — Diversity vs accuracy¶
The central trade-off of sampling: raising τ increases diversity but degrades correctness. Measure both on the §A13 conjugation corpus and plot the curve. The "L" or "knee" shape is the signature of the trade-off.
Objective¶
For a fixed set of prompts from the §A13 verb corpus, measure how output diversity and grammatical correctness trade off as the sampler parameter varies. Produce the "diversity vs correctness" curve that motivates picking a sampler in production.
Setup¶
- All four samplers from labs 00-02: Greedy, Temperature, TopK, TopP.
- The §A13 ground-truth conjugation table: a dict mapping (verb, tense, person) → expected English form. (You built this in Phase 12. If it's not yet a programmatic resource, write it as data/verb_corpus_truth.json for this lab.)
- A list of 10 test prompts, one per (tense × person) combination. Examples:
  - "Yesterday I" → expected: any past-tense form of a verb in our 20-verb set.
  - "Tomorrow she" → expected: any future-tense (will / going to) form, 3rd-person singular.
  - "He" → expected: any 3rd-person singular present form (with -s).
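If the truth table still needs to be written to data/verb_corpus_truth.json, one possible layout is sketched below. JSON object keys must be strings, so the sketch flattens the (verb, tense, person) tuple into a "verb|tense|person" key. The key format, the tense and person labels ("past", "3sg", and so on), the helper name parse_truth, and the sample entries are all assumptions made for illustration, not something the §A13 corpus prescribes.

```python
import json

def parse_truth(raw: str) -> dict[tuple[str, str, str], str]:
    """Parse a JSON object keyed by "verb|tense|person" into a tuple-keyed dict.

    The "|"-joined key format is an assumption of this sketch.
    """
    table: dict[tuple[str, str, str], str] = {}
    for key, form in json.loads(raw).items():
        verb, tense, person = key.split("|")
        table[(verb, tense, person)] = form
    return table

# Hypothetical sample entries, written inline so the sketch runs without a file.
SAMPLE = """
{
  "work|past|1sg": "worked",
  "work|present|3sg": "works",
  "work|future|3sg": "will work"
}
"""

truth = parse_truth(SAMPLE)
print(truth[("work", "present", "3sg")])
```

In the lab itself the string would come from the file, for example `parse_truth(Path("data/verb_corpus_truth.json").read_text())` with `Path` imported from `pathlib`.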
Tasks¶
- Define the grammaticality scorer.
def is_grammatical(prompt: str, completion: str, truth: dict) -> bool:
    """Returns True iff (prompt + completion) matches a valid §A13 conjugation."""
    # ...
The check: split (prompt + completion) into (subject, verb-form); verify the verb-form's tense matches the prompt's expected tense and the person matches. Use the truth table.
For prompts that carry no explicit tense cue (e.g., bare "He"), treat present tense as the expected tense: accept any valid present-tense 3rd-person singular form.
- Define the diversity metric.
def diversity(completions: list[str]) -> float:
    """Fraction of unique completions: |unique| / |total|."""
    if not completions:
        return 0.0  # avoid dividing by zero on an empty sample
    return len(set(completions)) / len(completions)
- Sweep Temperature(τ). For each τ ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0}:
  - For each of the 10 prompts:
    - Generate 50 completions, one per seed.
    - Compute the diversity (per prompt) and the grammaticality rate (per prompt).
  - Average across prompts → (diversity[τ], grammaticality[τ]).
  - Plot a single point in (diversity, grammaticality) space.
- Sweep TopP(p) at fixed τ = 1.0. For each p ∈ {0.5, 0.7, 0.9, 0.95, 0.99, 1.0}:
  - Same procedure as task 3 (the Temperature sweep).
  - Plot.
- The composition. Try Temperature(τ) ∘ TopP(p) for a small grid (e.g., τ ∈ {0.7, 1.0}, p ∈ {0.9, 0.95}). This combination is a common recipe in production samplers.
- Find the knee. Plot all the points on a single set of (diversity, grammaticality) axes. The "knee of the curve", where grammaticality starts to drop sharply, marks the sampler parameter you'd ship. For the Mini-GPT trained on the §A13 corpus, the knee is expected somewhere around τ ≈ 1.0 to 1.5 with p = 0.9. Annotate where you see it.
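The sweep procedure in the tasks above (generate per seed, score per prompt, average across prompts) can be sketched as a small harness. This is a minimal sketch under stated assumptions: `sweep_point`, `make_toy_sampler`, and `toy_ok` are hypothetical names, and the toy sampler is only a stand-in that gets noisier as τ grows so the harness can run without the Mini-GPT. In the lab, the real sampler and the truth-table scorer would be plugged in through the same two callables.

```python
import random
from typing import Callable

Generate = Callable[[str, int], str]  # (prompt, seed) -> completion
Scorer = Callable[[str, str], bool]   # (prompt, completion) -> grammatical?

def diversity(completions: list[str]) -> float:
    """Fraction of unique completions, as defined in task 2."""
    return len(set(completions)) / len(completions)

def sweep_point(generate: Generate, is_ok: Scorer,
                prompts: list[str], n_seeds: int = 50) -> tuple[float, float]:
    """Return (mean diversity, mean grammaticality rate) for one sampler config."""
    per_prompt_div, per_prompt_gram = [], []
    for prompt in prompts:
        outs = [generate(prompt, seed) for seed in range(n_seeds)]  # one per seed
        per_prompt_div.append(diversity(outs))
        per_prompt_gram.append(sum(is_ok(prompt, o) for o in outs) / n_seeds)
    n = len(prompts)
    return sum(per_prompt_div) / n, sum(per_prompt_gram) / n

def make_toy_sampler(tau: float) -> Generate:
    """Hypothetical stand-in for Temperature(tau): emits noise more often as tau grows."""
    vocab = ["works", "worked", "work", "working", "will work"]

    def generate(prompt: str, seed: int) -> str:
        rng = random.Random(f"{prompt}|{seed}")  # deterministic per (prompt, seed)
        if rng.random() < min(1.0, tau / 4.0):
            return rng.choice(vocab)
        return "works"

    return generate

def toy_ok(prompt: str, completion: str) -> bool:
    """Toy scorer: only the 3rd-person singular present form counts as correct."""
    return completion == "works"

TAUS = [0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0]
curve = {tau: sweep_point(make_toy_sampler(tau), toy_ok, ["He", "She"]) for tau in TAUS}
for tau, (d, g) in curve.items():
    print(f"tau={tau:<4} diversity={d:.2f} grammaticality={g:.2f}")
```

Keeping the sampler and scorer as plain callables means the same `sweep_point` can serve the TopP sweep and the composed grid without changes.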
Measurements¶
Save to experiments/<date>-phase-21-diversity/:
- diversity_vs_accuracy.png: the scatter plot, with each sampler config as a labeled point.
- temperature_curve.png: the slice along Temperature(τ).
- topp_curve.png: the slice along TopP(p).
- completions_sample.json: for each sampler config, save 5 example completions per prompt (for spot-checking).
- manifest.json: seeds, prompts, model checkpoint hash, exact sweep grid.
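A manifest with the four items listed above might be assembled as follows. This is a sketch, not a required schema: the field names, the choice of SHA-256 for the checkpoint hash, and the helpers `sha256_of` and `build_manifest` are assumptions.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(checkpoint: Path, prompts: list[str],
                   seeds: list[int], sweep_grid: dict) -> dict:
    """Collect everything needed to rerun the sweep (field names are hypothetical)."""
    return {
        "checkpoint_sha256": sha256_of(checkpoint),
        "prompts": list(prompts),
        "seeds": list(seeds),
        "sweep_grid": sweep_grid,
    }

def write_manifest(manifest: dict, out_dir: Path) -> Path:
    """Write manifest.json into out_dir, creating the directory if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "manifest.json"
    # ensure_ascii=False keeps characters such as "τ" readable in the file.
    target.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    return target
```

Recording the exact grid (for example `{"temperature": [0.1, 0.3, ...], "top_p": [0.5, 0.7, ...]}`) alongside the seeds is what makes a later rerun comparable point by point.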
Acceptance¶
- The Temperature sweep produces a monotonically non-increasing grammaticality curve as τ grows (cooler = more correct).
- The Temperature sweep produces a monotonically non-decreasing diversity curve as τ grows.
- Greedy sits at the corner (diversity ≈ 0, grammaticality ≈ max) of the scatter plot. (If it doesn't, your greedy decode is broken; go back to lab 00.)
- Temperature(τ=2.0) is visibly worse on grammaticality than Temperature(τ=0.7). If it isn't, the model is so confident that temperature barely matters; investigate.
- The PR / report includes a sentence picking one sampler config as the "production default" for the Phase 32 tutor, with a justification grounded in the plot.
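The first two criteria and the τ=2.0 vs τ=0.7 comparison are mechanical enough to check in code. The sketch below assumes the sweep results are stored as a dict mapping τ to a (diversity, grammaticality) pair. That layout, the function names, and the `tol` slack for sampling noise are assumptions of the sketch, not part of the acceptance criteria.

```python
def is_monotone(values: list[float], increasing: bool, tol: float = 0.0) -> bool:
    """True if consecutive values never move the wrong way by more than tol."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(b >= a - tol for a, b in pairs)
    return all(b <= a + tol for a, b in pairs)

def check_temperature_sweep(curve: dict[float, tuple[float, float]],
                            tol: float = 0.0) -> list[str]:
    """Return a description of each failed criterion (empty list means all pass).

    curve maps tau -> (diversity, grammaticality).
    """
    taus = sorted(curve)
    div = [curve[t][0] for t in taus]
    gram = [curve[t][1] for t in taus]
    failures = []
    if not is_monotone(gram, increasing=False, tol=tol):
        failures.append("grammaticality is not non-increasing in tau")
    if not is_monotone(div, increasing=True, tol=tol):
        failures.append("diversity is not non-decreasing in tau")
    if 0.7 in curve and 2.0 in curve and not curve[2.0][1] < curve[0.7][1]:
        failures.append("tau=2.0 is not worse on grammaticality than tau=0.7")
    return failures
```

A small non-zero `tol` may be reasonable when 50 samples per prompt leave visible noise, but a failure that only passes with a large tolerance is worth investigating rather than hiding.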
Pitfalls¶
- Diversity over-counts trivial variation. "He works." vs "He works ." (extra space) are not really different. Normalise whitespace before computing the unique set.
- Grammaticality scoring is hard. The §A13 corpus is small enough that you can enumerate all valid forms — use the truth table, not heuristics. Heuristics will silently accept "He worked yesterday" (which is grammatical English but doesn't match the prompt-tense expectation).
- Sample size. 50 completions per (prompt, sampler) is barely enough: the diversity estimate has high variance for low τ. If a result looks noisy, increase to 100 or 200.
- Reading the curve wrong. Along a single sweep you cannot gain diversity and grammaticality at the same time; moving along the curve trades one for the other, and the knee is the trade-off point. When comparing sampler configs, a point closer to the upper-right corner is better, but the corner itself is unachievable for a model with finite capacity.
- Don't overfit the sampler to this lab. The trade-off curve is prompt-dependent. Pick the sampler that wins broadly, not just on "Tomorrow she".
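The first two pitfalls can be sketched together: normalise whitespace before counting unique completions, and score against enumerated truth-table forms rather than heuristics. Everything here is an assumption made for illustration: the tense and person labels, the `PROMPT_EXPECTATION` mapping (built from the three example prompts in Setup), the truth-table key layout, and the simplification that a completion begins with the verb form.

```python
import re

def normalise(text: str) -> str:
    """Collapse whitespace runs and remove spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

def diversity_normalised(completions: list[str]) -> float:
    """Unique fraction after normalisation, so "He works ." equals "He works."."""
    if not completions:
        return 0.0
    return len({normalise(c) for c in completions}) / len(completions)

# Hypothetical mapping from prompt to the (tense, person) it expects.
PROMPT_EXPECTATION = {
    "Yesterday I": ("past", "1sg"),
    "Tomorrow she": ("future", "3sg"),
    "He": ("present", "3sg"),
}

def valid_forms(truth: dict, tense: str, person: str) -> set[str]:
    """Enumerate every truth-table form for the given tense and person."""
    return {form for (_verb, t, p), form in truth.items() if t == tense and p == person}

def is_grammatical(prompt: str, completion: str, truth: dict) -> bool:
    """Truth-table check: the completion must begin with an enumerated valid form."""
    tense, person = PROMPT_EXPECTATION[prompt]
    text = normalise(completion).rstrip(".!?").lower()
    return any(text == form or text.startswith(form + " ")
               for form in valid_forms(truth, tense, person))
```

With a table that maps ("work", "present", "3sg") to "works", the prompt "He" accepts " works." but rejects " worked yesterday", which is the case a loose heuristic would let through.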
Stretch¶
- Plot the same curve for nucleus + temperature combined. Does the composition shift the knee?
- Per-tense breakdown. Is the trade-off worse for past tense (more irregular forms) than for present tense (mostly regular -s)?
Next: Write the PHASE_21_REPORT.md tying these results together. Then on to Phase 22 (KV cache).