
Lab 03: Diversity vs accuracy

The central trade-off of sampling: raising τ increases diversity but degrades correctness. Measure both on the §A13 conjugation corpus and plot the curve. The "L" or "knee" shape is the signature of the trade-off.

Objective

For a fixed set of prompts from the §A13 verb corpus, measure how output diversity and grammatical correctness trade off as the sampler parameter varies. Produce the "diversity vs correctness" curve that motivates picking a sampler in production.

Setup

All four samplers from labs 00-02: Greedy, Temperature, TopK, TopP.
The §A13 ground-truth conjugation table: a dict mapping (verb, tense, person) → expected English form. (You built this in Phase 12. If it's not yet a programmatic resource, write it as data/verb_corpus_truth.json for this lab.)
A list of 10 test prompts, one per (tense × person) combination. Examples:
"Yesterday I" → expected: any past-tense form of a verb in our 20-verb set.
"Tomorrow she" → expected: any future-tense (will / going to) form, 3rd-person singular.
"He" → expected: any 3rd-person singular present form (with -s).

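JSON object keys must be strings, so a dict keyed by (verb, tense, person) tuples cannot be dumped to data/verb_corpus_truth.json directly. The following is a minimal sketch of one way to round-trip it. The "|" separator, the function names, the tense/person labels, and the three example entries are all assumptions for illustration, not part of the §A13 corpus.

```python
import json
from pathlib import Path

# Hypothetical key encoding: each (verb, tense, person) tuple is joined
# with "|" on save and split again on load.
SEP = "|"

def save_truth(truth: dict[tuple[str, str, str], str], path: Path) -> None:
    """Write the tuple-keyed truth table as a JSON object with string keys."""
    encoded = {SEP.join(key): form for key, form in truth.items()}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(encoded, indent=2, sort_keys=True), encoding="utf-8")

def load_truth(path: Path) -> dict[tuple[str, str, str], str]:
    """Read the JSON file back into a dict keyed by (verb, tense, person)."""
    encoded = json.loads(path.read_text(encoding="utf-8"))
    return {tuple(key.split(SEP)): form for key, form in encoded.items()}

# Illustrative entries only; the real table comes from §A13.
example_truth = {
    ("work", "present", "3sg"): "works",
    ("work", "past", "1sg"): "worked",
    ("work", "future", "3sg"): "will work",
}
```

This encoding only works if no verb, tense, or person label contains the separator character.
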
Tasks

1. Define the grammaticality scorer.

def is_grammatical(prompt: str, completion: str, truth: dict) -> bool:
    """Returns True iff (prompt + completion) matches a valid §A13 conjugation."""
    # ...

The check: split (prompt + completion) into (subject, verb-form); verify the verb-form's tense matches the prompt's expected tense and the person matches. Use the truth table.

For underspecified prompts that carry no tense cue (e.g., bare "He"), accept any valid present-tense 3rd-person singular form.
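One possible shape for the scorer is sketched below. It assumes a hypothetical PROMPT_EXPECTATIONS table that maps each test prompt to the (tense, person) slot it expects, and it assumes the truth table uses labels such as "past" and "3sg"; neither is specified by the lab, so adapt both to your §A13 data. The sketch accepts a completion if its leading words equal any truth-table form for the expected slot.

```python
# Hypothetical mapping from prompt to the (tense, person) it expects.
# The labels are assumptions; use whatever your truth table uses.
PROMPT_EXPECTATIONS = {
    "Yesterday I": ("past", "1sg"),
    "Tomorrow she": ("future", "3sg"),
    "He": ("present", "3sg"),
}

def is_grammatical(prompt: str, completion: str, truth: dict) -> bool:
    """True iff the completion starts with a truth-table form for the prompt's slot."""
    expected = PROMPT_EXPECTATIONS.get(prompt.strip())
    if expected is None:
        return False  # unknown prompt: refuse to guess
    tense, person = expected
    # Every form the truth table lists for this (tense, person), across all verbs.
    valid_forms = {
        form.lower()
        for (_verb, t, p), form in truth.items()
        if t == tense and p == person
    }
    # Normalise: lowercase, treat full stops as spaces, split on whitespace.
    words = completion.lower().replace(".", " ").split()
    # Forms may span several words ("will work"), so compare leading n-grams.
    return any(
        " ".join(words[: len(form.split())]) == form
        for form in valid_forms
    )
```

Because the lookup goes through the truth table only, "He worked yesterday" is rejected for the bare "He" prompt even though it is grammatical English. A table that stores one form per key cannot hold both the "will" and "going to" futures, so that case would need a list-valued table or extra keys.
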

2. Define the diversity metric.

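def diversity(completions: list[str]) -> float:
    """Fraction of unique completions: |unique| / |total| (0.0 for an empty list)."""
    if not completions:
        return 0.0  # avoid ZeroDivisionError when nothing was generated
    return len(set(completions)) / len(completions)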

3. Sweep Temperature(τ). For each τ ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0}:
   For each of the 10 prompts:
   - Generate 50 completions, one per seed.
   - Compute the diversity (per prompt) and the grammaticality rate (per prompt).
   Average across prompts → (diversity[τ], grammaticality[τ]).
   Plot a single point in (diversity, grammaticality) space.
4. Sweep TopP(p) at fixed τ = 1.0. For each p ∈ {0.5, 0.7, 0.9, 0.95, 0.99, 1.0}:
   Same procedure as task 3.
   Plot.
5. The composition. Try Temperature(τ) ∘ TopP(p) for a small grid (e.g., τ ∈ {0.7, 1.0}, p ∈ {0.9, 0.95}). This is the recipe most production samplers use.
6. Find the knee. Plot all the points on a single set of (diversity, grammaticality) axes. The "knee of the curve", where grammaticality starts to drop sharply, marks the sampler parameter you'd ship. For the Mini-GPT trained on the §A13 corpus, the knee is expected somewhere around τ ≈ 1.0 to 1.5 with p = 0.9. Annotate where you see it.
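The sweep procedure from task 3 can be sketched as a small driver loop. The `generate` and `scorer` callables are hypothetical stand-ins for your model wrapper and grammaticality check; their signatures are assumptions, not an API defined by the earlier labs. The same loop serves the TopP sweep if the swept value is passed through as p instead of τ.

```python
from statistics import mean
from typing import Callable

# Hypothetical signature: generate(prompt, tau, seed) -> completion string.
Generate = Callable[[str, float, int], str]
# Hypothetical signature: scorer(prompt, completion) -> bool.
Scorer = Callable[[str, str], bool]

def sweep_temperature(
    generate: Generate,
    scorer: Scorer,
    prompts: list[str],
    taus: list[float],
    n_samples: int = 50,
) -> dict[float, tuple[float, float]]:
    """Return {tau: (mean diversity, mean grammaticality)} averaged over prompts."""
    results = {}
    for tau in taus:
        per_prompt_div, per_prompt_gram = [], []
        for prompt in prompts:
            # One completion per seed, so a rerun reproduces the same samples.
            completions = [generate(prompt, tau, seed) for seed in range(n_samples)]
            # Per-prompt diversity: fraction of unique completions.
            per_prompt_div.append(len(set(completions)) / len(completions))
            # Per-prompt grammaticality rate: fraction the scorer accepts.
            accepted = sum(scorer(prompt, c) for c in completions)
            per_prompt_gram.append(accepted / len(completions))
        results[tau] = (mean(per_prompt_div), mean(per_prompt_gram))
    return results
```

Averaging per-prompt rates, rather than pooling all completions, keeps a single easy prompt from dominating the point plotted for each τ.
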

Measurements

Save to experiments/<date>-phase-21-diversity/:

diversity_vs_accuracy.png — the scatter plot, with each sampler config as a labeled point.
temperature_curve.png — the slice along Temperature(τ).
topp_curve.png — the slice along TopP(p).
completions_sample.json — for each sampler config, save 5 example completions per prompt (for spot-checking).
manifest.json — seeds, prompts, model checkpoint hash, exact sweep grid.
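A manifest along these lines could be written with the standard library alone. This is a sketch under assumptions: the field names ("seeds", "prompts", "checkpoint_sha256", "sweep_grid") and both function names are hypothetical, and SHA-256 is one reasonable choice for the checkpoint hash rather than something the lab prescribes.

```python
import hashlib
import json
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(
    out_dir: Path,
    checkpoint: Path,
    prompts: list[str],
    seeds: list[int],
    sweep_grid: dict,
) -> Path:
    """Write manifest.json into out_dir and return its path."""
    manifest = {
        "seeds": seeds,
        "prompts": prompts,
        "checkpoint_sha256": sha256_of_file(checkpoint),  # hypothetical field name
        "sweep_grid": sweep_grid,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "manifest.json"
    # ensure_ascii=False keeps symbols such as τ readable in the file.
    path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Recording the exact grid and seeds alongside the checkpoint hash is what lets someone else regenerate the same scatter plot point for point.
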

Acceptance

The Temperature sweep produces a monotonically non-increasing grammaticality curve as τ grows (cooler = more correct).
The Temperature sweep produces a monotonically non-decreasing diversity curve as τ grows.
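Greedy sits at the low-diversity, high-grammaticality corner of the scatter plot: diversity = 1/50 = 0.02 (all 50 completions are identical, so there is one unique string) and grammaticality ≈ max. (If it doesn't, your greedy decode is broken; go back to lab 00.)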
Temperature(τ=2.0) is visibly worse on grammaticality than Temperature(τ=0.7). If it isn't, the model is so confident that temperature barely matters — investigate.
The PR / report includes a sentence picking one sampler config as the "production default" for the Phase 32 tutor, with a justification grounded in the plot.
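The monotonicity criteria above lend themselves to an automated check. The sketch below assumes the sweep results are stored as a dict mapping τ to a (diversity, grammaticality) pair; that layout, the function names, and the optional `tol` slack for sampling noise are assumptions. With the default `tol=0.0` the check is strict, as the criteria are written.

```python
def is_non_increasing(values: list[float], tol: float = 0.0) -> bool:
    """True if each value is at most its predecessor, allowing `tol` of slack."""
    return all(b <= a + tol for a, b in zip(values, values[1:]))

def is_non_decreasing(values: list[float], tol: float = 0.0) -> bool:
    """True if each value is at least its predecessor, allowing `tol` of slack."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))

def check_temperature_sweep(
    results: dict[float, tuple[float, float]], tol: float = 0.0
) -> None:
    """Raise AssertionError if the sweep violates the acceptance criteria.

    `results` maps tau -> (diversity, grammaticality).
    """
    taus = sorted(results)
    diversity = [results[t][0] for t in taus]
    grammaticality = [results[t][1] for t in taus]
    assert is_non_decreasing(diversity, tol), "diversity should not fall as tau grows"
    assert is_non_increasing(grammaticality, tol), "grammaticality should not rise as tau grows"
    # Only compare the two named temperatures if both were swept.
    if 0.7 in results and 2.0 in results:
        assert results[2.0][1] < results[0.7][1], "tau=2.0 should score worse than tau=0.7"
```

If a strict check fails by a hair, rerunning with more completions per prompt is a better first response than loosening `tol`.
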

Pitfalls

Diversity over-counts trivial variation. "He works." vs "He works ." (extra space) are not really different. Normalise whitespace before computing the unique set.
Grammaticality scoring is hard. The §A13 corpus is small enough that you can enumerate all valid forms — use the truth table, not heuristics. Heuristics will silently accept "He worked yesterday" (which is grammatical English but doesn't match the prompt-tense expectation).
Sample size. 50 completions per (prompt, sampler) is barely enough — the diversity estimate has high variance for low τ. If a result looks noisy, increase to 100 or 200.
Reading the curve wrong. Along a single sampler's sweep you cannot raise diversity and grammaticality together: moving along the curve trades one for the other, and the knee is the trade-off point. When comparing different sampler configs, a point closer to the upper-right corner is better, but the corner itself is unachievable for a model with finite capacity.
Don't overfit the sampler to this lab. The trade-off curve is prompt-dependent. Pick the sampler that wins broadly, not just on "Tomorrow she".
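The whitespace pitfall can be handled by normalising each completion before building the unique set. The following is a minimal sketch; the helper names are hypothetical, and the choice to leave letter case untouched is an assumption you may want to revisit.

```python
import re

def normalise(completion: str) -> str:
    """Collapse whitespace runs and drop spaces before sentence punctuation."""
    collapsed = " ".join(completion.split())
    return re.sub(r"\s+([.,!?;:])", r"\1", collapsed)

def diversity_normalised(completions: list[str]) -> float:
    """Fraction of unique completions after whitespace normalisation."""
    if not completions:
        return 0.0
    return len({normalise(c) for c in completions}) / len(completions)
```

With this in place, "He works." and "He works ." count as one completion, while "He works." and "He worked." still count as two.
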

Stretch

Plot the same curve for nucleus + temperature combined. Does the composition shift the knee?
Per-tense breakdown. Is the trade-off worse for past tense (more irregular forms) than for present tense (mostly regular -s)?

Next: Write the PHASE_21_REPORT.md tying these results together. Then on to Phase 22 (KV cache).