
Phase 12 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-12-corpus-design.yaml.

Source: data/quizzes/phase-12-corpus-design.yaml.

q-12-01 — Why enumerated, not scraped? (single)

Smaller corpora always generalize better
Enumeration guarantees coverage, balance, and zero label noise — at this scale all three matter more than size ✓
Scraped corpora are illegal to use
Web text contains too many emoji

The §A13 task requires memorizing ~700 bits of facts and generalizing one rule. An enumerated corpus gives full coverage and zero label noise; a scraped corpus trades both for size.
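The coverage and balance claims can be illustrated with a minimal sketch. The verb, tense, and person inventories below are hypothetical placeholders, not the course's actual lists; the point is only that taking the full Cartesian product yields every combination exactly once, so each (tense, person) cell is equally populated by construction.

```python
from collections import Counter
from itertools import product

# Hypothetical inventories (assumptions for illustration only).
VERBS = ["walk", "jump", "go", "be"]
TENSES = ["present", "past"]
PERSONS = ["1sg", "2sg", "3sg"]

def enumerate_corpus(verbs, tenses, persons):
    """Return one example per (verb, tense, person) combination."""
    return [
        {"verb": v, "tense": t, "person": p}
        for v, t, p in product(verbs, tenses, persons)
    ]

corpus = enumerate_corpus(VERBS, TENSES, PERSONS)

# Count how many examples fall in each (tense, person) cell.
cell_counts = Counter((ex["tense"], ex["person"]) for ex in corpus)

# Balanced means every cell holds the same number of examples.
is_balanced = len(set(cell_counts.values())) == 1
```

A scraped corpus offers no such guarantee: rare cells may be missing entirely, and frequent ones can dominate.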

q-12-02 — What does a stratified split guarantee? (multi)

Each tense appears in both train and val sets ✓
Each person appears in both train and val sets ✓
No verb appears in both train and val (leakage prevention) ✓
The val set contains exactly the same examples as the train set

Stratification ensures balanced tense/person coverage and disjoint verbs across splits.
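One way to obtain these three properties, sketched under assumptions: hold out whole verbs from a fully enumerated corpus. The function names `split_by_verb` and `check_split`, the toy inventories, and the 25% validation fraction are all hypothetical and may differ from how the course implements its split.

```python
import random
from itertools import product

# Hypothetical toy corpus: one example per (verb, tense, person).
VERBS = ["walk", "jump", "go", "be", "talk", "sing", "play", "eat"]
TENSES = ["present", "past"]
PERSONS = ["1sg", "2sg", "3sg"]
examples = [
    {"verb": v, "tense": t, "person": p}
    for v, t, p in product(VERBS, TENSES, PERSONS)
]

def split_by_verb(examples, val_fraction=0.25, seed=0):
    """Hold out whole verbs so that no verb appears in both splits."""
    verbs = sorted({ex["verb"] for ex in examples})
    random.Random(seed).shuffle(verbs)
    n_val = max(1, round(len(verbs) * val_fraction))
    val_verbs = set(verbs[:n_val])
    train = [ex for ex in examples if ex["verb"] not in val_verbs]
    val = [ex for ex in examples if ex["verb"] in val_verbs]
    return train, val

def check_split(train, val):
    """Report the three properties named in the quiz answer."""
    def values(split, key):
        return {ex[key] for ex in split}
    return {
        # Same set of tenses (resp. persons) on both sides of the split.
        "tenses_in_both": values(train, "tense") == values(val, "tense"),
        "persons_in_both": values(train, "person") == values(val, "person"),
        # No verb is shared between train and val.
        "verbs_disjoint": values(train, "verb").isdisjoint(values(val, "verb")),
    }

train, val = split_by_verb(examples)
report = check_split(train, val)
```

Because every verb carries every tense and person in an enumerated corpus, splitting at the verb level keeps both splits covered as long as each receives at least one verb.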

q-12-03 — Find the bug: train and val loss diverge (free)

A run shows train loss dropping to 0.05 and val loss stuck at exactly log(5) ≈ 1.61. What is the single most likely cause?

Expected to contain: label.

A val loss of log(K) is the cross-entropy of a uniform prediction over K classes, i.e. the random baseline. Train loss falls because the model memorizes, while val stays at chance: the labels carry no signal for val. The most common cause is shuffled labels.
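The log(K) baseline can be checked numerically. The sketch below assumes a natural-log cross-entropy loss and K = 5 classes, matching the log(5) ≈ 1.61 figure in the question; the helper names are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probs."""
    return -math.log(probs[label])

def uniform_baseline(num_classes):
    """Loss of a model that assigns probability 1/K to every class."""
    probs = [1.0 / num_classes] * num_classes
    # Every class has the same probability, so the loss does not
    # depend on which label is the true one.
    return cross_entropy(probs, 0)

baseline = uniform_baseline(5)
print(f"{baseline:.2f}")  # prints 1.61
```

A val loss pinned at this value is a useful diagnostic: the model is doing no better on val than a uniform guess over the classes.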

q-12-04 — Memorization vs generalization regime (single)

Which does adding more distinct regular verbs help most?

Memorization of irregulars
Generalization of the regular rule ✓
Both equally
Neither — corpus size doesn't matter at this scale

More distinct regulars teach the model that -ed is a rule, not a fact about walk.

q-12-05 — Quality vs quantity at small scale (free)

Expected to contain: quality.

At microscopic scale, label noise dominates. 100 perfectly labeled examples beat 10,000 noisy ones for learning the rule. Coverage and balance also favor the curated set.