Phase 12 — Quiz (human-readable mirror)

Readable mirror of the canonical data/quizzes/phase-12-corpus-design.yaml.

Source: data/quizzes/phase-12-corpus-design.yaml.


q-12-01 — Why enumerated, not scraped? (single)

  • Smaller corpora always generalize better
  • Enumeration guarantees coverage, balance, and zero label noise — at this scale all three matter more than size
  • Scraped corpora are illegal to use
  • Web text contains too many emoji

The §A13 task requires the model to memorize ~700 bits of facts and generalize one rule. An enumerated corpus gives full coverage and zero noise; scraped corpora trade those for size.
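
As a rough illustration of what "enumerated" means here, the sketch below builds every combination of a small inventory. The verb, tense, and person lists and the enumerate_corpus helper are hypothetical placeholders, not the contents of the actual corpus.

```python
from collections import Counter
from itertools import product

# Hypothetical inventories, chosen only for illustration.
VERBS = ["walk", "jump", "talk"]
TENSES = ["present", "past"]
PERSONS = ["1sg", "2sg", "3sg"]

def enumerate_corpus(verbs, tenses, persons):
    """Return one (verb, tense, person) cell for every combination."""
    return list(product(verbs, tenses, persons))

corpus = enumerate_corpus(VERBS, TENSES, PERSONS)

# Coverage: every cell exists exactly once.
# Balance: every verb appears the same number of times.
per_verb = Counter(verb for verb, _, _ in corpus)
print(len(corpus), dict(per_verb))
```

Because each cell is generated rather than collected, coverage and balance hold by construction; labels attached by a deterministic rule would likewise be noise-free.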


q-12-02 — What does a stratified split guarantee? (multi)

  • Each tense appears in both train and val sets
  • Each person appears in both train and val sets
  • No verb appears in both train and val (leakage prevention)
  • The val set contains exactly the same examples as the train set

Stratification ensures balanced tense/person coverage and disjoint verbs across splits.
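
One way to get these guarantees, sketched under the assumption that examples are (verb, tense, person) tuples from a fully enumerated corpus, is to assign whole verbs to one side of the split. The function names and the 25% validation fraction are illustrative choices, not taken from the source.

```python
import random
from itertools import product

def split_by_verb(examples, val_fraction=0.25, seed=0):
    """Assign whole verbs to train or val so no verb crosses the split."""
    verbs = sorted({verb for verb, _, _ in examples})
    random.Random(seed).shuffle(verbs)
    n_val = max(1, round(len(verbs) * val_fraction))
    val_verbs = set(verbs[:n_val])
    train = [ex for ex in examples if ex[0] not in val_verbs]
    val = [ex for ex in examples if ex[0] in val_verbs]
    return train, val

def split_report(train, val):
    """Check the three properties listed in the quiz options."""
    return {
        "tenses_shared": {t for _, t, _ in train} == {t for _, t, _ in val},
        "persons_shared": {p for _, _, p in train} == {p for _, _, p in val},
        "verbs_disjoint": not ({v for v, _, _ in train} & {v for v, _, _ in val}),
    }

# Hypothetical enumerated corpus: 4 verbs x 2 tenses x 2 persons.
examples = list(product(["walk", "jump", "talk", "play"],
                        ["present", "past"],
                        ["1sg", "3sg"]))
train, val = split_by_verb(examples)
report = split_report(train, val)
print(report)
```

Since every verb carries all tense and person cells in an enumerated corpus, holding out whole verbs keeps each tense and person present on both sides while preventing verb leakage.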


q-12-03 — Find the bug: train and val loss diverge (free)

A run shows train loss dropping to 0.05 while val loss stays stuck at exactly log(5) ≈ 1.61. What is the single most likely cause?

Expected to contain: label.

A val loss that plateaus at log(K) equals the cross-entropy of a uniform guess over K classes (the random baseline); here K = 5. The model memorizes train while val stays at chance, so the labels carry no signal that transfers to val. The most common cause is shuffled labels.
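
The log(K) baseline can be checked numerically. This is a minimal sketch using natural logarithms (nats); uniform_cross_entropy is a hypothetical helper name.

```python
import math

def uniform_cross_entropy(num_classes):
    """Cross-entropy in nats when every class is predicted with probability 1/K."""
    p = 1.0 / num_classes
    # The true class always receives probability p, so the loss is -log(p) = log(K).
    return -math.log(p)

baseline = uniform_cross_entropy(5)
print(round(baseline, 2))  # 1.61
```

A val loss sitting at this value means the model is doing no better on val than guessing uniformly among the 5 classes.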


q-12-04 — Memorization vs generalization regime (single)

Which of these benefits most from adding more distinct regular verbs?

  • Memorization of irregulars
  • Generalization of the regular rule
  • Both equally
  • Neither — corpus size doesn't matter at this scale

More distinct regulars teach the model that -ed is a rule, not a fact about walk.


q-12-05 — Quality vs quantity at small scale (free)

Expected to contain: quality.

At microscopic scale, label noise dominates. 100 perfectly labeled examples beat 10,000 noisy ones for learning the rule. Coverage and balance also favor the curated set.