04 — Corpus size, quality, and the memorization-vs-generalization tradeoff

How many examples does §A13 need? Naive intuition says "more is better". For small, enumerated corpora the reality is the opposite: when the corpus is exhaustive over its domain (20 verbs × 5 tenses × 3 persons = 300 forms), adding examples adds no information; it only changes the memorization pressure. This page works through the arithmetic of when memorizing is the right behaviour, when generalizing is, and where §A13 falls.

Anchors: LYNX_CORTEX.md §4 / PHASE 12; LYNX_CORTEX_ADDENDUM.md §A13; Phase 19 (training dynamics); Phase 28 §1 SFT and catastrophic forgetting.


The basic question

Borja's §A13 corpus: 20 verbs × 5 tenses × 3 persons × 2 languages ≈ 600 (English, Spanish) form pairs, plus context sentences (~200), plus mis-conjugation examples for the audit/correction agent (~50). Total: ~850 sentences.
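
The count above can be reproduced by enumerating the grid. This is a minimal sketch: the verb and tense names are hypothetical placeholders (the real §A13 lists are not given on this page), and the context and mis-conjugation counts are the approximate figures quoted above.

```python
from itertools import product

# Hypothetical placeholder names; only the counts come from the text.
VERBS = [f"verb_{i:02d}" for i in range(20)]
TENSES = [f"tense_{i}" for i in range(5)]
PERSONS = ["1st", "2nd", "3rd"]
LANGUAGES = ["en", "es"]

# One entry per (verb, tense, person, language) combination.
forms = list(product(VERBS, TENSES, PERSONS, LANGUAGES))

N_CONTEXT = 200  # approximate, from the text
N_MISCONJ = 50   # approximate, from the text
total = len(forms) + N_CONTEXT + N_MISCONJ

print(len(forms), total)  # 600 850
```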

A modern LLM trains on ~10^12 tokens (Llama-2, ~2T tokens). At ~12K tokens, the §A13 corpus is roughly 10^8 times smaller by token count.

Should we be embarrassed? Or is this the right size?


Two regimes of model training

Regime          Goal                        Corpus size needs to be
Memorization    Store explicit facts        ≥ # facts to be stored
Generalization  Learn underlying structure  ≥ # examples per "rule"

Most ML hype is about generalization. But many real-world tasks (lookup, FAQ, structured generation) want memorization — and they want it to be perfect.

The §A13 task sits between. The model must:

  • Memorize the irregular conjugations (be → was → been, go → went → gone).
  • Generalize the regular pattern (work → worked → worked, applied to any new regular verb).

This is exactly why the curriculum chose 12 regular + 8 irregular verbs.
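
The two behaviours can be illustrated with a hand-written analogue: a lookup table for the irregular verbs and a suffix rule for everything else. This is only a sketch of the distinction, not the mini-GPT itself. The function names are hypothetical, the rule is deliberately simplified (no consonant doubling), and "be" is reduced to the single past form used in the example above.

```python
# Memorization: irregular forms must be stored explicitly.
IRREGULAR = {
    "be": ("was", "been"),
    "go": ("went", "gone"),
}


def regular_past(verb: str) -> str:
    """Generalization: apply a simplified -ed rule to any verb."""
    if verb.endswith("e"):
        return verb + "d"
    if len(verb) > 1 and verb.endswith("y") and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"
    return verb + "ed"


def principal_parts(verb: str) -> tuple:
    """Return (base, past, past participle), preferring the stored table."""
    if verb in IRREGULAR:
        past, participle = IRREGULAR[verb]
    else:
        past = participle = regular_past(verb)
    return (verb, past, participle)


print(principal_parts("go"))    # ('go', 'went', 'gone')
print(principal_parts("work"))  # ('work', 'worked', 'worked')
```

The table cannot answer for a verb it has never seen, while the rule can; that asymmetry is what the regular/irregular mix is meant to exercise.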


The information-theoretic floor

Each conjugation is a 5-way classification (5 tenses). Entropy per example: log2(5) ≈ 2.32 bits. To fit 300 conjugations as memorization, the model needs ≥ 300 × 2.32 ≈ 700 bits of capacity in the right places.

A mini-GPT with ~50K parameters at fp32 has ~1.6M bits (200 KB) of nominal capacity, more than 2,000 times the floor. Plenty of room. The Phase 17 mini-GPT can memorize §A13 in 100 epochs trivially.
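
The floor and the nominal capacity can be checked with a few lines of arithmetic. The constants below are the figures quoted in this section. Raw parameter bits are only an upper bound on storage, so the headroom figure is an optimistic sanity check rather than a measurement.

```python
import math

N_CONJUGATIONS = 300   # 20 verbs x 5 tenses x 3 persons
N_TENSES = 5           # 5-way classification per example
N_PARAMS = 50_000      # approximate mini-GPT size
BITS_PER_PARAM = 32    # fp32

bits_per_example = math.log2(N_TENSES)
bits_needed = N_CONJUGATIONS * bits_per_example
nominal_bits = N_PARAMS * BITS_PER_PARAM
headroom = nominal_bits / bits_needed

print(f"{bits_per_example:.2f} bits/example, {bits_needed:.0f} bits needed")
print(f"{nominal_bits:,} nominal bits = {nominal_bits // 8 // 1000} KB")
print(f"headroom: about {headroom:.0f}x")
```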

The question is whether it generalizes to the 75 regular conjugations not in the training set: we hold out 5 of the 12 regular verbs, and 5 verbs × 5 tenses × 3 persons = 75 pairs form the val set.
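
A verb-level hold-out can be sketched as follows. Holding out 5 of the 12 regular verbs gives 5 × 15 = 75 validation triples and leaves 7 × 15 = 105 for training. The verb names, the `split_by_verb` helper, and the fixed seed are assumptions for illustration; the curriculum's actual split code is not shown on this page.

```python
import random
from itertools import product

REGULAR_VERBS = [f"regular_{i:02d}" for i in range(12)]  # hypothetical names
N_TENSES = 5
N_PERSONS = 3


def split_by_verb(verbs, n_holdout, seed=0):
    """Hold out whole verbs so no validation verb is ever seen in training."""
    rng = random.Random(seed)
    held_out = set(rng.sample(verbs, n_holdout))
    train, val = [], []
    for verb, tense, person in product(verbs, range(N_TENSES), range(N_PERSONS)):
        (val if verb in held_out else train).append((verb, tense, person))
    return train, val


train, val = split_by_verb(REGULAR_VERBS, n_holdout=5)
print(len(train), len(val))  # 105 75
```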


The "right size" calculation for §A13

Suppose the model needs N_ex examples per regular-verb rule to generalize. Empirically (Phase 19's train/val curve) we observe:

  • ≤ 50 examples per rule → memorizes train, fails val on unseen verbs.
  • 50–200 examples per rule → starts to generalize; val accuracy climbs.
  • ≥ 200 examples per rule → val plateau near 100% (rule is captured).

For §A13's regular verbs, each verb contributes 3 persons × 5 tenses = 15 forms. With all 12 regular verbs in train, that is 12 × 15 = 180 regular examples (105 if 5 verbs are held out for validation), inside the 50–200 band and just short of the plateau threshold. That's near the cliff. Phase 19 measures it.

Compare: a real-world tutoring corpus (300 verbs × 5 tenses × 3 persons × bilingual ≈ 9000 forms) has more forms but the same 15 forms per regular verb. §A13 is operating in the same regime, just with fewer verbs. Borja can scale §A13 to 100 verbs in Phase 30+ if needed; the per-verb structure doesn't change.
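
The three bands can be written as a small lookup to see where a given corpus lands. This is a sketch: the `regime` function is hypothetical, and the handling of the boundaries (exactly 50 counts as memorization, exactly 200 as plateau) is an assumption, since the bands as stated overlap at their edges.

```python
FORMS_PER_VERB = 3 * 5  # persons x tenses


def regime(examples_per_rule: int) -> str:
    """Map an example count onto the three empirically observed bands."""
    if examples_per_rule <= 50:
        return "memorizes train, fails val"
    if examples_per_rule < 200:
        return "starts to generalize"
    return "val plateau"


n_regular_examples = 12 * FORMS_PER_VERB
print(n_regular_examples, regime(n_regular_examples))  # 180 starts to generalize
```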


When more data hurts: memorization-bound tasks

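Counterintuitive: for tasks where the model is meant to memorize, adding distractor examples (related but irrelevant data) can degrade performance, because a small model spreads a fixed capacity across them. Carlini et al. (2023) quantify how memorization varies with model size, data duplication, and context length; the distractor effect itself is treated here as a working assumption for small models, not as a result from that paper.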

For §A13, this means:

  • ✅ Adding more contexts of walk → walked (different sentences using the conjugation) helps.
  • ❌ Adding random tweets that happen to mention "walk" hurts: the model spreads its capacity across them.

Microscopic scope is a positive design choice, not a budget constraint.


When more data helps: generalization-bound tasks

For the regular-verb generalization, the model needs diversity of contexts. Adding more contexts per regular verb (different sentence patterns, different persons) genuinely helps. Adding more distinct regular verbs helps even more — it teaches the model that -ed is not specific to walk.

The §A13 corpus has 12 regular verbs. Whether that's enough is a Phase 19 empirical question. The lab there sweeps N_regular_verbs ∈ {4, 8, 12, 20} and measures val-set accuracy.
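
A sweep of this shape could be organized as below. This is a harness sketch only: the verb pool, the held-out verbs, and the `run_fn` callback (which would train the model and return val accuracy) are hypothetical. The demo plugs in a stub that returns the training-set size, so no accuracy numbers are implied.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int, int]  # (verb, tense index, person index)


def examples_for(verbs: List[str]) -> List[Example]:
    """All 5 tenses x 3 persons for each verb."""
    return list(product(verbs, range(5), range(3)))


def sweep(
    sizes: List[int],
    pool: List[str],
    val_verbs: List[str],
    run_fn: Callable[[List[Example], List[Example]], float],
) -> Dict[int, float]:
    """Call run_fn once per training-set size against a fixed val set."""
    val = examples_for(val_verbs)
    return {n: run_fn(examples_for(pool[:n]), val) for n in sizes}


POOL = [f"regular_{i:02d}" for i in range(20)]  # hypothetical verbs
VAL_VERBS = [f"heldout_{i}" for i in range(5)]  # never in POOL


def count_train_examples(train: List[Example], val: List[Example]) -> float:
    # Stub standing in for "train the model, return val accuracy".
    return float(len(train))


print(sweep([4, 8, 12, 20], POOL, VAL_VERBS, count_train_examples))
# {4: 60.0, 8: 120.0, 12: 180.0, 20: 300.0}
```

Keeping the val verbs fixed across all sizes means the four runs are compared on the same unseen verbs.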


Quality vs quantity — the actual math

For a fixed budget of B training tokens, the optimal corpus has:

  • High coverage: every required form appears at least k times (k ≈ 3 empirically).
  • Low noise: every form is correctly labeled (mis-conjugation examples are labeled as such).
  • Balanced: each language gets the same number of tokens (within 10%).

A 1M-token web-scrape that mentions "walk" 50× and "walks" 2× and "walked" 200× is worse than a 1K-token enumerated corpus with each form 10×. Coverage wins.
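
The coverage criterion can be checked mechanically. This is a minimal sketch using the counts from the comparison above and k = 3; the `min_coverage` and `meets_coverage` helpers are hypothetical names, not part of the curriculum code.

```python
from collections import Counter

REQUIRED_FORMS = ["walk", "walks", "walked"]


def min_coverage(form_counts: Counter, required: list) -> int:
    """Occurrences of the rarest required form (0 if a form is absent)."""
    return min(form_counts.get(form, 0) for form in required)


def meets_coverage(form_counts: Counter, required: list, k: int = 3) -> bool:
    """True when every required form appears at least k times."""
    return min_coverage(form_counts, required) >= k


scraped = Counter({"walk": 50, "walks": 2, "walked": 200})
enumerated = Counter({"walk": 10, "walks": 10, "walked": 10})

print(min_coverage(scraped, REQUIRED_FORMS), meets_coverage(scraped, REQUIRED_FORMS))        # 2 False
print(min_coverage(enumerated, REQUIRED_FORMS), meets_coverage(enumerated, REQUIRED_FORMS))  # 10 True
```

The scrape has far more tokens in total, but its rarest required form falls below k, which is the sense in which coverage wins.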

Phase 12 §02 (leakage and splits) reinforces this: the corpus is enumerated, the split is stratified, the validation set is disjoint by verb-tense-person triple.


The §A13 numbers, concretely

Total forms (enumerated)               : 20 × 5 × 3 × 2 = 600 pairs
Context sentences (10/verb × 20 verbs) : 200 sentences
Mis-conjugation examples (Phase 20)    : 50 (correct vs incorrect pairs)
Total tokens (byte-level BPE, vocab=512): ~12,000 tokens

Train/val split                        : 80 / 20 stratified by triple
Train tokens                           : ~9,600
Val tokens                             : ~2,400

Train examples seen per epoch          : 240
Steps per epoch (batch size 8)         : 30
Total steps for convergence (Phase 19) : ~1,000  (~30 epochs)

Memory cost (fp32 mini-GPT at 50K params): 200KB weights
Disk cost (corpus)                       : 50KB
Compute cost (CPU only)                  : 5 min on i5-8250U
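
The derived rows in the block above follow from a handful of inputs. The sketch below recomputes them so the figures can be re-derived if any input changes; the variable names are illustrative, and the token and step counts are the approximate values quoted above.

```python
TOTAL_TOKENS = 12_000
TRAIN_FRACTION = 0.8
TRAIN_EXAMPLES_PER_EPOCH = 240
BATCH_SIZE = 8
TOTAL_STEPS = 1_000
N_PARAMS = 50_000
BYTES_PER_PARAM = 4  # fp32

train_tokens = round(TOTAL_TOKENS * TRAIN_FRACTION)
val_tokens = TOTAL_TOKENS - train_tokens
steps_per_epoch = TRAIN_EXAMPLES_PER_EPOCH // BATCH_SIZE
epochs_to_converge = TOTAL_STEPS / steps_per_epoch
weights_kb = N_PARAMS * BYTES_PER_PARAM / 1000

print(train_tokens, val_tokens)      # 9600 2400
print(steps_per_epoch)               # 30
print(round(epochs_to_converge, 1))  # 33.3
print(weights_kb)                    # 200.0
```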

This is the curriculum's proof-of-concept budget. Phase 41 (portal) will measure whether a student can complete the full 41-phase curriculum on a single laptop in 6 months. Phase 12's corpus is sized to fit that budget.


What this doesn't address

  • Tail of natural language — real English has 30,000+ verbs. The §A13 model would fail on aboriginalize → aboriginalized. That's out of scope.
  • Multi-clause sentences — §A13 sentences are 5 words long. Phase 17's mini-GPT will not learn to handle "If I had walked, I would have arrived". This is deferred (no scope expansion without explicit approval, per CLAUDE.md §0.1).
  • Distributional shift — if a student uses the portal with their own sentences in Phase 41, those sentences may have novel structures. Phase 38 (MLOps) addresses drift detection.

Citations

  • Carlini, N. et al. 2023. "Quantifying Memorization Across Neural Language Models." arXiv:2202.07646. Quantifies how memorization varies with model size, data duplication, and context length.
  • Hoffmann, J. et al. 2022. "Training Compute-Optimal Large Language Models" (Chinchilla). arXiv:2203.15556. The ratio tokens : parameters ≈ 20 : 1 is the compute-optimal point for large-scale pretraining. §A13 deliberately operates outside this regime (memorization-heavy, microscopic scope); see Extension Track X1.
  • Kaplan, J. et al. 2020. "Scaling Laws for Neural Language Models." arXiv:2001.08361.

One-paragraph recap

§A13's corpus is microscopic (~850 sentences, ~12K tokens) by design. Its task splits into memorization (irregular verbs; the model must store ~700 bits of explicit facts) and generalization (regular verbs; the corpus supplies 15 forms per verb, ~180 regular examples in all, from which the model must capture the -ed pattern). Adding distractor data hurts memorization-bound subtasks (it dilutes capacity), while adding verbs and contexts helps generalization-bound subtasks (coverage). Enumerated, balanced, low-noise corpora at this scale outperform 1000× larger scraped corpora for §A13's specific task shape. Phase 19 measures the train/val cliff empirically; Phase 30 (or X1) extends the corpus if Borja wants to scale.


Next: Phase 13 (embeddings).