Phase 14 — Pre-Transformer Sequence Models

Requires: 13 — Embeddings & Representation Spaces

Teaches: n-gram · rnn · lstm · vanishing-gradient · perplexity-baselines

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12; pivoted to English verb grammar per A13. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

Before attention, there was something else. In this phase we live, briefly, in the world of n-grams and RNNs: just long enough to get a real perplexity baseline on the English verb-conjugation corpus (Phase 12) that the transformer of Phases 17–18 will have to beat, and to feel in our fingers why the gradient vanishes through time.

Goal

Establish the baseline that the Mini-GPT must beat (Phase 17/18) and the mechanical intuition for why recurrent models lost the architecture war. By the end of Phase 14:

A committed perplexity number for an n-gram LM on the Phase 12 English-verb-grammar corpus.
A hand-built NumPy RNN/GRU forward pass Borja has stared at long enough to feel the recurrence — applied to the canonical example I work, you work, he ___ (predict works).
A written, precise statement of the two failure modes of recurrent models on long contexts (gradient + serial compute) — the bridge to Phase 15.

Phase 14 is shallow on purpose. The spec line is "RNN, LSTM, GRU at the conceptual level (no need to fully train)". The depth lives in Phase 15 (attention) and Phase 18 (training loop).
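
As a rough illustration of the forward-only pass named in the goals above, here is a minimal NumPy sketch of a vanilla RNN run over the canonical prompt. Everything specific in it is an assumption made for illustration: the six-token toy vocabulary (the real one comes from the Phase 11 tokenizer), the hidden size, the weight names `Wxh`, `Whh`, `Why`, and the random untrained weights. Because nothing is trained, the top-5 ranking it prints is arbitrary; the point is only to watch the hidden state evolve token by token. It covers the vanilla cell, not the GRU.

```python
import numpy as np

# Hypothetical toy vocabulary; the real token ids come from the Phase 11 tokenizer.
VOCAB = ["I", "you", "he", "work", "works", ","]
TOK = {w: i for i, w in enumerate(VOCAB)}


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()


def rnn_forward(token_ids, Wxh, Whh, Why, bh, by):
    """Vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh).

    Returns every hidden state and the logits computed from the last one.
    """
    h = np.zeros(Whh.shape[0])
    states = []
    for t in token_ids:
        x = np.zeros(Wxh.shape[1])
        x[t] = 1.0                      # one-hot encoding of the current token
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        states.append(h.copy())
    logits = Why @ h + by
    return np.array(states), logits


# Assumption: random, untrained weights (this phase is forward-only).
rng = np.random.default_rng(0)
H, V = 8, len(VOCAB)
Wxh = rng.normal(0.0, 0.5, (H, V))
Whh = rng.normal(0.0, 0.5, (H, H))
Why = rng.normal(0.0, 0.5, (V, H))
bh, by = np.zeros(H), np.zeros(V)

prompt = ["I", "work", ",", "you", "work", ",", "he"]
states, logits = rnn_forward([TOK[w] for w in prompt], Wxh, Whh, Why, bh, by)

# Print every hidden state, one row per prompt token.
for w, h in zip(prompt, states):
    print(f"{w:>5} -> {np.round(h, 3)}")

# Rank the top-5 next-token candidates (meaningless until the weights are trained).
probs = softmax(logits)
top5 = np.argsort(probs)[::-1][:5]
for i in top5:
    print(f"{VOCAB[i]:>5}  p={probs[i]:.3f}")
```

Keeping the loop explicit, rather than vectorising over time, makes the serial dependency visible: step t cannot start until step t-1 has produced its hidden state.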

Read order

theory/00-motivation.md — why this phase exists; what exactly it gives us that Phase 15 will then dismantle.
theory/01-ngram-models.md — MLE, smoothing, perplexity. The boring-but-real baseline on the verb-grammar corpus.
theory/02-rnn-recurrence.md — vanilla RNN, GRU, LSTM as a family of state-machines. Derived from first principles, illustrated on subject-pronoun → verb-form chains.
theory/03-vanishing-gradient.md — BPTT product-of-Jacobians; why eigenvalues drive vanish/explode; how LSTM/GRU patch (but don't solve) the problem.
lab/00-tokenize-corpus.md — wire up Phase 11/12 outputs as inputs to this phase.
lab/01-ngram-baseline.md — train and commit the perplexity number.
lab/02-rnn-by-hand.md — NumPy RNN forward on I work, you work, he ___. Print every hidden state. Rank the model's top-5 next-token predictions.
lab/03-vanishing-empirical.md — measure gradient decay over 50 steps on a long subject-verb-agreement chain; demonstrate the failure.
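
The product-of-Jacobians view behind the vanishing-gradient items above can be sketched numerically without any training. The snippet below is an illustration under stated assumptions, not the lab solution: it uses a tanh RNN with random weights, random token ids as a stand-in for a long subject-verb-agreement chain, and it rescales `Whh` so its largest singular value is 0.5. Note that it bounds decay with the spectral norm (largest singular value), which is a sufficient condition for shrinkage, rather than analysing eigenvalues directly.

```python
import numpy as np


def state_jacobian_norms(Wxh, Whh, token_ids):
    """Spectral norm of d h_t / d h_0 after each step of a tanh RNN."""
    H = Whh.shape[0]
    h = np.zeros(H)
    prod = np.eye(H)                       # running product J_t ... J_1
    norms = []
    for t in token_ids:
        h = np.tanh(Wxh[:, t] + Whh @ h)   # one-hot input selects a column of Wxh
        J = (1.0 - h ** 2)[:, None] * Whh  # d h_t / d h_{t-1} = diag(1 - h_t^2) Whh
        prod = J @ prod
        norms.append(np.linalg.norm(prod, 2))
    return np.array(norms)


rng = np.random.default_rng(0)
H, V, T = 16, 6, 50                        # hypothetical sizes; T matches the 50-step lab
Wxh = rng.normal(0.0, 0.5, (H, V))
Whh = rng.normal(0.0, 1.0, (H, H))
Whh *= 0.5 / np.linalg.norm(Whh, 2)        # assumption: largest singular value forced to 0.5
token_ids = rng.integers(0, V, size=T)     # stand-in for a long agreement chain

norms = state_jacobian_norms(Wxh, Whh, token_ids)
for step in (1, 10, 25, 50):
    print(f"step {step:2d}: ||d h_t / d h_0|| = {norms[step - 1]:.3e}")
```

Since the tanh derivative is at most 1, each per-step Jacobian here has spectral norm at most 0.5, so the product over t steps is bounded by 0.5^t. Rescaling `Whh` to a norm well above 1 instead is one way to look for the exploding side of the same product, although tanh saturation can still damp it.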

solutions/ is empty during pre-write — populated at phase open after Borja's prior-phase API decisions (Phase 11 tokenizer, Phase 12 corpus split) are visible.

Definition of Done

See PHASE_14_PLAN.md §6. Briefly:

N-gram perplexity numbers (3-gram and 5-gram) committed in experiments/14-ngram-baseline/manifest.json.
RNN forward pass committed; intermediate hidden states printed for the canonical conjugation-completion prompt.
Vanishing-gradient empirical decay plot committed.
Borja can articulate, in writing, the two distinct reasons RNNs lost to transformers (gradient + parallelism).
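
To make the first item above concrete, here is a minimal pure-Python sketch of how a 3-gram and a 5-gram perplexity could be computed with add-alpha smoothing. The function names, the `<s>`/`</s>` padding tokens, and the tiny in-line corpus are all hypothetical; the real inputs are the Phase 12 corpus split, and the layout of manifest.json is not assumed here. The sketch does not handle held-out tokens that are missing from the training vocabulary.

```python
import math
from collections import Counter


def train_ngram(sentences, n):
    """Count n-grams and their (n-1)-token contexts over padded sentences."""
    ngrams, contexts, vocab = Counter(), Counter(), {"</s>"}
    for sent in sentences:
        vocab.update(sent)
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(toks)):
            ctx = tuple(toks[i - n + 1:i])
            ngrams[ctx + (toks[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts, vocab


def perplexity(sentences, n, ngrams, contexts, vocab, alpha=1.0):
    """exp of the average negative log-probability under add-alpha smoothing."""
    V = len(vocab)
    log_prob, count = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(toks)):
            ctx = tuple(toks[i - n + 1:i])
            # Add-alpha estimate: (count + alpha) / (context count + alpha * |V|)
            p = (ngrams[ctx + (toks[i],)] + alpha) / (contexts[ctx] + alpha * V)
            log_prob += math.log(p)
            count += 1
    return math.exp(-log_prob / count)


# Hypothetical toy data; the real split comes from Phase 12.
train = [["I", "work"], ["you", "work"], ["he", "works"]] * 5
heldout = [["I", "work"], ["he", "works"]]

ppl = {}
for n in (3, 5):
    ngrams, contexts, vocab = train_ngram(train, n)
    ppl[n] = perplexity(heldout, n, ngrams, contexts, vocab, alpha=1.0)
    print(f"{n}-gram perplexity: {ppl[n]:.3f}")
```

A useful sanity check on any such implementation: as alpha grows very large, the smoothed distribution approaches uniform and the perplexity approaches the vocabulary size.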

What this phase intentionally does NOT cover

Training RNNs. Spec is "no need to fully train". Forward-only.
Seq2seq. Encoder-decoder is covered implicitly by Phase 15's cross-attention material; it gets no separate phase.
Attention. That's Phase 15 — kept separate so the contrast is sharp.
PyTorch reference comparison. No PyTorch in Phase 14 (anti-goal §10). Borja's NumPy RNN is checked against itself, not against a framework.
Modern recurrent revivals (Mamba, S4, RWKV). Phase 36 (frontier architectures) territory.
Kneser-Ney smoothing. Mentioned in passing. Add-\(\alpha\) is sufficient for the baseline.
Plural persons (we, you-pl, they). Per A13, plurals are deferred. Phase 14 conjugation examples stay in 1st-sg / 2nd-sg / 3rd-sg.
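
For reference, the add-\(\alpha\) estimate mentioned in the list above can be written in its standard textbook form as follows. The notation is an assumption, since the theory chapters may use different symbols: \(c(\cdot)\) is a training-set count, \(|V|\) is the vocabulary size, \(n\) is the n-gram order, and \(N\) is the number of predicted tokens in the evaluation text.

\[
P_\alpha\left(w_i \mid w_{i-n+1}^{i-1}\right) = \frac{c\left(w_{i-n+1}^{i}\right) + \alpha}{c\left(w_{i-n+1}^{i-1}\right) + \alpha\,|V|}
\]

\[
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P_\alpha\left(w_i \mid w_{i-n+1}^{i-1}\right)\right)
\]

With \(\alpha = 1\) this reduces to Laplace (add-one) smoothing; smaller values of \(\alpha\) move less probability mass onto unseen n-grams.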

Phase 14's scope is the historical baseline and its mechanical failure mode. Nothing more — the goal is to make Phase 15's attention derivation land by knowing what it replaced.