00 — Why pre-transformer sequence models, at all

Phase 14 is deliberately short. It exists for two reasons: to establish an honest numerical baseline (the perplexity of an n-gram model on the conjugation corpus), and so that in Phase 15 we derive attention against something rather than in a vacuum. Without this phase, the transformer is magic. With it, the transformer is an answer to concrete problems.


The position of Phase 14 in the arc

You arrive at Phase 14 with: a hand-built BPE tokenizer (Phase 11), a curated corpus of English verb conjugations + Spanish translations (Phase 12), and learned embeddings of every token in that corpus (Phase 13). What you do not have, yet, is a language model — a probability distribution over the next token given the prior tokens.

Phase 15 will build a language model out of attention. Phase 17 will assemble that into a Mini-GPT. Phase 18 will train it.

Before any of that, two things must exist that don't yet:

  1. A perplexity baseline. Without one, "Phase 18 trained the model to perplexity X" is a number with no context. Is X good? Is X bad? You can't say without a baseline.
  2. A mechanical intuition for why attention exists at all. Most textbooks present attention as a fait accompli. The math falls out of the sky. Then it works. Phase 14 lives in the world that attention replaced, so by Phase 15 the math feels like an answer, not a definition.
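
For reference, the baseline number is assumed here to follow the standard definition of perplexity: the exponentiated average negative log-likelihood per token. The symbols \(N\) (number of scored tokens) and \(x_t\) (the token at position \(t\)) are notation introduced for this sketch, not taken from the labs.

\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{t=1}^{N} \log p(x_t \mid x_{<t})\right)
\]

Two reference points bracket any baseline: a model that spreads probability uniformly over a vocabulary of size \(V\) scores \(\mathrm{PPL} = V\), and a model that assigns probability 1 to every correct token scores \(\mathrm{PPL} = 1\).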

That's it. Phase 14 doesn't try to be a course on RNNs. It builds the baseline number and the mechanical priors, and then steps aside.

The corpus, briefly

The Phase 12 corpus is English verb conjugations with paired Spanish translations, in a fixed grammar:

  • 20 verbs: 12 regular (work, play, walk, talk, listen, watch, study, finish, start, look, want, like) and 8 irregular (be, have, do, go, come, see, eat, write).
  • 5 tenses: infinitive, present simple, past simple, past participle, simple future.
  • 3 persons: 1st-sg (I), 2nd-sg (you), 3rd-sg (he/she/it).
  • Every English form is paired with its Spanish translation in a deterministic format.

Examples of corpus rows:

I work / yo trabajo
you work / tú trabajas
he works / él trabaja
I worked / yo trabajé
he will work / él trabajará
he is going to work / él va a trabajar

The whole corpus is about 600 form-pairs. Genuinely microscopic. Every token is one of: a pronoun, a verb stem, an inflection suffix (-s, -ed, -ing), an auxiliary (will, is, going, to), or a punctuation/separator token. The vocabulary is small enough that you can fit it on a single page.

The canonical Phase 14 task is conjugation completion: given I work, you work, he ?, the model should rank works higher than work, worked, or working. This is a deterministic mapping in our corpus — there is exactly one right answer for each prompt. That makes evaluation crisp.
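
A minimal sketch of how that ranking could be evaluated. The names `rank_candidates`, `top1_accuracy`, and `toy_score` are hypothetical, and the score table is hand-written for illustration only; any model that returns a score for (prompt, candidate) could be plugged in instead.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (prompt tokens, candidate token) to a score; higher means more likely.
Scorer = Callable[[Sequence[str], str], float]


def rank_candidates(score: Scorer, prompt: Sequence[str], candidates: Sequence[str]) -> List[str]:
    """Sort candidate next tokens from most to least likely under `score`."""
    return sorted(candidates, key=lambda tok: score(prompt, tok), reverse=True)


def top1_accuracy(score: Scorer, examples: Sequence[Tuple[Sequence[str], str]],
                  candidates: Sequence[str]) -> float:
    """Fraction of prompts whose top-ranked candidate is the single correct answer."""
    hits = sum(rank_candidates(score, prompt, candidates)[0] == answer
               for prompt, answer in examples)
    return hits / len(examples)


# Hypothetical stand-in model: a hand-written table keyed on the last prompt token.
TOY_SCORES = {
    ("he", "works"): 0.90,
    ("he", "work"): 0.05,
    ("he", "worked"): 0.03,
    ("he", "working"): 0.02,
}


def toy_score(prompt: Sequence[str], tok: str) -> float:
    return TOY_SCORES.get((prompt[-1], tok), 0.0)


prompt = ["I", "work", ",", "you", "work", ",", "he"]
candidates = ["work", "works", "worked", "working"]
ranking = rank_candidates(toy_score, prompt, candidates)
accuracy = top1_accuracy(toy_score, [(prompt, "works")], candidates)
```

Because each prompt has exactly one right answer, top-1 accuracy and the rank of the correct form are both well defined without any tie-breaking rules.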

Why an n-gram is a real baseline here, not a toy

For most natural-language corpora, n-grams are a comically weak baseline — they have no notion of long-range structure, semantic similarity, or compositionality. They are toys.

For our verb-grammar corpus, an n-gram is the right tool for most of the task. Subject-pronoun → verb-form is a local dependency: the verb form depends on the pronoun immediately preceding it, separated by zero or one auxiliary tokens. A 3-gram or 4-gram window covers it. Add-\(\alpha\) smoothing handles unseen prefixes. The model is dumb but the task is local.
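
The idea above can be sketched as a count-based n-gram model with add-\(\alpha\) smoothing. This is an illustrative sketch, not the lab implementation: the function names, the `<s>` and `</s>` padding tokens, and the six-row toy corpus are all assumptions, and the rows use one token per word rather than the Phase 11 BPE tokens.

```python
import math
from collections import Counter


def train_ngram(sentences, n):
    """Count n-grams and their (n-1)-token contexts, padding with <s> and </s>."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        vocab.update(sent + ["</s>"])  # tokens the model may be asked to predict
        for i in range(n - 1, len(toks)):
            ctx = tuple(toks[i - n + 1:i])
            ngrams[ctx + (toks[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts, vocab


def prob(model, ctx, tok, alpha=1.0):
    """Add-alpha estimate: (count(ctx, tok) + alpha) / (count(ctx) + alpha * |V|)."""
    ngrams, contexts, vocab = model
    ctx = tuple(ctx)
    return (ngrams[ctx + (tok,)] + alpha) / (contexts[ctx] + alpha * len(vocab))


def perplexity(model, sentences, n, alpha=1.0):
    """exp of the average negative log-probability per predicted token."""
    log_sum, count = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(toks)):
            log_sum += math.log(prob(model, toks[i - n + 1:i], toks[i], alpha))
            count += 1
    return math.exp(-log_sum / count)


# Toy rows in the spirit of the corpus examples (English side only).
rows = [["I", "work"], ["you", "work"], ["he", "works"],
        ["I", "play"], ["you", "play"], ["he", "plays"]]

model = train_ngram(rows, n=2)
p_works = prob(model, ["he"], "works")   # seen after "he"
p_work = prob(model, ["he"], "work")     # never seen after "he", kept nonzero by smoothing
ppl = perplexity(model, rows, n=2)
```

Smoothing keeps `p_work` above zero, so an unseen continuation costs a finite amount of perplexity instead of making the whole score infinite.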

This is a feature, not a bug. It means:

  • The Mini-GPT (Phase 17) has to earn its win. It can't beat a 5-gram just by existing.
  • The baseline number you commit at the end of Phase 14 is a hard target, not a strawman.
  • When the transformer eventually pulls ahead, you'll know it's not because the baseline was rigged.

If the n-gram beats the transformer on this corpus, that's information — probably it means the transformer needs more training data, a bigger context, or a different objective. The point of building both is to know which.

Why we still need to leave n-grams behind

Three reasons attention is going to demolish n-grams eventually, even though they're competitive on this corpus:

  1. Generalization across paradigms. An n-gram that learned you work does not infer that you play, you walk, you talk have the same shape. It treats each (pronoun, verb) bigram as independent. A model with shared embeddings (Phase 13) and attention (Phase 15) does generalize: the you embedding sits in one place and routes the same way to every verb.
  2. Tense interactions. The going to future is a 3-token dependency: in he is going to work, the infinitive work follows the three auxiliaries is going to. An n-gram with window \(n < 4\) cannot see all three at once, so it misses the construction. Phase 14's lab demonstrates exactly when n-grams break.
  3. Bilingual alignment. The Spanish side has its own conjugation logic (trabajo / trabajas / trabaja) that has to be aligned with the English side (work / work / works) without leakage. An n-gram learns no alignment — it just memorizes co-occurrence. Attention will learn this alignment as a side effect of its routing.
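
The window limit in the second point can be checked mechanically. The sketch below assumes one token per word, which the Phase 11 BPE tokenizer may not produce, and `context_for` is a hypothetical helper introduced only for this illustration.

```python
def context_for(tokens, position, n):
    """Return the (at most) n-1 tokens an n-gram model sees when predicting tokens[position]."""
    start = max(0, position - (n - 1))
    return tokens[start:position]


row = ["he", "is", "going", "to", "work"]
target = row.index("work")
auxiliaries = ["is", "going", "to"]

sees_auxiliaries = {}
sees_pronoun = {}
for n in range(2, 6):
    ctx = context_for(row, target, n)
    sees_auxiliaries[n] = all(tok in ctx for tok in auxiliaries)
    sees_pronoun[n] = "he" in ctx
```

Under this assumption, a 4-gram is the smallest window that covers is going to when predicting work, and only a 5-gram also sees the pronoun.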

These three reasons are not visible until you've actually built and measured an n-gram. Hence Phase 14.

What we're going to feel by the end of the phase

Two concrete sensations:

  1. Hidden state as a bottleneck. When you watch a NumPy RNN forward pass on ["I", "work", ",", "you", "work", ",", "he", "?"] and you can read every \(h_t\) as an array of 16 or 32 floats, you'll see that all the information about the prior tokens has to fit in that fixed-size vector. The pronoun I happened 7 tokens ago; whatever the model knows about it is in \(h_7\) if it survived the contractions of \(W_{hh}\). Mostly it didn't.
  2. Gradient decay through time. When you compute the gradient norm at each step of a 50-step BPTT and plot it on a log axis, you'll see a steep, nearly straight descent: the norm shrinks by a roughly constant factor per step, which is exponential decay. It's not gradual; it's catastrophic. After about 20 steps, the gradient from the loss to the early tokens is numerically negligible, so a vanilla RNN cannot in practice learn dependencies that long. LSTMs and GRUs flatten the curve but don't fix the underlying problem.
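
Both sensations can be previewed with a few lines of NumPy. This is a rough sketch under stated assumptions, not the lab code: the sizes, the 0.1 weight scale, the random stand-in inputs, and the choice of a loss that touches only the final hidden state are all picked here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed, steps = 16, 8, 50

# Assumed small weight scale, so the recurrent map contracts (the regime described above).
W_xh = rng.normal(0.0, 0.1, size=(hidden, embed))
W_hh = rng.normal(0.0, 0.1, size=(hidden, hidden))
b_h = np.zeros(hidden)

# Stand-in inputs; real runs would feed the Phase 13 embeddings instead.
xs = rng.normal(size=(steps, embed))

# Forward pass: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), with h_0 = 0.
# Every h_t is a fixed-size vector, however many tokens came before it.
hs = [np.zeros(hidden)]
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1] + b_h))

# Backward pass through time for a loss that depends only on the final state.
grad = np.ones(hidden)  # arbitrary seed for dL/dh_T
norms = []
for t in range(steps, 0, -1):
    norms.append(np.linalg.norm(grad))            # ||dL/dh_t||
    grad = W_hh.T @ (grad * (1.0 - hs[t] ** 2))   # chain rule through tanh, gives dL/dh_{t-1}
norms = norms[::-1]  # norms[k] is ||dL/dh_{k+1}||, earliest step first

decay_ratio = norms[0] / norms[-1]
```

Plotting `norms` against the step index on a log axis gives the curve described in the second point; `decay_ratio` summarizes how much of the final-step gradient reaches the first token. With larger recurrent weights the same loop can show the opposite failure, exploding gradients.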

If both of those sensations land, Phase 14 has done its job. The rest is bookkeeping (write the number, commit the plot, move on).

What this phase is NOT going to do

For clarity, the following are out of scope in Phase 14:

  • Train the RNN/GRU/LSTM. Forward pass only. Spec is "no need to fully train".
  • Compare to PyTorch. Anti-goal §10 — no PyTorch until Phase 24. Borja's NumPy RNN is checked against itself.
  • Implement Mamba, S4, RWKV, or any modern recurrent revival. That's Phase 36 (frontier architectures) territory.
  • Cover seq2seq, encoder-decoder, or cross-attention beyond a one-line note. The curriculum is decoder-only.
  • Cover Kneser-Ney smoothing in depth. Add-\(\alpha\) is the default; KN is a 2-paragraph aside in theory/01-ngram-models.md.
  • Cover plural persons. Per A13, we / you-pl / they / they-fem / they-pl are deferred. Examples in Phase 14 stay in 1st-sg / 2nd-sg / 3rd-sg.

The path through Phase 14

  • Theory 01 covers n-grams in enough detail that you can compute the baseline number by hand on a 20-token toy corpus.
  • Theory 02 derives the vanilla RNN, then the GRU, then sketches the LSTM. The framing is: each one is a state-machine, and each one is a patch on the previous.
  • Theory 03 is the BPTT derivation. It's where the vanishing/exploding gradient comes from.
  • Labs 00–03 implement everything on the verb-grammar corpus.

Stop here if

You're tempted to skip Phase 14 because "I'll just train the Mini-GPT directly". Don't. The Mini-GPT will either (a) beat the n-gram and you won't know by how much, or (b) lose to it and you won't know why. Phase 14 is a 10-15 hour investment that pays off across every later evaluation.

The intuition you need when you leave this phase: "recurrent models can work for local tasks like our conjugations, but their mechanism (a fixed-size hidden state plus chained multiplications through time) makes them unable to scale". Attention is the answer to both problems in a single idea.


Next: theory/01-ngram-models.md.