Lab 02 — RNN forward pass by hand

Goal: stare at a recurrence long enough to feel it. NumPy forward pass on I work, you work, he ___ — no training.

Estimated time: 60–90 minutes.

Prereq: lab 00 (tokenized corpus), lab 01 (n-gram baseline) committed.


What you produce

A directory experiments/14-conjugation-completion/ containing:

  • rnn_forward.py — Vanilla RNN forward pass implementation (per src/minimodel/sequence_baselines/rnn.py blueprint).
  • gru_forward.py — GRU forward pass.
  • walkthrough.py — runs both on the canonical example, prints every hidden state.
  • walkthrough.txt — the printed output, committed.
  • hidden_state_evolution.png — visualization of \(\|h_t\|\) over time, optionally a heatmap of \(h_t\) values.
  • manifest.json.
  • README.md (2–3 paragraphs).

The example

The canonical Phase 14 example is:

tokens: <bos> I work , you work , he
ids:    [...]                          (from lab 00 tokenization)

You will run an untrained vanilla RNN and an untrained GRU on this sequence. No training. The point is to see the recurrence operate, not to get the right answer.

After the forward pass, you compute the logits at the final position (he) and read off the top-5 predicted tokens. For a random-init model, this should look random — not informative. That's expected. The lab is about mechanism, not accuracy.
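
The top-5 readout described above can be sketched as follows. This is a minimal sketch, not the blueprint's API: the helper name `top_k`, the `id_to_token` list, and the toy vocabulary and logits are all hypothetical and stand in for whatever lab 00 produced.

```python
import numpy as np

def top_k(logits, id_to_token, k=5):
    """Return the k highest-scoring (token, logit) pairs, best first."""
    # np.argsort sorts ascending, so take the last k indices and reverse them.
    order = np.argsort(logits)[-k:][::-1]
    return [(id_to_token[i], float(logits[i])) for i in order]

# Toy vocabulary and logits, invented for illustration only.
vocab = ["<bos>", "I", "work", ",", "you", "he", "works"]
logits = np.array([0.10, -0.30, 0.25, 0.05, -0.10, 0.00, 0.31])

for rank, (token, value) in enumerate(top_k(logits, vocab, k=3), start=1):
    print(f"rank={rank}  token={token!r}  logit={value:.2f}")
```

With a random-init model the ranking carries no meaning; the helper only makes the readout reproducible.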

TODOs

Block A — implement Vanilla RNN forward

Per src/minimodel/sequence_baselines/rnn.py blueprint:

import numpy as np

class VanillaRNN:
    def __init__(self, vocab_size, d_embed, d_hidden, seed=42):
        # Initialize W_xh, W_hh, W_ho, b_h, b_o with small random values.
        # Also: embedding matrix E[vocab_size, d_embed].
        ...

    def forward(self, token_ids: list[int]) -> tuple[np.ndarray, list[np.ndarray]]:
        # Returns (final_logits, all_hidden_states).
        # all_hidden_states[t] is h_t after processing token_ids[t].
        ...

  • Random init: zero-mean Gaussian with standard deviation 0.1, i.e. \(W_{hh}, W_{xh}, W_{ho} \sim \mathcal{N}(0, 0.1^2)\). Biases zero.
  • Embedding init: same scale.
  • Forward pass: for each token, compute \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\). Record every \(h_t\).
  • At the final step, compute \(\hat y = W_{ho} h_T + b_o\).
  • Print the top-5 tokens by logit value.

Use \(d_\text{embed} = 16\), \(d_\text{hidden} = 32\). Lock these so multiple runs are comparable.
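
The recurrence in the bullets above can be sketched in a few lines of NumPy. This is a function-style sketch under stated assumptions, not the blueprint's class: the names `init_params` and `rnn_forward`, the parameter dictionary, the zero initial state, the vocabulary size of 10, and the token ids are all hypothetical.

```python
import numpy as np

def init_params(vocab_size, d_embed=16, d_hidden=32, seed=42, scale=0.1):
    """Gaussian init with standard deviation `scale`; biases are zero."""
    rng = np.random.default_rng(seed)
    return {
        "E": rng.normal(0.0, scale, (vocab_size, d_embed)),
        "W_xh": rng.normal(0.0, scale, (d_hidden, d_embed)),
        "W_hh": rng.normal(0.0, scale, (d_hidden, d_hidden)),
        "W_ho": rng.normal(0.0, scale, (vocab_size, d_hidden)),
        "b_h": np.zeros(d_hidden),
        "b_o": np.zeros(vocab_size),
    }

def rnn_forward(params, token_ids):
    """Return (final_logits, hidden_states), one hidden state per token."""
    h = np.zeros(params["b_h"].shape[0])  # assumption: the initial state is zero
    hidden_states = []
    for token_id in token_ids:
        x_t = params["E"][token_id]  # embedding lookup
        h = np.tanh(params["W_hh"] @ h + params["W_xh"] @ x_t + params["b_h"])
        hidden_states.append(h)
    logits = params["W_ho"] @ h + params["b_o"]  # read out at the final step only
    return logits, hidden_states

# Placeholder ids; the real ones come from the lab 00 tokenizer.
params = init_params(vocab_size=10)
logits, states = rnn_forward(params, [0, 1, 2, 3, 4, 2, 3, 5])
print(logits.shape, len(states), states[-1].shape)
```

Drawing every matrix from one seeded generator in a fixed order is what makes repeated runs comparable.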

Block B — implement GRU forward

Per the blueprint, write class GRU with the same API. The forward pass uses the GRU recurrence:

\[ z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \]
\[ r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \]
\[ \tilde h_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b) \]
\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t \]

Implementation notes:

  • \(W_z, W_r, W \in \mathbb{R}^{d_h \times (d_h + d_\text{embed})}\).
  • \(\sigma(x) = 1 / (1 + e^{-x})\); for stability, use \(\sigma(x) = e^x / (1 + e^x)\) when \(x < 0\).
  • Initialize biases \(b_z, b_r\) to zero (avoid forget-bias tricks here; that's a Phase 18 concern).
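
One GRU step, together with the stable sigmoid from the notes, might look like the sketch below. The function names `sigmoid` and `gru_step`, the argument order, and the random test inputs are assumptions made for illustration; the blueprint's class may organize this differently.

```python
import numpy as np

def sigmoid(x):
    """Numerically stable logistic function for array inputs."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))  # exp argument is <= 0 here
    exp_x = np.exp(x[~pos])                    # exp argument is < 0 here
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

def gru_step(h_prev, x_t, W_z, W_r, W, b_z, b_r, b):
    """One step of the GRU recurrence; returns (h_t, z_t, r_t)."""
    hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    z = sigmoid(W_z @ hx + b_z)              # update gate
    r = sigmoid(W_r @ hx + b_r)              # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]) + b)
    h = (1.0 - z) * h_prev + z * h_tilde     # interpolate old state and candidate
    return h, z, r

rng = np.random.default_rng(42)
d_h, d_e = 32, 16
W_z, W_r, W = (rng.normal(0.0, 0.1, (d_h, d_h + d_e)) for _ in range(3))
b_z = b_r = b = np.zeros(d_h)
h, z, r = gru_step(np.zeros(d_h), rng.normal(0.0, 0.1, d_e), W_z, W_r, W, b_z, b_r, b)
print(h.shape, float(z.min()), float(z.max()))
```

Returning the gates alongside the new state makes it easy to print or plot them during the walkthrough.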

Block C — the walkthrough

walkthrough.py runs both models on the canonical example and prints:

=== Vanilla RNN forward on "I work , you work , he" ===
seed: 42
config: d_embed=16, d_hidden=32

t=0  token='<bos>'  x_t=[0.05, -0.12, ...]   h_t=[0.00, 0.00, ...]  ||h_t||=0.00
t=1  token='I'      x_t=[0.18, -0.05, ...]   h_t=[0.04, 0.07, ...]  ||h_t||=0.39
t=2  token='work'   x_t=[0.07, 0.21, ...]    h_t=[0.06, -0.02, ...]  ||h_t||=0.44
...
t=8  token='he'     x_t=[...]                h_t=[...]              ||h_t||=0.52

final_logits top-5:
  rank=1  token='trabajaron'  logit=0.31
  rank=2  token='/'           logit=0.27
  ...

=== GRU forward on same sequence ===
...

Commit the output as walkthrough.txt.
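
A row of the printed walkthrough could be produced with a small formatter like the one below. The helper name `format_row` and the `n_show` truncation parameter are hypothetical, and the sketch does not reproduce the column padding of the sample output above; raise `n_show` if you want the full vectors in the file.

```python
import numpy as np

def format_row(t, token, x_t, h_t, n_show=2):
    """Format one time step: leading entries of x_t and h_t, plus the L2 norm of h_t."""
    def preview(v):
        return "[" + ", ".join(f"{value:.2f}" for value in v[:n_show]) + ", ...]"
    norm = float(np.linalg.norm(h_t))
    return (f"t={t}  token={token!r}  x_t={preview(x_t)}  "
            f"h_t={preview(h_t)}  ||h_t||={norm:.2f}")

# Made-up vectors, chosen so the norm is easy to check by hand.
row = format_row(1, "I", np.array([0.18, -0.05, 0.30]), np.array([0.3, 0.4, 0.0]))
print(row)
```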

Block D — visualize state evolution

hidden_state_evolution.png: a plot with the x-axis as time step \(t\), and either:

  • (A) a single curve of \(\|h_t\|_2\) over time (simpler), or
  • (B) a heatmap of \(h_t\) values with time on x-axis and hidden-dim index on y-axis (richer).

Either is acceptable. Plot both vanilla RNN and GRU side-by-side.
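
Option (A) could be sketched as below, assuming matplotlib is acceptable for the plotting step (the lab does not name a plotting library). The function names `state_norms` and `plot_norms` are hypothetical. matplotlib is imported inside the plotting function so that the norm computation stays NumPy-only.

```python
import numpy as np

def state_norms(hidden_states):
    """L2 norm of each hidden state, as a 1-D array indexed by time step."""
    return np.linalg.norm(np.stack(hidden_states), axis=1)

def plot_norms(rnn_states, gru_states, path="hidden_state_evolution.png"):
    """Save side-by-side norm curves for the two models."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
    for ax, states, title in zip(axes, (rnn_states, gru_states), ("Vanilla RNN", "GRU")):
        ax.plot(state_norms(states), marker="o")
        ax.set_title(title)
        ax.set_xlabel("time step t")
    axes[0].set_ylabel("||h_t||_2")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)

# Made-up states whose norms are easy to check by hand.
demo = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
print(state_norms(demo))
```

Call `plot_norms` with the two lists of hidden states returned by your forward passes. A shared y-axis keeps the two panels directly comparable.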

Block E — interpret

In README.md, answer:

  1. Is the top-1 prediction at he semantically reasonable? For a random-init model: no, it's random. Confirm.
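  2. How does \(\|h_t\|\) evolve? Does it grow, shrink, or stabilize? For a randomly-initialized \(W_{hh}\) with \(\sigma = 0.1\), you'd expect \(\|h_t\|\) to settle into a narrow band rather than grow: the spectral radius of \(W_{hh}\) is roughly \(\sigma \sqrt{d_\text{hidden}} \approx 0.57 < 1\), so the recurrent term contracts, and tanh keeps every component in \((-1, 1)\). Confirm empirically.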
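  3. Does the GRU's \(h_t\) look different from the RNN's? Eyeball the heatmaps (or the norm curves, if you chose option A). Differences should be visible: with \(z_t \approx 0.5\) at init, each GRU state is roughly an average of \(h_{t-1}\) and \(\tilde h_t\), so it changes more smoothly from step to step.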
  4. What information do you suspect is in \(h_8\)? It's untrained, so probably nothing useful. But conceptually: after seeing I work, you work, he, an ideal \(h_8\) should encode "subject is 3rd-singular, tense is present-simple, expect verb agreement with -s". Note this aspirational reading.
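
For question 2, one optional sanity check is to look at the spectral radius of \(W_{hh}\) at init. The sketch below assumes the lab's settings (standard deviation 0.1, \(d_\text{hidden} = 32\), seed 42) but draws its own matrix, so the exact value will differ from your model's. The \(\sigma \sqrt{d_\text{hidden}}\) estimate is an approximation for large random Gaussian matrices, not an exact result.

```python
import numpy as np

rng = np.random.default_rng(42)
d_hidden, scale = 32, 0.1
W_hh = rng.normal(0.0, scale, (d_hidden, d_hidden))

# Largest eigenvalue magnitude; values below 1 suggest the linear part contracts.
spectral_radius = float(np.max(np.abs(np.linalg.eigvals(W_hh))))
estimate = scale * np.sqrt(d_hidden)
print(f"spectral radius = {spectral_radius:.2f}, rough estimate = {estimate:.2f}")
```

A radius well below 1 is consistent with a state norm that neither explodes nor depends strongly on distant inputs, but the printed norms in walkthrough.txt remain the actual evidence.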

Constraints

  • No training. Random init, forward only. The point is mechanism, not accuracy.
  • No PyTorch. NumPy + standard library only.
  • Same seed across both models. Otherwise comparing them is meaningless.
  • Lock \(d_\text{embed} = 16, d_\text{hidden} = 32\). Other choices are fine for personal exploration but the committed walkthrough uses these.

Stop conditions

Done when:

  1. walkthrough.txt exists and prints every \(h_t\) for both models.
  2. hidden_state_evolution.png shows the state norm or heatmap over time.
  3. README.md answers the four interpretation questions.
  4. You can read your own walkthrough.txt and explain what the model did at each step — even though the predictions are random.

Pitfalls

  • All \(h_t\) are zero. With \(b_h = 0\) and \(h_0 = 0\), the first state is just \(\tanh(W_{xh} x_t)\), so the likely cause is an input term small enough that \(\tanh(\text{small}) \approx 0\) at the printed precision. Either raise the init scale, or check your NumPy operations.
  • All \(h_t\) are saturated (close to \(\pm 1\)). Init scale too large. Reduce to \(\sigma = 0.05\) and re-run.
  • GRU collapses to vanilla RNN. This happens when \(z_t \approx 1\) everywhere: then \(h_t \approx \tilde h_t\), which is the vanilla recurrence apart from the reset gate \(r_t\). Check \(z_t\) values; with zero biases and small random weights they should sit near 0.5, roughly within \([0.3, 0.7]\).
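  • sigmoid overflows on large-magnitude negative inputs. The naive \(1 / (1 + e^{-x})\) evaluates \(e^{-x}\), which overflows to inf (NumPy emits a RuntimeWarning) when \(x\) is very negative. Use the numerically stable form in Block B.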
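
The first two pitfalls can be checked mechanically rather than by eye. The helper below is a hypothetical sketch; the name `diagnose` and the thresholds `zero_tol` and `sat_level` are arbitrary choices, not values from the lab.

```python
import numpy as np

def diagnose(hidden_states, zero_tol=1e-6, sat_level=0.99):
    """Flag all-zero states and report the fraction of near-saturated units."""
    H = np.abs(np.stack(hidden_states))  # shape: (time steps, d_hidden)
    return {
        "all_zero": bool(np.all(H < zero_tol)),
        "saturated_fraction": float(np.mean(H > sat_level)),
    }

# Made-up states illustrating each failure mode.
zero_report = diagnose([np.zeros(4), np.zeros(4)])
sat_report = diagnose([np.array([1.0, -1.0, 0.2, 0.999])])
print(zero_report)
print(sat_report)
```

Run it on the list of hidden states returned by either forward pass before committing walkthrough.txt.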

When to consult solutions/

Only after committing the files. The reference solution at solutions/02-rnn-by-hand-ref.md (written at phase open) lets you compare your printed states against it and discusses what a trained model would have done.


Next lab: lab/03-vanishing-empirical.md.