Lab 02 — RNN forward pass by hand
Goal: stare at a recurrence long enough to feel it. NumPy forward pass on "I work, you work, he ___", with no training.
Estimated time: 60–90 minutes.
Prereq: lab 00 (tokenized corpus), lab 01 (n-gram baseline) committed.
What you produce
A directory experiments/14-conjugation-completion/ containing:
- rnn_forward.py: Vanilla RNN forward pass implementation (per the src/minimodel/sequence_baselines/rnn.py blueprint).
- gru_forward.py: GRU forward pass.
- walkthrough.py: runs both on the canonical example, prints every hidden state.
- walkthrough.txt: the printed output, committed.
- hidden_state_evolution.png: visualization of \(\|h_t\|\) over time, optionally a heatmap of \(h_t\) values.
- manifest.json.
- README.md (2–3 paragraphs).
The example
The canonical Phase 14 example is:
    I work, you work, he ___
You will run an untrained vanilla RNN and an untrained GRU on this sequence. No training. The point is to see the recurrence operate, not to get the right answer.
After the forward pass, you compute the logits at the final position (he) and read off the top-5 predicted tokens. For a random-init model, this should look random — not informative. That's expected. The lab is about mechanism, not accuracy.
TODOs
Block A — implement Vanilla RNN forward
Per src/minimodel/sequence_baselines/rnn.py blueprint:
import numpy as np


class VanillaRNN:
    def __init__(self, vocab_size, d_embed, d_hidden, seed=42):
        # Initialize W_xh, W_hh, W_ho, b_h, b_o with small random values.
        # Also: embedding matrix E[vocab_size, d_embed].
        ...

    def forward(self, token_ids: list[int]) -> tuple[np.ndarray, list[np.ndarray]]:
        # Returns (final_logits, all_hidden_states).
        # all_hidden_states[t] is h_t after processing token_ids[t].
        ...
- Random init: \(W_{hh}, W_{xh}, W_{ho} \sim \mathcal{N}(0, \sigma^2)\) entrywise, with standard deviation \(\sigma = 0.1\). Biases zero.
- Embedding init: same scale.
- Forward pass: for each token, compute \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\). Record every \(h_t\).
- At the final step, compute \(\hat y = W_{ho} h_T + b_o\).
- Print the top-5 tokens by logit value.
Use \(d_\text{embed} = 16\), \(d_\text{hidden} = 32\). Lock these so multiple runs are comparable.
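The recurrence and readout above can be sketched as a plain function before wrapping them in the class. This is a minimal sketch, not the blueprint's implementation: the function name rnn_forward, the vocabulary size of 50, and the token ids are hypothetical, and \(h_0 = 0\) is assumed as the initial state.

```python
import numpy as np


def rnn_forward(token_ids, E, W_xh, W_hh, W_ho, b_h, b_o):
    """Run the vanilla recurrence; return (final_logits, all_hidden_states)."""
    h = np.zeros(W_hh.shape[0])          # h_0 = 0 (assumed initial state)
    hidden_states = []
    for tok in token_ids:
        x = E[tok]                       # embedding lookup, shape (d_embed,)
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        hidden_states.append(h)
    logits = W_ho @ h + b_o              # read out at the final step only
    return logits, hidden_states


# Hypothetical vocabulary size and token ids, for illustration only.
vocab_size, d_embed, d_hidden = 50, 16, 32
rng = np.random.default_rng(42)
E = rng.normal(0.0, 0.1, (vocab_size, d_embed))
W_xh = rng.normal(0.0, 0.1, (d_hidden, d_embed))
W_hh = rng.normal(0.0, 0.1, (d_hidden, d_hidden))
W_ho = rng.normal(0.0, 0.1, (vocab_size, d_hidden))
b_h = np.zeros(d_hidden)
b_o = np.zeros(vocab_size)

logits, states = rnn_forward([0, 5, 9, 2, 7, 9, 2, 11], E, W_xh, W_hh, W_ho, b_h, b_o)
top5 = np.argsort(logits)[::-1][:5]      # token ids with the five largest logits
for rank, tok in enumerate(top5, start=1):
    print(f"rank={rank} token_id={tok} logit={logits[tok]:.2f}")
```

In the lab itself the token ids come from the lab 00 tokenizer, and the printed ids would be mapped back to token strings.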
Block B — implement GRU forward
Per the blueprint, write class GRU with the same API. The forward pass uses the GRU recurrence:
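One standard way to write this recurrence is sketched here. It is chosen to agree with the concatenated weight shapes in the implementation notes and with the convention used in the pitfalls, where \(z_t \approx 1\) gives \(h_t \approx \tilde h_t\). The candidate bias name \(b_h\) and the ordering \([h_{t-1}; x_t]\) inside the concatenation are assumptions; defer to the blueprint if it differs.

\[
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}; x_t] + b_z) \\
r_t &= \sigma(W_r [h_{t-1}; x_t] + b_r) \\
\tilde h_t &= \tanh(W [r_t \odot h_{t-1}; x_t] + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t
\end{aligned}
\]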
Implementation notes:
- \(W_z, W_r, W \in \mathbb{R}^{d_h \times (d_h + d_\text{embed})}\).
- \(\sigma(x) = 1 / (1 + e^{-x})\); for stability, use \(\sigma(x) = e^x / (1 + e^x)\) when \(x < 0\).
- Initialize biases \(b_z, b_r\) to zero (avoid forget-bias tricks here; that's a Phase 18 concern).
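The notes above translate into a few lines of NumPy. The following is a sketch under assumptions rather than the blueprint's code: the names stable_sigmoid and gru_step are hypothetical, the candidate bias b_h is assumed to exist and be zero, and the update convention is the one where \(z_t \approx 1\) yields \(h_t \approx \tilde h_t\). The sigmoid is algebraically the two-branch form from the notes, written so that only non-positive numbers are ever exponentiated.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that never exponentiates a positive number."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))               # always in (0, 1], so no overflow
    # x >= 0: 1 / (1 + e^{-x});  x < 0: e^{x} / (1 + e^{x})
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def gru_step(h_prev, x, W_z, W_r, W, b_z, b_r, b_h):
    """One GRU step on the concatenated vector [h_prev; x]; returns (h_t, z_t)."""
    hx = np.concatenate([h_prev, x])
    z = stable_sigmoid(W_z @ hx + b_z)   # update gate
    r = stable_sigmoid(W_r @ hx + b_r)   # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]) + b_h)
    return (1.0 - z) * h_prev + z * h_tilde, z


# Illustrative call with the lab's locked sizes and a random stand-in embedding.
d_embed, d_hidden = 16, 32
rng = np.random.default_rng(42)
W_z, W_r, W = (rng.normal(0.0, 0.1, (d_hidden, d_hidden + d_embed)) for _ in range(3))
b = np.zeros(d_hidden)
h1, z1 = gru_step(np.zeros(d_hidden), rng.normal(0.0, 0.1, d_embed), W_z, W_r, W, b, b, b)
print("z range:", z1.min(), z1.max())
print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # extremes map to 0 and 1
```

Returning \(z_t\) alongside \(h_t\) is a convenience for the walkthrough: it makes the gate values easy to print and inspect.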
Block C — the walkthrough
walkthrough.py runs both models on the canonical example and prints:
=== Vanilla RNN forward on "I work , you work , he" ===
seed: 42
config: d_embed=16, d_hidden=32
t=0 token='<bos>' x_t=[0.05, -0.12, ...] h_t=[0.00, 0.00, ...] ||h_t||=0.00
t=1 token='I' x_t=[0.18, -0.05, ...] h_t=[0.04, 0.07, ...] ||h_t||=0.39
t=2 token='work' x_t=[0.07, 0.21, ...] h_t=[0.06, -0.02, ...] ||h_t||=0.44
...
t=8 token='he' x_t=[...] h_t=[...] ||h_t||=0.52
final_logits top-5:
rank=1 token='trabajaron' logit=0.31
rank=2 token='/' logit=0.27
...
=== GRU forward on same sequence ===
...
Commit the output as walkthrough.txt.
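A small formatting helper keeps the per-step lines consistent between the two models. This is one possible sketch; the helper name format_step and the choice to show only the first two entries of each vector are assumptions read off the sample output above.

```python
import numpy as np


def format_step(t, token, x_t, h_t, n_show=2):
    """Format one walkthrough line, showing the first n_show entries of each vector."""
    def head(v):
        return "[" + ", ".join(f"{val:.2f}" for val in v[:n_show]) + ", ...]"
    norm = np.linalg.norm(h_t)           # Euclidean norm ||h_t||_2
    return f"t={t} token={token!r} x_t={head(x_t)} h_t={head(h_t)} ||h_t||={norm:.2f}"


# Made-up three-dimensional vectors, only to exercise the formatting.
line = format_step(1, "I", np.array([0.18, -0.05, 0.3]), np.array([0.04, 0.07, 0.1]))
print(line)
```

Writing the same strings to walkthrough.txt, rather than redirecting stdout by hand, makes it harder for the committed file to drift from what the script prints.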
Block D — visualize state evolution
hidden_state_evolution.png: a plot with the x-axis as time step \(t\), and either:
- (A) a single curve of \(\|h_t\|_2\) over time (simpler), or
- (B) a heatmap of \(h_t\) values with time on x-axis and hidden-dim index on y-axis (richer).
Either is acceptable. Plot both vanilla RNN and GRU side-by-side.
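Option (A) with side-by-side panels might look like the sketch below. It assumes matplotlib is acceptable for plotting (the NumPy-only constraint presumably targets the model code); the function name plot_norms and the demo filename are hypothetical, and random arrays stand in for the real hidden states.

```python
import matplotlib
matplotlib.use("Agg")                    # render to file, no display needed
import matplotlib.pyplot as plt
import numpy as np


def plot_norms(rnn_states, gru_states, path):
    """Plot ||h_t||_2 over time for both models in side-by-side panels."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
    for ax, states, title in zip(axes, (rnn_states, gru_states), ("Vanilla RNN", "GRU")):
        norms = np.linalg.norm(np.asarray(states), axis=1)  # one norm per time step
        ax.plot(np.arange(len(norms)), norms, marker="o")
        ax.set_title(title)
        ax.set_xlabel("time step t")
    axes[0].set_ylabel("||h_t||_2")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Placeholder random states (9 steps, d_hidden = 32) stand in for the real ones.
rng = np.random.default_rng(0)
plot_norms(rng.normal(size=(9, 32)), rng.normal(size=(9, 32)), "hidden_state_evolution_demo.png")
```

Sharing the y-axis makes the two norm curves directly comparable. For option (B), ax.imshow on the transposed state array would put time on the x-axis and hidden-dim index on the y-axis.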
Block E — interpret
In README.md, answer:
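- Is the top-1 prediction at he semantically reasonable? For a random-init model: no, it's random. Confirm.
- How does \(\|h_t\|\) evolve? Does it grow, shrink, or stabilize? For a randomly initialized \(W_{hh}\) with \(\sigma = 0.1\), you'd expect \(\|h_t\|\) to plateau around a fixed value: the spectral radius of \(W_{hh}\) is roughly \(\sigma \sqrt{d_\text{hidden}} \approx 0.57 < 1\), so the recurrent term tends to shrink the state rather than amplify it, and tanh bounds every component in \((-1, 1)\). Confirm empirically.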
- Does the GRU's \(h_t\) look different from the RNN's? Eyeball the heatmaps. Differences should be visible — the GRU's gating produces less-noisy state evolution.
- What information do you suspect is in \(h_8\)? It's untrained, so probably nothing useful. But conceptually: after seeing "I work, you work, he", an ideal \(h_8\) should encode "subject is 3rd-singular, tense is present-simple, expect verb agreement with -s". Note this aspirational reading.
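For the question about how \(\|h_t\|\) evolves, one quick diagnostic is the spectral radius of \(W_{hh}\), the largest eigenvalue magnitude. The sketch below draws a matrix with the lab's init scale and compares it with the rough large-matrix estimate \(\sigma \sqrt{d_\text{hidden}}\). The seed and the freshly drawn matrix are illustrative assumptions; in the lab you would inspect your model's own \(W_{hh}\).

```python
import numpy as np

d_hidden = 32
rng = np.random.default_rng(42)
W_hh = rng.normal(0.0, 0.1, (d_hidden, d_hidden))

# Largest eigenvalue magnitude; below 1 means that, in the near-linear regime
# of tanh, repeated application of W_hh eventually shrinks the state.
spectral_radius = np.max(np.abs(np.linalg.eigvals(W_hh)))
print(f"spectral radius of W_hh: {spectral_radius:.2f}")
print(f"rough estimate 0.1 * sqrt(32): {0.1 * np.sqrt(d_hidden):.2f}")
```

A radius below 1 fits a norm that settles at a level set by the input term rather than growing without bound.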
Constraints
- No training. Random init, forward only. The point is mechanism, not accuracy.
- No PyTorch. NumPy + standard library only.
- Same seed across both models. Otherwise comparing them is meaningless.
- Lock \(d_\text{embed} = 16, d_\text{hidden} = 32\). Other choices are fine for personal exploration but the committed walkthrough uses these.
Stop conditions
Done when:
- walkthrough.txt exists and prints every \(h_t\) for both models.
- hidden_state_evolution.png shows the state norm or heatmap over time.
- README.md answers the four interpretation questions.
- You can read your own walkthrough.txt and explain what the model did at each step, even though the predictions are random.
Pitfalls
- All \(h_t\) are zero. Probably \(b_h = 0\), \(h_0 = 0\), and the weights and embeddings are tiny enough that \(\tanh(0 + \text{small}) \approx 0\). Either raise the init scale, or check your NumPy operations.
- All \(h_t\) are saturated (close to \(\pm 1\)). Init scale too large. Reduce to \(\sigma = 0.05\) and re-run.
- GRU collapses to vanilla RNN. This happens when \(z_t \approx 1\) everywhere (then \(h_t \approx \tilde h_t\), which is the vanilla recurrence). Check \(z_t\) values — they should be in \([0.3, 0.7]\) initially with random init.
- sigmoid(large) returns inf or nan. np.exp overflows for large-magnitude arguments, so use the numerically stable form in Block B.
When to consult solutions/
After committing the files. Solution at solutions/02-rnn-by-hand-ref.md (written at phase open) compares your printed states and discusses what the trained model would have done.
Next lab: lab/03-vanishing-empirical.md.