Lab 03: Vanishing gradient, empirically
Goal: measure the gradient decay through time on a vanilla RNN. Confirm theory file 03 with numbers from your own machine.
Estimated time: 60–90 minutes.
Prereq: lab 02 (RNN forward) committed.
What you produce
A directory experiments/14-vanishing-grad/ containing:
- bptt_norm.py: compute the gradient norm at each timestep via BPTT for a synthetic length-50 sequence.
- decay.json: gradient norm vs time step for vanilla RNN and GRU.
- decay.png: log-y plot of gradient norm vs time.
- manifest.json.
- README.md (2–3 paragraphs).
The setup
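Synthetic task: a length-50 token sequence. The model receives the sequence, runs the forward pass, computes a loss at the final step only (e.g., cross-entropy against a target token), and then backpropagates through time (BPTT). At each step \(t\), you record the norm of the loss gradient with respect to the hidden state:
\[ g_t = \left\lVert \frac{\partial L_{50}}{\partial h_t} \right\rVert_2 \]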
The plot of \(g_t\) vs \(t\) on a log y-axis is the deliverable. Expected shape: \(g_t\) decays from \(g_{50}\) (large) to \(g_0\) (tiny). On a log y-axis, exponential decay appears as a straight line whose slope is the logarithm of the per-step decay factor.
The synthetic sequence
You can use either:
- Option A (cleaner): a long pronoun-verb-separator chain stretched out, e.g., repeated I work , you work , I work , you work , ... for 50 tokens. Target at position 50: the verb form that the first token (the original I) should agree with. The dependency is long-range by construction.
- Option B (simpler): random token IDs. Target = a fixed token. The dependency is artificial but the gradient flow is still meaningful.
Both work for measuring vanishing. Option A is more on-topic with the verb-grammar theme; Option B is faster to set up.
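Either option can be generated in a few lines. The sketch below is one possible construction, not the lab's reference: the four-token vocabulary, the function names, and the one-hot input encoding are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical vocabulary for Option A; the lab does not fix one.
VOCAB = ["I", "you", "work", ","]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def make_option_a(length=50):
    """Repeat 'I work , you work ,' and truncate to `length` token IDs."""
    pattern = ["I", "work", ",", "you", "work", ","]
    tokens = [pattern[i % len(pattern)] for i in range(length)]
    return np.array([TOKEN_TO_ID[t] for t in tokens], dtype=np.int64)


def make_option_b(length=50, vocab_size=4, seed=42):
    """Random token IDs drawn with a fixed seed."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, vocab_size, size=length)


def one_hot(ids, vocab_size):
    """Encode token IDs as rows of a (len(ids), vocab_size) matrix."""
    x = np.zeros((len(ids), vocab_size))
    x[np.arange(len(ids)), ids] = 1.0
    return x


seq_a = make_option_a()
x_a = one_hot(seq_a, len(VOCAB))
target_id = TOKEN_TO_ID["work"]  # verb form agreeing with the first token
```

With this toy vocabulary the verb form is the same for both pronouns, so the target only stands in for the agreement task; a richer vocabulary would make the dependency on the first token real.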
TODOs
Block A: implement BPTT for vanilla RNN
You need the gradient of \(L_{50}\) with respect to each \(h_t\). The recurrence:
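for t in range(1, T + 1):
    # Pre-activation: recurrent term, input term, and bias
    z_t = W_hh @ h[t-1] + W_xh @ x[t] + b_h
    h[t] = np.tanh(z_t)
# The loss is computed at the final step only
logits = W_ho @ h[T] + b_o
loss = cross_entropy(logits, target_id)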
Backward:
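# dL/dlogits for softmax cross-entropy is softmax(logits) - one_hot(target_id);
# softmax and one_hot are assumed helpers, not defined in this lab text
dh[T] = W_ho.T @ (softmax(logits) - one_hot(target_id))
for t in range(T, 0, -1):
    # Backprop through tanh
    dz_t = dh[t] * (1 - h[t]**2)
    # Backprop through the recurrence
    dh[t-1] = W_hh.T @ dz_t
    g[t-1] = np.linalg.norm(dh[t-1])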
- Implement this in bptt_norm.py.
- Use the same model parameters and seed as in lab 02 for reproducibility.
- Run for \(T = 50\).
- Record \(g[0], g[1], \ldots, g[49]\) in decay.json.
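The forward and backward passes above can be combined into one self-contained function. This is a minimal sketch under stated assumptions: the Gaussian init with standard deviation 0.1, zero biases, zero initial state, and the function name are illustrative choices, and the lab 02 parameters should replace them where they differ. Inputs are stored 0-indexed, so row t-1 of x holds the input for step t.

```python
import numpy as np


def rnn_grad_norms(x, target_id, d_hidden=32, scale=0.1, seed=42):
    """Return g[0..T-1], the norms of dL/dh[t], for a random-init tanh RNN.

    x: (T, d_in) array; row t-1 is the input at step t.
    Init scheme and shapes are assumptions, not the lab 02 settings.
    """
    rng = np.random.default_rng(seed)
    T, d_in = x.shape
    W_hh = rng.normal(0.0, scale, (d_hidden, d_hidden))
    W_xh = rng.normal(0.0, scale, (d_hidden, d_in))
    b_h = np.zeros(d_hidden)
    W_ho = rng.normal(0.0, scale, (d_in, d_hidden))
    b_o = np.zeros(d_in)

    # Forward pass, keeping every hidden state for the backward pass
    h = np.zeros((T + 1, d_hidden))
    for t in range(1, T + 1):
        h[t] = np.tanh(W_hh @ h[t - 1] + W_xh @ x[t - 1] + b_h)
    logits = W_ho @ h[T] + b_o
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()

    # Backward pass: dL/dlogits = softmax - one_hot(target)
    dlogits = p.copy()
    dlogits[target_id] -= 1.0
    dh = W_ho.T @ dlogits  # dL/dh[T]
    g = np.zeros(T)
    for t in range(T, 0, -1):
        dz = dh * (1.0 - h[t] ** 2)  # through tanh
        dh = W_hh.T @ dz             # through the recurrence
        g[t - 1] = np.linalg.norm(dh)
    return g


# Usage with random one-hot inputs (Option B style)
ids = np.random.default_rng(0).integers(0, 4, size=50)
g_rnn = rnn_grad_norms(np.eye(4)[ids], target_id=2)
```

Running in float64, as NumPy does by default here, keeps the smallest norms representable even when they fall many orders of magnitude below the largest.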
Block B: repeat for GRU
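Compute the same gradient norms for a GRU. The backward pass is more involved because of the gates. Implementation hint: most of the gradient flows through the additive path \((1 - z_t) \odot h_{t-1}\). Assuming the update \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\), with \(\tilde{h}_t\) the candidate state, the Jacobian \(\partial h_t / \partial h_{t-1}\) for the GRU is:
\[ \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1 - z_t) + \operatorname{diag}(\tilde{h}_t - h_{t-1}) \, \frac{\partial z_t}{\partial h_{t-1}} + \operatorname{diag}(z_t) \, \frac{\partial \tilde{h}_t}{\partial h_{t-1}} \]
If \(z_t\) is small, the first term is close to the identity and the remaining terms carry factors of \(z_t\) or \(z_t (1 - z_t)\) (for sigmoid gates), so gradients flow through with little contraction. This is the key empirical observation of the lab.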
- Implement the GRU backward in bptt_norm.py (or a separate file).
- Same seed, same sequence.
- Record \(g[0], \ldots, g[49]\).
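A hand-written GRU backward is easy to get subtly wrong, so a finite-difference check of the one-step Jacobian is a useful safeguard. The sketch below assumes a common GRU parameterization (sigmoid gates, reset gate applied to the hidden state before the candidate's recurrent matrix, update \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\)); the parameter names and the dictionary layout are hypothetical and may differ from lab 02.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def make_gru_params(d_hidden, d_in, scale=0.1, seed=42):
    """Random-init GRU parameters (hypothetical names and layout)."""
    rng = np.random.default_rng(seed)
    P = {}
    for gate in ("z", "r", "h"):
        P["W_" + gate] = rng.normal(0.0, scale, (d_hidden, d_in))
        P["U_" + gate] = rng.normal(0.0, scale, (d_hidden, d_hidden))
        P["b_" + gate] = np.zeros(d_hidden)
    return P


def gru_step(h, x, P):
    """One GRU step; returns the new state and the intermediate values."""
    z = sigmoid(P["W_z"] @ x + P["U_z"] @ h + P["b_z"])
    r = sigmoid(P["W_r"] @ x + P["U_r"] @ h + P["b_r"])
    h_tilde = np.tanh(P["W_h"] @ x + P["U_h"] @ (r * h) + P["b_h"])
    return (1.0 - z) * h + z * h_tilde, (z, r, h_tilde)


def gru_jacobian(h, x, P):
    """Analytic d h_t / d h_{t-1} for the step above."""
    _, (z, r, h_tilde) = gru_step(h, x, P)
    dz = (z * (1 - z))[:, None] * P["U_z"]   # d z / d h
    dr = (r * (1 - r))[:, None] * P["U_r"]   # d r / d h
    d_rh = np.diag(r) + h[:, None] * dr      # d (r * h) / d h
    dh_tilde = (1 - h_tilde ** 2)[:, None] * (P["U_h"] @ d_rh)
    return np.diag(1 - z) + (h_tilde - h)[:, None] * dz + z[:, None] * dh_tilde


def numeric_jacobian(h, x, P, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    d = h.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (gru_step(h + e, x, P)[0] - gru_step(h - e, x, P)[0]) / (2 * eps)
    return J
```

Because the loss sits at the final step only, multiplying the incoming gradient by the transposed Jacobian of each step gives the gradient one step earlier, which is all the norm measurement needs. Setting the update-gate bias strongly negative is a quick way to see the Jacobian approach the identity.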
Block C: plot
decay.png:
- x-axis: time step \(t\) from 0 to 49.
- y-axis: \(g_t\) on log scale.
- Two lines: vanilla RNN (red) and GRU (blue).
- Annotate: "vanilla RNN gradient decays by N orders of magnitude over 50 steps; GRU decays by M".
Compute N and M from your numbers; commit them in README.md.
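N and M follow directly from the first and last recorded norms. A minimal sketch, assuming each curve is a plain list of positive floats ordered from \(g[0]\) to \(g[49]\); the function names are illustrative.

```python
import math


def orders_of_decay(g):
    """Orders of magnitude lost between the last and the first recorded norm."""
    return math.log10(g[-1] / g[0])


def orders_per_10_steps(g):
    """Average decay rate, expressed as orders of magnitude per 10 steps."""
    return 10.0 * orders_of_decay(g) / (len(g) - 1)


# Synthetic check: a curve that halves at every step back in time
g_demo = [0.5 ** (49 - t) for t in range(50)]
n_demo = orders_of_decay(g_demo)        # 49 * log10(2)
rate_demo = orders_per_10_steps(g_demo) # 10 * log10(2)
```

Both helpers divide by \(g[0]\), so any zero produced by underflow has to be replaced before calling them.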
Block D: interpret
In README.md, answer:
- Vanilla RNN decay rate. How many orders of magnitude per 10 steps? Compare to the theoretical prediction in theory/03-vanishing-gradient.md: with \(W_{hh}\) scaled around 0.1, the spectral radius is small, so the decay is fast. Cite your spectral radius estimate.
- GRU decay rate. How many orders per 10 steps? Should be substantially less than vanilla RNN; that's the additive path doing its job.
- Practical implication. If the model's gradient at \(t=0\) is, say, \(10^{-15}\), can the model learn anything about the early tokens? (No. FP32 can represent \(10^{-15}\), but an update that small is far below FP32's relative precision of about \(10^{-7}\), so adding it to a weight of order 0.1 leaves the weight unchanged: it is indistinguishable from a zero update.) State this as the practical fact: vanilla RNNs cannot learn long-range dependencies; this is not a hyperparameter problem.
- Why this motivates attention. Attention's gradient from a loss at position 50 to the input embedding at position 0 flows through one softmax weight, not 50 matrix multiplications. The gradient signal is preserved in one step. Phase 15 derives this.
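Two of the answers above can be backed by short numerical checks: the spectral radius estimate, and the claim that a tiny update leaves an FP32 weight unchanged. The sketch assumes a Gaussian \(W_{hh}\) with standard deviation 0.1 and seed 42; substitute the matrix your experiment actually used.

```python
import numpy as np


def spectral_radius(W):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))


# Assumed init: Gaussian entries with std 0.1, d_hidden = 32
rng = np.random.default_rng(42)
W_hh = rng.normal(0.0, 0.1, (32, 32))
rho = spectral_radius(W_hh)

# FP32 absorption: adding a 1e-15 update to a weight of 0.1 changes nothing
w = np.float32(0.1)
update = np.float32(1e-15)
absorbed = bool((w + update) == w)
```

For Gaussian entries with standard deviation \(\sigma\) in a \(d \times d\) matrix, the spectral radius is roughly \(\sigma \sqrt{d}\), so about 0.57 here; the measured value is the one to cite.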
Block E: manifest
{
  "experiment": "14-vanishing-grad",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "config": {
    "sequence_length": 50,
    "d_hidden": 32,
    "init_scale_W_hh": 0.1,
    "synthetic_target": "Option A"
  },
  "results_summary": {
    "rnn_orders_of_decay_per_10steps": null,
    "gru_orders_of_decay_per_10steps": null,
    "rnn_spectral_radius_W_hh": null
  },
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"}
}
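Filling the manifest from code avoids hand-typed version strings and dates. A minimal sketch, assuming the three result values have already been computed; the function name and argument names are illustrative.

```python
import json
import platform
from datetime import date

import numpy as np


def build_manifest(rnn_orders, gru_orders, rho, seed=42):
    """Assemble the manifest dict with measured values and real versions."""
    return {
        "experiment": "14-vanishing-grad",
        "date": date.today().isoformat(),
        "seed": seed,
        "config": {
            "sequence_length": 50,
            "d_hidden": 32,
            "init_scale_W_hh": 0.1,
            "synthetic_target": "Option A",
        },
        "results_summary": {
            "rnn_orders_of_decay_per_10steps": rnn_orders,
            "gru_orders_of_decay_per_10steps": gru_orders,
            "rnn_spectral_radius_W_hh": rho,
        },
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
    }


# Placeholder numbers for illustration only, not expected results
manifest_text = json.dumps(build_manifest(1.0, 0.1, 0.5), indent=2)
```

Writing manifest_text to manifest.json in the experiment directory completes the step.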
Constraints
- No training. Random-init models. We're measuring the raw gradient flow, not what training would do.
- No PyTorch. Hand-implement BPTT. The hardest part of the lab — but mechanically illuminating.
- Same seed for RNN and GRU. Otherwise the comparison is noise.
Stop conditions
Done when:
- decay.png shows two clearly separated curves (RNN steeper than GRU).
- README.md answers the four interpretation questions.
- The orders-of-decay numbers are committed in manifest.json.
- You can point at the plot and explain why the RNN line is steeper.
Pitfalls
- GRU decays just as fast as RNN. Probably \(z_t\) is close to 1 everywhere — the GRU is acting like a vanilla RNN. Check: print mean and std of \(z_t\) values during forward. If \(z_t > 0.9\) always, your init pushed the gates to saturation. Reduce init scale.
- All gradient norms are zero. Probably tanh saturated and killed everything. Reduce init scale or check the forward for overflow.
- All gradient norms are huge. Probably \(W_{hh}\) has spectral radius > 1. Use orthogonal init with scale 0.9 for a cleaner result.
- Numerical underflow. \(g_0\) might be smaller than the FP32 minimum (\(\sim 10^{-38}\)). If so, use FP64 (dtype=np.float64) and replace any zero with np.finfo(float).tiny for the log plot.
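Two of the fixes above are short enough to sketch: a scaled orthogonal init for \(W_{hh}\) and a floor that keeps zeros off a log axis. Building the orthogonal matrix from a QR decomposition is one common approach, chosen here as an assumption; the function names are illustrative.

```python
import numpy as np


def orthogonal(d, scale=0.9, seed=42):
    """Scaled orthogonal matrix: every singular value equals `scale`."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    q = q * np.sign(np.diag(r))  # fix column signs left arbitrary by QR
    return scale * q


def floor_for_log(g):
    """Replace exact zeros with the smallest positive normal float64."""
    g = np.asarray(g, dtype=np.float64)
    return np.where(g == 0.0, np.finfo(np.float64).tiny, g)


W_orth = orthogonal(32)
g_safe = floor_for_log([0.0, 1e-3, 1.0])
```

With this init every eigenvalue of \(W_{hh}\) has magnitude 0.9, so the recurrence alone contracts the gradient by at most that factor per step, and the tanh derivative accounts for the rest.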
When to consult solutions/
After committing all the files. The solution at solutions/03-vanishing-empirical-ref.md discusses what to expect at different init scales and shows a reference plot.
Phase 14 lab work is complete. Next: /quiz 14, then PHASE_14_REPORT.md, then reflection, then proceed to Phase 15.