Lab 03 — Vanishing gradient, empirically

Goal: measure how the gradient decays through time in a vanilla RNN, and confirm the predictions of theory/03-vanishing-gradient.md with numbers from your own machine.

Estimated time: 60–90 minutes.

Prereq: lab 02 (RNN forward) committed.


What you produce

A directory experiments/14-vanishing-grad/ containing:

  • bptt_norm.py — compute the gradient norm at each timestep via BPTT for a synthetic length-50 sequence.
  • decay.json — gradient norm vs time step for vanilla RNN and GRU.
  • decay.png — log-y plot of gradient norm vs time.
  • manifest.json.
  • README.md (2–3 paragraphs).

The setup

Synthetic task: a length-50 token sequence. The model receives the sequence, runs the forward pass, computes a loss at the final step only (e.g., cross-entropy against a target token), and then backpropagates through time (BPTT). At each step \(t\), you record:

\[ g_t = \left\| \frac{\partial L_{50}}{\partial h_t} \right\|_2 \]

The plot of \(g_t\) vs \(t\) on a log y-axis is the deliverable. Expected shape: \(g_t\) decays from \(g_{50}\) (large, right next to the loss) to \(g_0\) (tiny). On a log y-axis, the slope of the curve is the decay rate in orders of magnitude per step.
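
One way to read that slope, as a sketch under the assumption that each backward step shrinks the norm by a roughly constant factor (the symbol \(\rho\) is introduced here for illustration and is not defined elsewhere in the lab), with \(T = 50\):

\[ g_t \approx g_T \, \rho^{\,T - t} \quad \Longrightarrow \quad \log_{10} g_t \approx \log_{10} g_T - (T - t) \, \log_{10} \frac{1}{\rho} \]

Under that assumption the curve is close to a straight line on a log y-axis, rising toward \(t = T\) with slope \(\log_{10}(1/\rho)\) orders of magnitude per step, or \(10 \log_{10}(1/\rho)\) orders per 10 steps. A visibly curved line suggests the contraction factor changes along the sequence.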

The synthetic sequence

You can use either:

  • Option A (cleaner): a long pronoun-verb-separator chain stretched out, e.g., repeated I work , you work , I work , you work , ... for 50 tokens. Target at position 50: the verb form that the first token (the original I) should agree with. The dependency is long-range by construction.
  • Option B (simpler): random token IDs. Target = a fixed token. The dependency is artificial but the gradient flow is still meaningful.

Both work for measuring vanishing. Option A is more on-topic with the verb-grammar theme; Option B is faster to set up.
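
A minimal sketch of both options, for orientation only. The four-token vocabulary, the `TOK` mapping, and the function names `option_a` and `option_b` are assumptions made for this example; if lab 02 already defines a vocabulary, reuse that instead.

```python
import numpy as np

# Hypothetical vocabulary for illustration; not taken from lab 02.
VOCAB = ["I", "you", "work", ","]
TOK = {w: i for i, w in enumerate(VOCAB)}


def option_a(length=50):
    """Option A: repeat 'I work , you work ,' and cut to `length` token IDs."""
    pattern = ["I", "work", ",", "you", "work", ","]
    words = [pattern[i % len(pattern)] for i in range(length)]
    return np.array([TOK[w] for w in words], dtype=np.int64)


def option_b(length=50, vocab_size=len(VOCAB), seed=42):
    """Option B: token IDs drawn uniformly at random from the vocabulary."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, vocab_size, size=length)


seq_a = option_a()
seq_b = option_b()
target_id = TOK["work"]  # assumed target: the verb form agreeing with the first token
```

Either array can be fed to the forward pass as a sequence of one-hot inputs. Fixing the seed in `option_b` keeps the sequence identical between the RNN and GRU runs.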

TODOs

Block A — implement BPTT for vanilla RNN

You need the gradient of \(L_{50}\) with respect to each \(h_t\). The recurrence:

for t in range(1, T+1):
    z_t = W_hh @ h[t-1] + W_xh @ x[t] + b_h
    h[t] = np.tanh(z_t)

logits = W_ho @ h[T] + b_o
loss = cross_entropy(logits, target_id)

Backward:

# Gradient of softmax cross-entropy w.r.t. the logits, then back through W_ho
p = np.exp(logits - logits.max())
p /= p.sum()
p[target_id] -= 1.0
dh[T] = W_ho.T @ p
for t in range(T, 0, -1):
    # Backprop through tanh
    dz_t = dh[t] * (1 - h[t]**2)
    # Backprop through the recurrence
    dh[t-1] = W_hh.T @ dz_t
    g[t-1] = np.linalg.norm(dh[t-1])

  • Implement this in bptt_norm.py.
  • Use the same model parameters and seed as in lab 02 for reproducibility.
  • Run for \(T = 50\).
  • Record \(g[0], g[1], \ldots, g[49]\) in decay.json.
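
The forward and backward passes above can be combined into one self-contained function. This is a sketch under stated assumptions, not a reference solution: the Gaussian initialization, the vocabulary size of 4, and the function name `rnn_grad_norms` are placeholders, and the lab asks you to load the lab 02 parameters and seed instead.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()


def rnn_grad_norms(x_ids, target_id, d_hidden=32, vocab=4, scale=0.1, seed=42):
    """Return g[0..T-1], the L2 norms of dL_T/dh_t for a random-init tanh RNN."""
    rng = np.random.default_rng(seed)
    # Assumed init: i.i.d. Gaussian with std `scale` (stand-in for lab 02 weights).
    W_xh = rng.normal(0.0, scale, (d_hidden, vocab))
    W_hh = rng.normal(0.0, scale, (d_hidden, d_hidden))
    W_ho = rng.normal(0.0, scale, (vocab, d_hidden))
    b_h = np.zeros(d_hidden)
    b_o = np.zeros(vocab)

    T = len(x_ids)
    h = np.zeros((T + 1, d_hidden), dtype=np.float64)  # h[0] is the zero state
    for t in range(1, T + 1):
        # Inputs are one-hot, so W_xh @ x[t] is a column lookup.
        # x_ids is 0-indexed while the recurrence is 1-indexed.
        h[t] = np.tanh(W_hh @ h[t - 1] + W_xh[:, x_ids[t - 1]] + b_h)

    # Loss at the final step only: dL/dlogits = softmax(logits) - onehot(target).
    p = softmax(W_ho @ h[T] + b_o)
    p[target_id] -= 1.0
    dh = W_ho.T @ p  # dL/dh[T]

    g = np.zeros(T)
    for t in range(T, 0, -1):
        dz = dh * (1.0 - h[t] ** 2)  # back through tanh
        dh = W_hh.T @ dz             # back through the recurrence: dL/dh[t-1]
        g[t - 1] = np.linalg.norm(dh)
    return g


x_ids = np.random.default_rng(0).integers(0, 4, size=50)  # Option B style input
g = rnn_grad_norms(x_ids, target_id=2)
```

`g.tolist()` is JSON-serializable, so it can be written to decay.json with `json.dump`. The key names inside that file are not fixed by the lab, so choose them once and use the same ones for the GRU.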

Block B — repeat for GRU

Compute the same gradient norms for a GRU. The backward pass is more involved because of the gates. Implementation hint: most of the gradient flows through the additive path \((1 - z_t) \odot h_{t-1}\). The Jacobian \(\partial h_t / \partial h_{t-1}\) for the GRU is approximately:

\[ \frac{\partial h_t}{\partial h_{t-1}} \approx \text{diag}(1 - z_t) + (\text{small terms from } \tilde h_t) \]

If \(z_t\) is small, this is close to the identity — gradients flow through without contraction. This is the key empirical observation of the lab.
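
For reference, the Jacobian can be written out exactly under one common GRU convention, chosen here because it matches the additive path \((1 - z_t) \odot h_{t-1}\). The parameter names \(W_\ast\), \(U_\ast\), \(b_\ast\) and the reset gate \(r_t\) are assumptions introduced for this sketch; adapt them if your GRU definition differs.

\[ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \]

\[ \tilde h_t = \tanh\!\big(W_c x_t + U_c (r_t \odot h_{t-1}) + b_c\big), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t \]

\[ \frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - z_t) + \text{diag}(z_t) \, \frac{\partial \tilde h_t}{\partial h_{t-1}} + \text{diag}(\tilde h_t - h_{t-1}) \, \text{diag}\big(z_t \odot (1 - z_t)\big) \, U_z \]

The first term is the additive path. The other two pass through the weight matrices \(U_c\), \(U_r\), and \(U_z\), so at a small init scale they are small compared with \(\text{diag}(1 - z_t)\).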

  • Implement the GRU backward in bptt_norm.py (or a separate file).
  • Same seed, same sequence.
  • Record \(g[0], \ldots, g[49]\).
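
A sketch of a single GRU step and its backward pass with respect to \(h_{t-1}\), assuming the common convention \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t\) with a reset gate applied inside the candidate. The parameter dictionary layout and the function names `init_gru`, `gru_step`, and `gru_step_backward` are hypothetical; only the gradient with respect to the hidden state is computed, since that is all this lab records.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_gru(d_hidden=32, d_in=4, scale=0.1, seed=42):
    """Assumed init: i.i.d. Gaussian weights with std `scale`, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "zrc":  # update gate, reset gate, candidate
        p["W_" + gate] = rng.normal(0.0, scale, (d_hidden, d_in))
        p["U_" + gate] = rng.normal(0.0, scale, (d_hidden, d_hidden))
        p["b_" + gate] = np.zeros(d_hidden)
    return p


def gru_step(p, x, h_prev):
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    c = np.tanh(p["W_c"] @ x + p["U_c"] @ (r * h_prev) + p["b_c"])
    h = (1.0 - z) * h_prev + z * c
    return h, (z, r, c, h_prev)


def gru_step_backward(p, dh, cache):
    """Map dL/dh_t to dL/dh_{t-1} for one step."""
    z, r, c, h_prev = cache
    dh_prev = dh * (1.0 - z)                    # additive path
    da_c = (dh * z) * (1.0 - c ** 2)            # into the candidate pre-activation
    d_rh = p["U_c"].T @ da_c                    # gradient w.r.t. (r * h_prev)
    dh_prev += d_rh * r                         # candidate path to h_prev
    da_r = (d_rh * h_prev) * r * (1.0 - r)      # reset-gate pre-activation
    da_z = (dh * (c - h_prev)) * z * (1.0 - z)  # update-gate pre-activation
    dh_prev += p["U_z"].T @ da_z + p["U_r"].T @ da_r
    return dh_prev
```

To get the GRU curve, run `gru_step` over the sequence while storing each cache, seed `dh` with the output-layer gradient exactly as in Block A, then call `gru_step_backward` from \(t = T\) down to 1 and record the norm after each call. Comparing the result against a finite-difference estimate for one step is a cheap way to catch sign or transpose errors.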

Block C — plot

decay.png:

  • x-axis: time step \(t\) from 0 to 49.
  • y-axis: \(g_t\) on a log scale.
  • Two lines: vanilla RNN (red) and GRU (blue).
  • Annotation: "vanilla RNN gradient decays by N orders of magnitude over 50 steps; GRU decays by M".

Compute N and M from your numbers; commit them in README.md.
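
One possible way to compute N and M and draw the plot, as a sketch. It assumes matplotlib is installed and that the two norm arrays are already in memory; the function names `orders_of_decay` and `plot_decay` are made up for this example, and the annotation is placed in the title for simplicity.

```python
import numpy as np


def orders_of_decay(g):
    """log10 ratio between the norm nearest the loss (last entry) and g[0]."""
    g = np.asarray(g, dtype=np.float64)
    return float(np.log10(g[-1] / g[0]))


def plot_decay(g_rnn, g_gru, path="decay.png"):
    # Imported here so orders_of_decay stays usable without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # write to file without needing a display
    import matplotlib.pyplot as plt

    n, m = orders_of_decay(g_rnn), orders_of_decay(g_gru)
    t = np.arange(len(g_rnn))
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.semilogy(t, g_rnn, color="red", label="vanilla RNN")
    ax.semilogy(t, g_gru, color="blue", label="GRU")
    ax.set_xlabel("time step t")
    ax.set_ylabel("gradient norm g_t")
    ax.set_title(
        f"vanilla RNN decays by {n:.1f} orders of magnitude over "
        f"{len(t)} recorded steps; GRU by {m:.1f}"
    )
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return n, m
```

`orders_of_decay` divides by `g[0]`, so an exact zero there needs handling first (see the underflow pitfall).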

Block D — interpret

In README.md, answer:

  1. Vanilla RNN decay rate. How many orders of magnitude per 10 steps? Compare to the theoretical prediction in theory/03-vanishing-gradient.md: with \(W_{hh}\) initialized at scale 0.1 and \(d_{hidden} = 32\), the spectral radius is well below 1, and since \(\tanh' \le 1\) the activation can only shrink the gradient further, so the decay is fast. Cite your spectral radius estimate.
  2. GRU decay rate. How many orders per 10 steps? Should be substantially less than vanilla RNN — that's the additive path doing its job.
  3. Practical implication. If the model's gradient at \(t=0\) is, say, \(10^{-15}\), can the model learn anything about the early tokens? (No. The number itself is representable in FP32, but an update that small, added to a weight of order 0.1, is far below FP32's relative precision of about \(10^{-7}\) and rounds away to no change at all.) State this as the practical fact: vanilla RNNs cannot learn long-range dependencies; this is not a hyperparameter problem.
  4. Why this motivates attention. Attention's gradient from a loss at position 50 to the input embedding at position 0 flows through one softmax weight, not 50 matrix multiplications. The gradient signal is preserved in one step. Phase 15 derives this.
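
For questions 1 and 2, the two numbers can be estimated as below. This is a sketch: the function names are hypothetical, the random `W_hh` is a stand-in for the lab 02 weights, and fitting a single line assumes the decay is roughly exponential over the whole sequence.

```python
import numpy as np


def spectral_radius(W):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))


def orders_per_10_steps(g):
    """Least-squares slope of log10(g_t) against t, scaled to 10 steps."""
    g = np.asarray(g, dtype=np.float64)
    t = np.arange(len(g))
    slope, _intercept = np.polyfit(t, np.log10(g), 1)
    return float(10.0 * slope)


rng = np.random.default_rng(42)
W_hh = rng.normal(0.0, 0.1, (32, 32))  # stand-in; load your lab 02 matrix instead
rho = spectral_radius(W_hh)
```

As a sanity check, for a large matrix with i.i.d. entries of standard deviation \(\sigma\) the spectral radius is roughly \(\sigma \sqrt{d}\), which is about 0.57 for \(\sigma = 0.1\) and \(d = 32\). If your init is not i.i.d. Gaussian, that estimate does not apply directly. The measured decay can be faster than \(\rho\) alone suggests, because the tanh derivative also contributes a factor of at most 1 at every step.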

Block E — manifest

{
  "experiment": "14-vanishing-grad",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "config": {
    "sequence_length": 50,
    "d_hidden": 32,
    "init_scale_W_hh": 0.1,
    "synthetic_target": "Option A"
  },
  "results_summary": {
    "rnn_orders_of_decay_per_10steps": null,
    "gru_orders_of_decay_per_10steps": null,
    "rnn_spectral_radius_W_hh": null
  },
  "versions": {"python": "3.11.x", "numpy": "X.Y.Z"}
}
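
The version and date fields can be filled in programmatically rather than by hand. A minimal sketch, with the function name `build_manifest` as an assumption; the three result arguments default to `None` (JSON `null`) until you pass in your measured values.

```python
import json
import platform
from datetime import date

import numpy as np


def build_manifest(rnn_orders=None, gru_orders=None, rho=None, seed=42):
    """Assemble the manifest dict with the same keys as the template above."""
    return {
        "experiment": "14-vanishing-grad",
        "date": date.today().isoformat(),  # YYYY-MM-DD
        "seed": seed,
        "config": {
            "sequence_length": 50,
            "d_hidden": 32,
            "init_scale_W_hh": 0.1,
            "synthetic_target": "Option A",
        },
        "results_summary": {
            "rnn_orders_of_decay_per_10steps": rnn_orders,
            "gru_orders_of_decay_per_10steps": gru_orders,
            "rnn_spectral_radius_W_hh": rho,
        },
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
    }


manifest_text = json.dumps(build_manifest(), indent=2)
```

Writing `manifest_text` to manifest.json at the end of the run keeps the recorded versions in sync with the environment that produced the numbers.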

Constraints

  • No training. Random-init models. We're measuring the raw gradient flow, not what training would do.
  • No PyTorch. Hand-implement BPTT. The hardest part of the lab — but mechanically illuminating.
  • Same seed for RNN and GRU. Otherwise the comparison is noise.

Stop conditions

Done when:

  1. decay.png shows two clearly-separated curves (RNN steeper than GRU).
  2. README.md answers the four interpretation questions.
  3. The orders-of-decay numbers are committed in manifest.json.
  4. You can point at the plot and explain why the RNN line is steeper.

Pitfalls

  • GRU decays just as fast as RNN. Probably \(z_t\) is close to 1 everywhere — the GRU is acting like a vanilla RNN. Check: print mean and std of \(z_t\) values during forward. If \(z_t > 0.9\) always, your init pushed the gates to saturation. Reduce init scale.
  • All gradient norms are zero. Probably tanh saturated and killed everything. Reduce init scale or check the forward for overflow.
  • All gradient norms are huge. Probably \(W_{hh}\) has spectral radius > 1. Use orthogonal init with scale 0.9 for a cleaner result.
  • Numerical underflow. \(g_0\) might be smaller than FP32 minimum (\(\sim 10^{-38}\)). If so, use FP64 (dtype=np.float64) and replace any zero with np.finfo(float).tiny for the log plot.
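
Two of the fixes above can be sketched in a few lines. The function names `orthogonal` and `floor_for_log` are hypothetical, and the QR-based construction is one standard way to draw an orthogonal matrix, not necessarily the one the reference solution uses.

```python
import numpy as np


def orthogonal(n, scale=0.9, seed=42):
    """Scaled orthogonal matrix: every singular value equals `scale`."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    q = q * np.sign(np.diag(r))  # fix column signs so the draw is not biased by QR
    return scale * q


def floor_for_log(g):
    """Replace exact zeros with the smallest normal float64 before a log plot."""
    g = np.asarray(g, dtype=np.float64)
    return np.where(g == 0.0, np.finfo(np.float64).tiny, g)


W_hh = orthogonal(32)
```

With this init, \(W_{hh}^\top\) scales every vector's norm by exactly 0.9, so the only extra contraction in the vanilla RNN comes from the tanh derivative, which makes the curve easier to interpret.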

When to consult solutions/

After committing all the files. The solution at solutions/03-vanishing-empirical-ref.md discusses what to expect at different init scales and shows a reference plot.


Phase 14 lab work is complete. Next: /quiz 14, then PHASE_14_REPORT.md, then reflection, then proceed to Phase 15.