
03 — Vanishing and exploding gradients through time¶

If the gradient vanishes in an RNN, it isn't magic: it's algebra. In BPTT you multiply by the same Jacobian matrix \(t\) times, and the eigenvalues of that matrix control whether the result goes to zero or to infinity. LSTM/GRU "fix" this by adding an additive path where the repeated multiplication does not occur.

This file derives, with no hand-waving, why vanilla RNNs cannot learn long-range dependencies, and why LSTM/GRU's additive path is the structural fix.

The setup¶

We have a vanilla RNN trained by gradient descent on a sequence loss:

\[ L = \sum_{t=1}^{T} L_t(h_t, y_t) \]

where \(L_t\) is the per-token loss: the cross-entropy between the predicted distribution \(\hat y_t\), computed from \(h_t\) by the output layer (\(W_{ho}\), \(b_o\)), and the true next token \(y_t\).

Forward dynamics:

\[ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \]
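As a concrete reading of the recurrence above, here is a minimal sketch of one forward step in dependency-free Python. The helper name `rnn_step`, the 2-unit hidden state, and all weight values are assumptions chosen for illustration; they are not taken from the course code.

```python
import math

def rnn_step(h_prev, x, W_hh, W_xh, b_h):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h = []
    for i in range(len(b_h)):
        z_i = b_h[i]
        z_i += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        z_i += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        h.append(math.tanh(z_i))
    return h

# Toy sizes and values (illustrative assumptions only).
W_hh = [[0.5, -0.3], [0.2, 0.4]]
W_xh = [[1.0], [-1.0]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]                      # h_0
for x_t in ([1.0], [0.5], [-1.0]):  # a length-3 input sequence
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
    print(h)
```

The same `W_hh` is applied at every step; that reuse is what the rest of the derivation turns on.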

To train, we need \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta \in \{W_{hh}, W_{xh}, W_{ho}, b_h, b_o\}\). The gradient with respect to \(W_{hh}\) is the interesting one — it determines whether the model can learn long-range dependencies.

BPTT: the chain rule unrolled¶

The gradient of the total loss with respect to a hidden state \(h_t\) flows from every later step that depends on \(h_t\). Concretely:

\[ \frac{\partial L}{\partial h_t} = \sum_{k \geq t} \frac{\partial L_k}{\partial h_t} \]

Each summand decomposes via the chain rule through every intermediate hidden state:

\[ \frac{\partial L_k}{\partial h_t} = \frac{\partial L_k}{\partial h_k} \cdot \prod_{j=t+1}^{k} \frac{\partial h_j}{\partial h_{j-1}} \]

That product over \(j\) is the crucial bit. We have:

\[ h_j = \tanh(W_{hh} h_{j-1} + W_{xh} x_j + b_h) = \tanh(z_j) \quad \text{where } z_j = W_{hh} h_{j-1} + W_{xh} x_j + b_h \]

So:

\[ \frac{\partial h_j}{\partial h_{j-1}} = \text{diag}(1 - \tanh^2(z_j)) \cdot W_{hh} \]

(Since \(\frac{d}{dz} \tanh(z) = 1 - \tanh^2(z)\), and we apply elementwise.)
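The single-step Jacobian can be checked numerically. The sketch below compares the analytic form \(\text{diag}(1 - \tanh^2(z_j)) \cdot W_{hh}\) against central finite differences. The vector `c` stands in for \(W_{xh} x_j + b_h\), and all numeric values are arbitrary assumptions.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def step(h_prev, W_hh, c):
    """h_j = tanh(W_hh h_{j-1} + c), with c standing in for W_xh x_j + b_h."""
    return [math.tanh(z + c_i) for z, c_i in zip(matvec(W_hh, h_prev), c)]

def jacobian(h_prev, W_hh, c):
    """Analytic Jacobian diag(1 - tanh^2(z_j)) @ W_hh: row i of W_hh scaled by 1 - h_i^2."""
    h = step(h_prev, W_hh, c)
    return [[(1.0 - h_i * h_i) * w for w in row] for h_i, row in zip(h, W_hh)]

def numeric_jacobian(h_prev, W_hh, c, eps=1e-6):
    """Central finite differences, one input coordinate (column) at a time."""
    n = len(h_prev)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up = list(h_prev)
        up[j] += eps
        down = list(h_prev)
        down[j] -= eps
        f_up, f_down = step(up, W_hh, c), step(down, W_hh, c)
        for i in range(n):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return J

# Arbitrary illustrative values.
W_hh = [[0.5, -0.3], [0.2, 0.4]]
c = [0.1, -0.2]
h_prev = [0.3, -0.6]

J_analytic = jacobian(h_prev, W_hh, c)
J_numeric = numeric_jacobian(h_prev, W_hh, c)
max_err = max(abs(a - b)
              for row_a, row_b in zip(J_analytic, J_numeric)
              for a, b in zip(row_a, row_b))
print("max abs difference:", max_err)
```

The two Jacobians should agree to within finite-difference error, which confirms that the diagonal factor scales the rows of \(W_{hh}\).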

Now the product becomes:

\[ \prod_{j=t+1}^{k} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t+1}^{k} \left[ \text{diag}(1 - \tanh^2(z_j)) \cdot W_{hh} \right] \]

This is a product of \((k - t)\) matrices, each containing one factor of \(W_{hh}\). Up to the diagonal \(\tanh\)-derivative factors, it behaves like \(W_{hh}^{k-t}\). This product is the time-unrolled Jacobian.

Why this product vanishes or explodes¶

Linear-algebra fact: when you multiply a vector by \(W_{hh}\) many times, the result is dominated by the largest-magnitude eigenvalue of \(W_{hh}\). Formally, if \(\lambda_\text{max}\) is the eigenvalue of largest magnitude, so that \(|\lambda_\text{max}|\) is the spectral radius of \(W_{hh}\), then \(\|W_{hh}^k v\| \sim |\lambda_\text{max}|^k \|v\|\) for generic \(v\) and large \(k\).

Three cases:

\(|\lambda_\text{max}| < 1\): the product \(W_{hh}^k\) contracts geometrically toward zero. After \(k = 20\) applications with \(|\lambda_\text{max}| = 0.7\), the norm is \(0.7^{20} \approx 8 \times 10^{-4}\). The gradient from token 20 back to token 0 is essentially zero. The model cannot learn long-range dependencies — gradient updates to the relevant weights are negligible.
\(|\lambda_\text{max}| > 1\): the product grows geometrically. After 20 steps with \(|\lambda_\text{max}| = 1.5\), the norm is \(1.5^{20} \approx 3300\). The gradient explodes; training is unstable; weights diverge to NaN unless explicitly clipped.
\(|\lambda_\text{max}| = 1\): marginal case. Stable in the linear regime; the tanh nonlinearity still causes problems because \(1 - \tanh^2(z) \leq 1\), and when \(|z|\) is large (saturated tanh), the derivative collapses to ~0, which contracts the gradient anyway.
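The three regimes can be reproduced in a few lines for the purely linear case (no tanh). This sketch uses a scaled 2x2 rotation matrix, an assumption chosen because its eigenvalues are \(\text{scale} \cdot e^{\pm i\theta}\), so \(|\lambda_\text{max}|\) equals the scale exactly. The function names and the angle are illustrative.

```python
import math

def scaled_rotation(scale, theta=0.3):
    """2x2 matrix with eigenvalues scale * exp(+/- i*theta), so |lambda_max| = scale."""
    c, s = math.cos(theta), math.sin(theta)
    return [[scale * c, -scale * s], [scale * s, scale * c]]

def norm_after(W, v, k):
    """Return ||W^k v||_2 by applying W to v k times."""
    for _ in range(k):
        v = [W[0][0] * v[0] + W[0][1] * v[1],
             W[1][0] * v[0] + W[1][1] * v[1]]
    return math.hypot(v[0], v[1])

norms = {scale: norm_after(scaled_rotation(scale), [1.0, 0.0], 20)
         for scale in (0.7, 1.0, 1.5)}
for scale, value in norms.items():
    print(f"|lambda_max| = {scale}: norm after 20 steps = {value:.3e}")
```

The printed norms should land near \(8 \times 10^{-4}\), \(1\), and \(3.3 \times 10^{3}\), matching the figures quoted in the three cases.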

The "well-conditioned" choice \(|\lambda_\text{max}| = 1\) is available at initialization: orthogonal initialization of \(W_{hh}\) gives \(|\lambda_\text{max}| = 1\). But training drives the weights away from this initialization, and nothing in the vanilla RNN update holds the spectrum at 1. There is no stable point.

The tanh-saturation contribution¶

Even if \(W_{hh}\) has \(|\lambda_\text{max}| = 1\), the diagonal factor \(\text{diag}(1 - \tanh^2(z_j))\) contracts the gradient whenever \(|z_j|\) is large.

\(\tanh(0) = 0\), so the derivative at zero is \(1\).
\(\tanh(\pm 3) \approx \pm 0.995\), so the derivative there is \(1 - 0.995^2 \approx 0.01\).
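To see how quickly the derivative collapses between those two points, the values can be tabulated. This is a minimal sketch; the sample points are arbitrary choices.

```python
import math

def tanh_grad(z):
    """Derivative of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

grads = {z: tanh_grad(z) for z in (0.0, 1.0, 2.0, 3.0, 5.0)}
for z, g in grads.items():
    print(f"z = {z:3.1f}   tanh'(z) = {g:.4f}")
```

The derivative is already below one half at \(|z| = 1\) and below one percent at \(|z| = 3\).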

If the pre-activation \(z_j\) pushes a hidden unit into the saturated regime, the gradient flowing through that unit is scaled to near zero at that timestep. Because the factors multiply, no other factor in the product can restore what a near-zero factor removed: the signal through that unit is lost for every earlier timestep. Tanh saturation is a one-way trap.

This is why ReLU (which has derivative \(1\) for \(z > 0\)) is sometimes used in RNNs — its derivative doesn't saturate on one side. But ReLU has its own pathologies (dead units, no contraction when \(|\lambda_\text{max}| > 1\)), and tanh remains the default in standard RNN formulations.

What this means for our verb-grammar corpus¶

Our sequences are short (6–12 tokens), so vanishing/exploding gradients are not a serious problem within a single row. A vanilla RNN trained on I work, you work, he ___ should converge fine because the gradient only has to flow back ~8 steps.

But Phase 14's lab 03 demonstrates the failure on a stretched version of the task: an artificial sequence of length 50 (perhaps a long chain of pronoun-verb-separator triples, or a contrived sequence where the answer depends on a token 30 steps back). On that stretched task, the vanishing gradient is visible empirically: the gradient norm 30 steps back from the loss is orders of magnitude smaller than 1 step back, and the model cannot learn the long-range dependency.

This is the empirical bridge to Phase 15. Vanilla RNNs are fine for our corpus (short sequences). They fail catastrophically when the dependency is long. Attention's architectural promise is "no matter how far back the relevant token is, the gradient flows in one step" — and we'll derive why in Phase 15.

The GRU/LSTM patch: the additive path¶

The GRU recurrence:

\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t \]

When \(z_t \approx 0\), this is \(h_t \approx h_{t-1}\). The state at time \(t\) is essentially the state at time \(t - 1\), with no matrix multiplication in between.

This means the Jacobian \(\frac{\partial h_t}{\partial h_{t-1}} \approx I\) (the identity matrix), not \(\text{diag}(\ldots) \cdot W_{hh}\). Multiplying by \(I\) many times is just \(I\) — no contraction, no expansion.

Of course, \(z_t\) is not exactly zero everywhere — the gates are trainable, and the GRU does update its state when it should. But the option to skip the multiplication is built into the architecture. The model can learn, on a per-step basis, to let gradients pass through cleanly.
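A rough numerical feel for the additive path: if the gates are treated as given numbers (a simplifying assumption that ignores the gradient flowing through the gates and the candidate state), the gain along the direct path is the product of \((1 - z_t)\) over the steps. The sketch below compares that with the best-case vanilla factor; the gate values and the step count are illustrative assumptions.

```python
def direct_path_gain(update_gates):
    """Product of (1 - z_t): gradient gain along the GRU's additive path only.
    Terms that flow through the gates and the candidate state are ignored."""
    gain = 1.0
    for z in update_gates:
        gain *= 1.0 - z
    return gain

steps = 50
mostly_closed = direct_path_gain([0.01] * steps)  # state is almost copied each step
half_open = direct_path_gain([0.5] * steps)       # state is half overwritten each step
vanilla_best_case = 0.7 ** steps                  # |lambda_max| = 0.7, no tanh saturation

print(f"GRU direct path, z = 0.01: {mostly_closed:.3f}")
print(f"GRU direct path, z = 0.5:  {half_open:.3e}")
print(f"vanilla RNN, best case:    {vanilla_best_case:.3e}")
```

With nearly closed gates the gain stays of order one over 50 steps. With half-open gates it decays faster than the vanilla case, which is why the benefit depends on what the gates learn.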

LSTM does the same thing with its cell state \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\). When \(f_t \approx 1\) and \(i_t \approx 0\), the cell state is preserved identically. The gradient flows back through the \(f_t \odot c_{t-1}\) path without matrix multiplication.
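Written out as a worked equation, and keeping only the direct \(f_j \odot c_{j-1}\) path (the terms that pass through the gates and through \(\tilde c_j\) are dropped, which is a simplification), the cell-state Jacobian across \(k - t\) steps is:

\[ \left. \frac{\partial c_k}{\partial c_t} \right|_\text{direct} = \prod_{j=t+1}^{k} \text{diag}(f_j) \]

As an illustrative assumption, if every forget-gate entry equals \(0.95\) for 20 steps, the direct-path gain is \(0.95^{20} \approx 0.36\), compared with \(0.7^{20} \approx 8 \times 10^{-4}\) for the vanilla RNN case discussed earlier. The product is over gate values in \((0, 1)\) that the model controls per unit and per step, not over a shared weight matrix.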

Why this is a patch, not a fix. The additive path lets gradients flow back if the gates choose to let them. Training the gates to do so requires gradients to flow back through the gates too, which is recursive. In practice, LSTMs/GRUs train more stably than vanilla RNNs but still struggle on sequences much longer than ~100 tokens. The fundamental problem — that a fixed-dimensional state must encode an arbitrarily long history — remains.

Empirical observation: what the lab will show¶

Lab 03 plots the gradient norm \(\|\partial L_T / \partial h_t\|\) vs \(t\) for a sequence of length 50, with \(L_T\) a loss at the final step. Three lines:

Vanilla RNN. The line falls off cliff-like, ~3 orders of magnitude per 10 steps. By \(t = 20\) (30 steps from the loss), the gradient is computationally indistinguishable from zero.
GRU. The line falls but more gradually, maybe 1 order of magnitude per 10 steps. The gradient at \(t = 20\) is small but nonzero.
LSTM (if you implement it). Similar to GRU. Sometimes a bit better, sometimes a bit worse, within noise of choice of initialization and gate biases.
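The vanilla RNN line can be sketched without any framework by pushing a gradient vector backward through the stored Jacobians. This is not the lab's code: the hidden size, the seed, the weight scale, the random inputs, and the all-ones gradient at the final step are all assumptions made for illustration.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def grad_norms_vanilla(W_hh, inputs, g_final):
    """Return [||dL_T/dh_t|| for t = 0..T] for h_t = tanh(W_hh h_{t-1} + u_t)."""
    h = [0.0] * len(W_hh)
    states = []
    for u in inputs:                  # forward pass, storing h_1..h_T
        h = [math.tanh(z + u_i) for z, u_i in zip(matvec(W_hh, h), u)]
        states.append(h)
    W_T = transpose(W_hh)
    g = list(g_final)                 # dL_T/dh_T
    norms = [math.sqrt(sum(x * x for x in g))]
    for h_j in reversed(states):      # g_{j-1} = W_hh^T (tanh'(z_j) * g_j)
        g = matvec(W_T, [(1.0 - h_i * h_i) * g_i for h_i, g_i in zip(h_j, g)])
        norms.append(math.sqrt(sum(x * x for x in g)))
    norms.reverse()                   # norms[t] now corresponds to h_t
    return norms

random.seed(0)
n, T = 8, 50
sigma = 0.25 / math.sqrt(n)           # small weights: a contracting regime
W_hh = [[random.gauss(0.0, sigma) for _ in range(n)] for _ in range(n)]
inputs = [[random.gauss(0.0, 0.5) for _ in range(n)] for _ in range(T)]
norms = grad_norms_vanilla(W_hh, inputs, g_final=[1.0] * n)

for t in (50, 40, 30, 20):
    print(f"t = {t:2d}   ||dL_T/dh_t|| = {norms[t]:.3e}")
```

Plotting `norms` on a log scale against `t` gives the vanilla curve. The GRU and LSTM curves need their own backward passes, which this sketch does not attempt.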

This plot is the headline figure of Phase 14. Commit it; reference it from PHASE_14_REPORT.md.

Gradient clipping (mentioned, deferred)¶

Even with GRU/LSTM, gradients can occasionally explode — for instance, when a particularly bad batch produces an extreme pre-activation that pushes one gate into a runaway regime. The standard mitigation is gradient clipping: if \(\|\nabla L\|_2 > c\) (for some threshold \(c\), typically 1.0 or 5.0), rescale the gradient to have norm \(c\).
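The rescaling rule is short enough to write out. This is a minimal sketch on a flat list of gradient values. The function name `clip_by_global_norm` and the example numbers are assumptions, and a real training loop would normally use the clipping utility of its framework instead.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.
    Returns the (possibly rescaled) gradients and the norm before clipping."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads), total
    scale = max_norm / total
    return [g * scale for g in grads], total

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(untouched)
```

In the first call the norm is 5, so the gradient is scaled to approximately `[0.6, 0.8]`; in the second the norm is already below the threshold and the values pass through unchanged. Clipping changes the step length but not the direction.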

We don't train in Phase 14, so we don't need clipping here. But flag it: any Phase 18 training loop that operates on a recurrent model must clip. Without clipping, training will diverge on the first unlucky batch.

How to read the BPTT formula at a glance¶

A pattern emerges. Look at the gradient formula once more:

\[ \frac{\partial L_k}{\partial h_t} = \frac{\partial L_k}{\partial h_k} \cdot \prod_{j=t+1}^{k} \left[ \text{diag}(1 - \tanh^2(z_j)) \cdot W_{hh} \right] \]

Three pieces, in product:

The "downstream" gradient \(\partial L_k / \partial h_k\). What the loss thinks of the state at step \(k\). It does not depend on how far back \(t\) is.
The Jacobian product over time. Repeated multiplication by \(W_{hh}\) (modulated by tanh). This is the vanishing/exploding factor.
(Implicit) the "upstream" influence on parameter gradients. Each step's contribution to \(\partial L / \partial W_{hh}\) involves the same product. The pathology is global.

The pattern of "Jacobian of state-update applied many times" recurs in every deep recurrent architecture. The mitigation in modern recurrent models (Mamba, RetNet) is to constrain the state update to be a linear operator with a controlled spectrum — fundamentally a stricter version of "orthogonal initialization that stays orthogonal".

Attention, by contrast, side-steps the problem entirely. Phase 15.

What this phase does NOT cover¶

Truncated BPTT. A training trick where the gradient is computed only over the last \(k\) steps, ignoring the dependence on earlier states. Useful in practice; Phase 18 territory.
Backward pass of LSTM/GRU. Sketch only. Full derivation would be 4–6 pages and isn't a Phase 14 deliverable.
Initialization schemes for \(W_{hh}\). Orthogonal init, identity init, Le-init. Phase 10 covers init in general; the RNN-specific choice is mentioned in a paragraph.
Spectral analysis of trained \(W_{hh}\). What do the eigenvalues do during training? Interesting question; out of scope.
Recurrent residual connections. A simpler, gate-free alternative to LSTMs (Le et al. 2015). Mentioned once.
Connection to differential equations. RNNs as Euler-discretized ODEs (Chen et al. 2018 NeuralODE). Out of scope.

A drill before lab¶

Given \(W_{hh} = 0.7 I_d\) (a scaled identity), what is \(\|\partial h_{20} / \partial h_0\|\)? Assume tanh derivatives all equal 1 (no saturation; best case).

\[ \frac{\partial h_{20}}{\partial h_0} = (0.7 I)^{20} = 0.7^{20} I \]

\[ \|0.7^{20} I\| = 0.7^{20} \approx 8 \times 10^{-4} \]

Three orders of magnitude down at 20 steps with the best-case nonlinearity. Now imagine the realistic case where tanh derivatives are sometimes \(0.5\) or \(0.1\) — the contraction is far worse.
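To put a number on that, take the illustrative assumption that every tanh derivative equals \(0.5\) at every unit and every step. Each single-step Jacobian is then \(0.5 \cdot 0.7 \, I\), and:

\[ \left\| \frac{\partial h_{20}}{\partial h_0} \right\| = (0.5 \cdot 0.7)^{20} = 0.35^{20} \approx 7.6 \times 10^{-10} \]

That is about nine orders of magnitude down, instead of three.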

This geometric decay is what motivates GRU, LSTM, attention, and transformers.

Done with theory. Next: lab/00-tokenize-corpus.md.