04 — Side-by-side gradient flow: RNN vs LSTM vs attention on the §A13 corpus

Here we put the three sequence models side by side and trace the gradient from the loss back to the first token of the sequence. Same §A13 sentence ("she will work tomorrow", length 5 with the final "."). Vanilla RNN: the gradient scales as r^(T-1), so it decays when r < 1. LSTM: the forget gate keeps it near 1. Attention: a direct route with a single matmul step. The final plot explains why attention won.

Anchors: LYNX_CORTEX.md §4 / PHASE 14; theory §02 RNN recurrence, §03 vanishing gradient; Phase 15 §02 scaled dot-product. Phase 16 RoPE notes the same comparison.


The setup

Sequence: ["she", "will", "work", "tomorrow", "."] (length T = 5). Each token embedded to d = 8 for arithmetic clarity. Hidden state size H = 8.

The loss L is the cross-entropy at the last position predicting ".". We trace ∂L/∂x_1 (gradient at the first token) for three models.


Model 1 — Vanilla RNN

h_t = tanh(W_h h_{t-1} + W_x x_t + b)
y_T = W_o h_T
L   = CE(y_T, target)

Backward:

\[ \frac{\partial L}{\partial h_T} = (\partial L / \partial y_T) \cdot W_o \]
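\[ \frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_{t+1}} \cdot \mathrm{diag}(1 - h_{t+1}^2) \cdot W_h \]

The factor diag(1 - h²) · W_h repeats T-1 times. Treat each application as a linear map with per-step gain r. The tanh derivative is at most 1, so r is bounded by the largest singular value of W_h; near initialization it is commonly approximated by the spectral radius |λ_max(W_h)|. So: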

\[ \Bigl\| \frac{\partial L}{\partial h_1} \Bigr\| \approx r^{T-1} \cdot \Bigl\| \frac{\partial L}{\partial h_T} \Bigr\| \]

For T = 5 and r = 0.7 (typical at init): the gradient at h_1 is 0.7^4 ≈ 0.24 of the gradient at h_T. For T = 50, it is 0.7^49 ≈ 2.6e-8: vanishing.

For r = 1.3 and T = 50: 1.3^49 ≈ 3.8e5: exploding.

The §A13 sequences are short (T ≤ 10) so vanishing is survivable but already noticeable. Lab 03 measures the empirical gradient norm vs depth — Borja will see the exponential decay.
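The backward recursion above can be checked numerically with a small NumPy sketch. This is not the Lab 03 code. It assumes a scaled orthogonal W_h (so every singular value, and the spectral radius, equals r), small random inputs, and a random unit vector standing in for ∂L/∂h_T; the name `rnn_grad_norms` is hypothetical.

```python
import numpy as np

def rnn_grad_norms(r, T=5, H=8, seed=0):
    """Return [||dL/dh_1||, ..., ||dL/dh_T||] for a toy tanh RNN."""
    rng = np.random.default_rng(seed)
    # Assumption: W_h = r * (orthogonal matrix), so each step scales norms by exactly r
    # before the tanh derivative is applied.
    Qmat, _ = np.linalg.qr(rng.standard_normal((H, H)))
    W_h = r * Qmat
    W_x = 0.1 * rng.standard_normal((H, H))
    xs = rng.standard_normal((T, H))

    # Forward pass, keeping every hidden state for the backward pass.
    h = np.zeros(H)
    hs = []
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ xs[t])
        hs.append(h)

    # Stand-in for dL/dh_T (the real one would come from the cross-entropy loss).
    g = rng.standard_normal(H)
    g /= np.linalg.norm(g)
    norms = [np.linalg.norm(g)]

    # Backward: dL/dh_t = W_h^T (diag(1 - h_{t+1}^2) dL/dh_{t+1}), column-vector form.
    for t in range(T - 1, 0, -1):
        g = W_h.T @ ((1.0 - hs[t] ** 2) * g)
        norms.append(np.linalg.norm(g))
    return norms[::-1]  # index 0 corresponds to t = 1

norms = rnn_grad_norms(r=0.7, T=5)
print([f"{n:.3f}" for n in norms])
```

Under these assumptions each backward step shrinks the norm by at most a factor r, so the first entry is bounded by r^(T-1) times the last; the tanh derivative can only shrink it further.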


Model 2 — LSTM

f_t = σ(W_f x_t + U_f h_{t-1})       # forget gate
i_t = σ(W_i x_t + U_i h_{t-1})       # input gate
o_t = σ(W_o x_t + U_o h_{t-1})       # output gate
g_t = tanh(W_g x_t + U_g h_{t-1})    # candidate
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t      # cell state ← KEY
h_t = o_t ⊙ tanh(c_t)

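The cell state c_t has a (mostly) additive update. Along the direct path c_t → c_{t-1}, treating the gate values as given (that is, ignoring their indirect dependence on c_{t-1} through h_{t-1}), the backward is elementwise: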

\[ \frac{\partial c_t}{\partial c_{t-1}} = f_t \]

That's it. Not multiplied by a weight matrix, not multiplied by a non-linearity's derivative. Just the forget gate f_t ∈ (0, 1).

So:

\[ \frac{\partial L}{\partial c_1} = \prod_{t=2}^{T} f_t \cdot \frac{\partial L}{\partial c_T} \]

If the network learns f_t ≈ 1 (don't forget), the product stays near 1 even for T = 100. The LSTM's contribution is making the gradient highway through c_t controllable, while RNN's tanh(W h) highway was uncontrollable.

For §A13's T = 5: even if f_t = 0.9, the gradient is 0.9^4 ≈ 0.66 — a far gentler decay than RNN's 0.7^4 ≈ 0.24.
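The forget-gate product can be verified on the cell-state recurrence alone. The sketch below is an illustration under stated assumptions, not an LSTM implementation: the gates are held fixed at 0.9 (as in the text), `IG` is a random stand-in for i_t ⊙ g_t, and `run_cell` is a hypothetical helper.

```python
import numpy as np

def run_cell(c1, F, IG):
    """Roll c_t = f_t * c_{t-1} + i_t * g_t forward from c_1 with gates held fixed."""
    c = c1
    for t in range(1, F.shape[0]):   # rows 1..T-1 hold the gates for steps 2..T
        c = F[t] * c + IG[t]
    return c

rng = np.random.default_rng(0)
T, H = 5, 8
F = np.full((T, H), 0.9)                  # forget gates fixed at 0.9
IG = 0.1 * rng.standard_normal((T, H))    # stand-in for i_t * g_t

# Direct-path gain: prod_{t=2}^{T} f_t, one value per cell unit.
gain = np.prod(F[1:], axis=0)

# Finite-difference check of d c_T / d c_1 on unit 0.
c1 = rng.standard_normal(H)
eps = 1e-6
bump = np.zeros(H)
bump[0] = eps
fd = (run_cell(c1 + bump, F, IG) - run_cell(c1, F, IG))[0] / eps

print(gain[0], fd)  # both close to 0.9**4
```

Because the recurrence is linear in c once the gates are fixed, the finite difference and the product agree up to rounding. In a full LSTM the gates also depend on h_{t-1}, which adds indirect terms this sketch leaves out.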

The LSTM still has problems (the gate sigmoids saturate, the h_t path has the same vanishing as RNN), but the c_t highway is the headline win.


Model 3 — Self-attention (single-head, length 5)

Q = X W_Q        # (T, d_k) = (5, 8)
K = X W_K        # (T, d_k)
V = X W_V        # (T, d_v) = (5, 8)
A = softmax(Q K.T / sqrt(d_k))  # (T, T) = (5, 5)
Y = A V          # (T, d_v)
y_T = W_o Y_T
L   = CE(y_T, target)

Backward to x_1 (the embedding of "she"):

\[ \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial X_{1,\cdot}} \]

X_1 enters the computation via Q_1, K_1, V_1 — and through A (whose row t weights V_1 by A[t,1]). The gradient back to x_1 is one matmul deep through A:

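\[ \frac{\partial L}{\partial X_{1, \cdot}} = A_{T, 1} \cdot \frac{\partial L}{\partial Y_T} \cdot W_V^\top + (\text{gradient through } K_1) \]

Only row T of A appears because the loss reads Y_T alone. Q_1 shapes row 1 of A only, so it contributes nothing in this setup. With a loss at every position, the first term becomes a sum over t of A_{t,1} terms and the Q path contributes as well.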

There is no compound multiplication that grows or shrinks with T. The gradient depth from output to input is constant in T (Vaswani et al. 2017, §4 "Why Self-Attention" — the path length argument).

For §A13 at T = 5 (or T = 500): the gradient at the first token is O(1), not O(r^T).
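To make the single-hop path concrete, here is a NumPy sketch that computes ∂L/∂x_1 analytically for the setup above (loss at the last position only) and checks it against central finite differences. The weights are random, the vocabulary size and target index are hypothetical, and rows of X are treated as row vectors (so `W_o` has shape (d_v, vocab)).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_fn(X, P, target):
    """Single-head self-attention, then cross-entropy at the last position."""
    Q, K, V = X @ P["W_Q"], X @ P["W_K"], X @ P["W_V"]
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    y_T = (A @ V)[-1] @ P["W_o"]
    return -np.log(softmax(y_T)[target])

def grad_x1(X, P, target):
    """Analytic dL/dx_1, split into the V path and the K path."""
    Q, K, V = X @ P["W_Q"], X @ P["W_K"], X @ P["W_V"]
    d_k = K.shape[1]
    a = softmax(Q[-1] @ K.T / np.sqrt(d_k))   # row T of A
    p = softmax((a @ V) @ P["W_o"])
    p[target] -= 1.0                          # dL/dy_T for softmax + cross-entropy
    gY = P["W_o"] @ p                         # dL/dY_T
    via_V = a[0] * (P["W_V"] @ gY)            # A[T,1] * dL/dY_T * W_V^T
    da = V @ gY                               # dL/dA[T, j]
    ds = a * (da - a @ da)                    # backward through the softmax row
    via_K = ds[0] / np.sqrt(d_k) * (P["W_K"] @ Q[-1])
    return via_V, via_K

rng = np.random.default_rng(0)
T, d, vocab, target = 5, 8, 5, 4              # vocab and target are hypothetical
X = rng.standard_normal((T, d))
P = {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in ("W_Q", "W_K", "W_V")}
P["W_o"] = rng.standard_normal((d, vocab)) / np.sqrt(d)

via_V, via_K = grad_x1(X, P, target)

# Central finite differences on the first row of X as an independent check.
eps = 1e-5
fd = np.zeros(d)
for i in range(d):
    Xp, Xm = X.copy(), X.copy()
    Xp[0, i] += eps
    Xm[0, i] -= eps
    fd[i] = (loss_fn(Xp, P, target) - loss_fn(Xm, P, target)) / (2 * eps)

print(np.max(np.abs(fd - (via_V + via_K))))   # should be tiny
```

Neither term contains a product that repeats T-1 times: the first token reaches the loss through one attention row, whatever the sequence length.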


The headline numbers

Symbolic gradient at the first token (in T, with r = 0.9 for RNN, f = 0.9 for LSTM):

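Model                | Gradient magnitude at x_1         | At T=5 | At T=50
---------------------+-----------------------------------+--------+--------
Vanilla RNN          | O(r^(T-1))                        | 0.66   | 0.0057
LSTM (c path)        | O(f^(T-1)) (controllable)         | 0.66   | 0.0057
LSTM (h path)        | Same as RNN, but c is the highway | -      | -
Attention (1 head)   | O(1)                              | ~1     | ~1
Attention (L layers) | O(1)^L = O(1)                     | ~1     | ~1

With r = f = 0.9 the RNN and LSTM rows coincide numerically. The difference is that f is a learned gate the network can push toward 1, while r is fixed by W_h.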
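The RNN and LSTM entries can be reproduced by compounding the per-step gain over the T-1 steps between position T and position 1, the convention used in the RNN derivation. A short sketch; `compounded_gain` is a hypothetical helper and f = 0.99 is an illustrative value for a gate that has learned to stay open, not a measured one.

```python
def compounded_gain(per_step, T):
    """Gain left after the T-1 steps from position T back to position 1."""
    return per_step ** (T - 1)

for T in (5, 50):
    rnn = compounded_gain(0.9, T)         # r = 0.9
    lstm = compounded_gain(0.9, T)        # f = 0.9 on the c path
    lstm_open = compounded_gain(0.99, T)  # illustrative: forget gate near 1
    print(f"T={T}: RNN {rnn:.4f}, LSTM c path {lstm:.4f}, "
          f"LSTM with f=0.99 {lstm_open:.4f}")
```

The last column is the "controllable" part: moving f from 0.9 to 0.99 changes the T = 50 gain by about two orders of magnitude, with no change to the architecture.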

Plus a critical second metric: path length from output to first input:

  • RNN: T (each step is one node).
  • LSTM: T (gates do flatten it, but still T conceptual hops).
  • Attention: 1 (one matmul connects every input to every output).

This is the architectural reason attention won. Not "more parameters": the path from output to first input is shorter, and the gradient flow is controllable by construction.


What §A13 specifically shows

Lab 03 trains three models on the same §A13 next-token-prediction task and reports:

Model           | Steps to 90% val acc | Final val acc | Train wall time
----------------+----------------------+---------------+----------------
RNN (h=64)      | did not converge     | 28%           | 5 min × no-go
LSTM (h=64)     | 800                  | 92%           | 8 min
Attention (1L)  | 200                  | 96%           | 3 min

(Numbers approximate — Phase 14 lab 03 measures the real values. The headline is: attention converges faster and to higher accuracy on the same compute budget.)


What attention does not give you for free

  1. It's O(T²) in memory and compute (Phase 15). For very long sequences, this dominates.
  2. It's permutation-equivariant without positional encoding (Phase 16). Add PE.
  3. It's not naturally causal — for language modeling you must mask future positions (Phase 15 §04).
  4. It needs multi-head to capture different relations (Phase 15 §03). Single-head is a thin shadow of the full mechanism.
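Points 1 to 3 of the list above can be seen directly in a few lines of NumPy. This is a sketch with random weights, not the Phase 15 implementation; the `attention` function and its `causal` flag are hypothetical names for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V, causal=False):
    """Single-head self-attention; returns the output Y and the weights A."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = Q @ K.T / np.sqrt(K.shape[1])
    if causal:
        # Mask the strictly upper triangle so position t ignores positions > t.
        future = np.triu(np.ones_like(S, dtype=bool), k=1)
        S = np.where(future, -np.inf, S)
    A = softmax(S)
    return A @ V, A

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.standard_normal((T, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Point 1: the attention matrix is T x T, so memory grows quadratically in T.
Y, A = attention(X, W_Q, W_K, W_V)
print(A.shape)  # (5, 5)

# Point 2: permuting the input rows permutes the output rows identically.
perm = rng.permutation(T)
Y_perm, _ = attention(X[perm], W_Q, W_K, W_V)
print(np.allclose(Y_perm, Y[perm]))  # True

# Point 3: with the causal mask, no weight is placed on future positions.
_, A_causal = attention(X, W_Q, W_K, W_V, causal=True)
print(np.allclose(np.triu(A_causal, k=1), 0.0))  # True
```

The permutation check is why word order is invisible to attention until a positional encoding is added to X.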

These are addressed in Phases 15, 16, and 17. But the gradient-flow win of this section is what motivates the whole transformer.


Citations

  • Hochreiter, S., Schmidhuber, J. 1997. "Long short-term memory." Neural Computation 9(8):1735–1780. The original LSTM paper.
  • Pascanu, R., Mikolov, T., Bengio, Y. 2013. "On the difficulty of training recurrent neural networks." arXiv:1211.5063 — vanishing/exploding gradient analysis.
  • Vaswani, A. et al. 2017. "Attention Is All You Need." arXiv:1706.03762. Section 4 (Table 1) compares path lengths for RNN vs CNN vs self-attention.

One-paragraph recap

A 5-token §A13 sentence's gradient back-flow to the first token: RNN multiplies the gradient by r^(T-1) (vanishes for r<1, explodes for r>1); LSTM via the cell state c_t multiplies by ∏ f_t (controllable via learned forget gates); attention multiplies by nothing — the path length is constant in T. This is why attention won architecturally. The §A13 lab measures the empirical gradient at the first token across all three models on the same task and produces the side-by-side plot.


Prev: 03-vanishing-gradient.md Next: Phase 15 (attention).