Phase 14 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-14-pre-transformer-sequence.yaml.

Source: data/quizzes/phase-14-pre-transformer-sequence.yaml.


q-14-01 — Vanilla RNN gradient at depth (single)

  • O(T)
  • O(r^T)
  • O(log T)
  • O(T^r)

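Each backward step multiplies the gradient by W_h · diag(1 - h²), which is approximately W_h when the hidden units are not saturated. Compounding this over T-1 steps scales the gradient by O(r^(T-1)), which is O(r^T), where r is the per-step scaling factor (roughly the spectral radius of W_h). With r < 1 the gradient vanishes; with r > 1 it explodes.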
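The compounding can be checked numerically. The sketch below assumes a scalar tanh RNN, h_t = tanh(w · h_{t-1} + x_t) with h_0 = 0, so that w plays the role of r; the function name `backprop_factor` is hypothetical and not taken from the quiz source.

```python
import math


def backprop_factor(w, inputs):
    """Product of the per-step derivatives d h_t / d h_{t-1}.

    Assumed recurrence (illustrative only): h_t = tanh(w * h_{t-1} + x_t), h_0 = 0.
    For T inputs there are T - 1 recurrent steps to differentiate through.
    """
    h = 0.0
    factor = 1.0
    for t, x in enumerate(inputs):
        h = math.tanh(w * h + x)
        if t > 0:
            # d h_t / d h_{t-1} = w * (1 - h_t^2); the tanh term is at most 1
            factor *= w * (1.0 - h * h)
    return factor


if __name__ == "__main__":
    for w in (0.8, 1.0, 1.2):
        for T in (5, 20, 50):
            print(f"w={w} T={T} factor={backprop_factor(w, [0.0] * T):.3e}")
```

With all-zero inputs the hidden state stays at 0, the tanh derivative is exactly 1, and the factor reduces to w^(T-1). Nonzero inputs push the units toward saturation, which can only shrink the magnitude further.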


q-14-02 — What does the LSTM cell state make controllable? (multi)

  • It is exactly f_t, with no weight matrix multiplied in
  • It is bounded in [0, 1] per step
  • It is bounded in [-1, 1] per step
  • It can be learned to be close to 1 (no forgetting)

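Along the direct cell-state path, ∂c_t/∂c_{t-1} = f_t exactly, with no weight matrix multiplied in. Because f_t is a sigmoid output, f_t ∈ (0, 1). The network can learn f_t ≈ 1 where it should remember; this is the LSTM gradient highway.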
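To see what "learned to be close to 1" buys, the following sketch multiplies f_t along the direct c_{t-1} → c_t path for a scalar cell. The helper name `cell_gradient` and the logit values are illustrative assumptions, not part of the quiz source.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cell_gradient(forget_logits):
    """Gradient of c_T with respect to c_0 along the direct cell-state path.

    Each step contributes f_t = sigmoid(logit_t); no weight matrix appears.
    """
    grad = 1.0
    for z in forget_logits:
        grad *= sigmoid(z)
    return grad


if __name__ == "__main__":
    # A strongly positive forget logit keeps f_t near 1 (remember).
    print("logit 6.0, 100 steps:", cell_gradient([6.0] * 100))
    # A zero logit gives f_t = 0.5, so the gradient halves at every step.
    print("logit 0.0, 100 steps:", cell_gradient([0.0] * 100))
```

The first case keeps most of the gradient after 100 steps, while the second decays geometrically, which is the same vanishing behaviour as a vanilla RNN with r = 0.5.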


q-14-03 — Path length from output to input (free)

Expected to contain: 1.

Self-attention: O(1), because one matmul connects every input to every output. RNN/LSTM: O(T). See Vaswani et al. 2017, Table 1.
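The path-length claim can be made concrete by counting edges in a dependency graph. This is a minimal sketch under stated assumptions: a single layer, nodes ("x", t) for inputs and ("h", t) for outputs, and hypothetical helper names `build_edges` and `path_length`.

```python
from collections import deque


def build_edges(T, kind):
    """Adjacency lists for a one-layer dependency graph over T positions."""
    edges = {}
    for t in range(T):
        if kind == "rnn":
            # x_t feeds h_t, and h_t feeds h_{t+1}
            edges.setdefault(("x", t), []).append(("h", t))
            if t + 1 < T:
                edges.setdefault(("h", t), []).append(("h", t + 1))
        elif kind == "attention":
            # every input feeds every output directly
            for j in range(T):
                edges.setdefault(("x", t), []).append(("h", j))
        else:
            raise ValueError(f"unknown kind: {kind}")
    return edges


def path_length(T, kind):
    """Number of edges on the shortest path from the first input to the last output."""
    edges = build_edges(T, kind)
    start, goal = ("x", 0), ("h", T - 1)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise RuntimeError("goal unreachable")


if __name__ == "__main__":
    for T in (5, 50, 500):
        print(T, path_length(T, "rnn"), path_length(T, "attention"))
```

In this graph the recurrent path grows linearly with T while the attention path stays at a single edge, regardless of sequence length.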


q-14-04 — Find the bug: LSTM diverges immediately (single)

An LSTM training run produces NaN loss within 20 steps. The cell state exceeds 1e30 just before NaN.

  • Wrong initialization scale on h_0
  • Sigmoid missing on the forget/input/output gates
  • Tanh missing on the candidate g
  • Wrong batch dimension in the einsum

Without σ on f, the forget gate is unbounded and c_t grows without limit. See break/00-break-disable-lstm-gate-sigmoid.md.
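A small simulation shows how quickly the missing sigmoid matters. The sketch assumes a scalar cell with fixed gate values, c_t = f · c_{t-1} + i · g, iterated over time steps within one sequence; the function `run_cell` and the forget pre-activation of 50 are illustrative choices, not values from the referenced break exercise.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def run_cell(forget_logit, steps, use_sigmoid=True):
    """Iterate c_t = f * c_{t-1} + i * g with fixed scalar gates.

    use_sigmoid=False reproduces the bug: the raw pre-activation is used as f.
    """
    f = sigmoid(forget_logit) if use_sigmoid else forget_logit
    i = sigmoid(0.0)    # input gate, exactly 0.5
    g = math.tanh(1.0)  # candidate value, bounded in (-1, 1)
    c = 0.0
    for _ in range(steps):
        c = f * c + i * g
    return c


if __name__ == "__main__":
    print("with sigmoid:   ", run_cell(50.0, 20, use_sigmoid=True))
    print("without sigmoid:", run_cell(50.0, 20, use_sigmoid=False))
```

With the sigmoid in place, f stays below 1 and the cell state can grow at most linearly in the number of steps. Without it, any pre-activation above 1 makes the recurrence geometric, and the state passes 1e30 in a handful of steps.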


q-14-05 — Why does attention beat RNN even at short sequences? (free)

Expected to contain: path.

Even at T=5, an RNN's gradient at the first token is multiplied by r^4 ≈ 0.41 (for r = 0.8), while attention's gradient is O(1). The path-length argument applies at all T.