Phase 14 — Quiz (human-readable mirror)
Human-readable mirror of the canonical data/quizzes/phase-14-pre-transformer-sequence.yaml.
Source: data/quizzes/phase-14-pre-transformer-sequence.yaml.
q-14-01 — Vanilla RNN gradient at depth (single)
- O(T)
- O(r^T) ✓
- O(log T)
- O(T^r)
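Each backward step multiplies by W_h · diag(1 - h²) ≈ W_h; the approximation holds while tanh is unsaturated (h ≈ 0). Compounding this T-1 times gives O(r^(T-1)), which is O(r^T), where r is the per-step scale factor contributed by W_h. For r < 1 the gradient vanishes; for r > 1 it explodes.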
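The compounding can be checked on a scalar toy model. The sketch below is illustrative only: it assumes a one-dimensional RNN h_t = tanh(w · h_{t-1} + x_t), so the scalar `w` plays the role of r, and the function name `rnn_gradient_product` is hypothetical. With zero inputs and h_0 = 0 the tanh never saturates and the product is exactly w^(T-1); with nonzero inputs the factor (1 - h²) only shrinks it further.

```python
import math

def rnn_gradient_product(w: float, inputs: list[float], h0: float = 0.0) -> float:
    """Return d h_T / d h_1 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t).

    Assumption: a 1-D toy model, so W_h is the scalar w and diag(1 - h^2)
    is the scalar (1 - h_t^2).
    """
    h = h0
    grad = 1.0
    for t, x in enumerate(inputs):
        h = math.tanh(w * h + x)
        if t > 0:  # steps 2..T each contribute one Jacobian factor (T-1 in total)
            grad *= w * (1.0 - h * h)
    return grad

T = 50
vanishing = rnn_gradient_product(0.8, [0.0] * T)   # unsaturated, w < 1
exploding = rnn_gradient_product(1.2, [0.0] * T)   # unsaturated, w > 1
saturated = rnn_gradient_product(0.8, [0.5] * T)   # tanh derivative shrinks it further

print(f"w=0.8: {vanishing:.3e}")
print(f"w=1.2: {exploding:.3e}")
print(f"w=0.8 with nonzero inputs: {saturated:.3e}")
```

In this toy setting w^(T-1) acts as an upper bound on the magnitude of the product, which is why the quiz answer is stated in O(·) terms.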
q-14-02 — What does the LSTM cell state make controllable? (multi)
- It is exactly f_t, with no weight matrix multiplied in ✓
- It is bounded in [0, 1] per step ✓
- It is bounded in [-1, 1] per step
- It can be learned to be close to 1 (no forgetting) ✓
∂c_t/∂c_{t-1} = f_t exactly. With sigmoid, f_t ∈ (0, 1). The network learns f_t ≈ 1 when it should remember: this is the LSTM gradient highway.
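A quick numerical check of this identity, as a minimal sketch. It assumes the standard cell update c_t = f_t · c_{t-1} + i_t · g_t in scalar form and holds the gate pre-activations fixed, so only the direct cell-state path is differentiated. The pre-activation values and helper names are made up for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def cell_update(c_prev: float, f_pre: float, i_pre: float, g_pre: float) -> float:
    """Scalar LSTM cell update: c_t = f_t * c_{t-1} + i_t * g_t."""
    f = sigmoid(f_pre)      # forget gate, in (0, 1)
    i = sigmoid(i_pre)      # input gate, in (0, 1)
    g = math.tanh(g_pre)    # candidate, in (-1, 1)
    return f * c_prev + i * g

# Hypothetical pre-activations; f_pre = 4.0 puts the forget gate close to 1.
f_pre, i_pre, g_pre = 4.0, -1.0, 0.5
c_prev, eps = 0.7, 1e-6

# Central finite difference with respect to c_{t-1}, gates held fixed.
numeric = (cell_update(c_prev + eps, f_pre, i_pre, g_pre)
           - cell_update(c_prev - eps, f_pre, i_pre, g_pre)) / (2 * eps)
forget_gate = sigmoid(f_pre)

print(f"finite difference: {numeric:.6f}")
print(f"forget gate f_t:   {forget_gate:.6f}")
```

The two printed values should agree to within finite-difference rounding, with no weight matrix entering the derivative.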
q-14-03 — Path length from output to input (free)
Expected to contain: 1.
O(1): one matmul connects every input to every output. For an RNN/LSTM the path length is O(T). See Vaswani et al. 2017, Table 1.
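The path-length claim can be made concrete by counting edges in a dependency graph. The sketch below is an illustration under simplifying assumptions: one recurrent layer is modelled as a chain x_t → h_t with h_{t-1} → h_t, one attention layer as a complete bipartite graph x_s → y_t, and the node labels and function names are hypothetical.

```python
from collections import deque

def shortest_path(edges: dict, start, goal):
    """Breadth-first search; returns the number of edges from start to goal, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def rnn_graph(T: int) -> dict:
    """Recurrent layer: x_t feeds h_t, and h_{t-1} feeds h_t."""
    edges = {}
    for t in range(T):
        edges.setdefault(("x", t), []).append(("h", t))
        if t > 0:
            edges.setdefault(("h", t - 1), []).append(("h", t))
    return edges

def attention_graph(T: int) -> dict:
    """Single attention layer: every input position feeds every output position."""
    return {("x", s): [("y", t) for t in range(T)] for s in range(T)}

T = 16
rnn_len = shortest_path(rnn_graph(T), ("x", 0), ("h", T - 1))
attn_len = shortest_path(attention_graph(T), ("x", 0), ("y", T - 1))
print(f"RNN path length: {rnn_len}, attention path length: {attn_len}")
```

Under this model the recurrent path grows linearly with T while the attention path stays at a single edge.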
q-14-04 — Find the bug: LSTM diverges immediately (single)
An LSTM training run produces NaN loss within 20 steps. The cell state exceeds 1e30 just before NaN.
- Wrong initialization scale on h_0
- Sigmoid missing on the forget/input/output gates ✓
- Tanh missing on the candidate g
- Wrong batch dimension in the einsum
Without σ on f, the forget gate is unbounded and c_t grows without limit. See break/00-break-disable-lstm-gate-sigmoid.md.
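The failure mode can be reproduced in a few lines. This is a toy sketch, not the code from the referenced break exercise: it assumes a scalar cell with fixed, made-up pre-activations, and the `gate_sigmoid` flag is a hypothetical switch that simulates the bug by using the raw pre-activation as the gate value.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def run_cell(f_pre: float, i_pre: float, g_pre: float,
             steps: int, gate_sigmoid: bool) -> float:
    """Iterate c_t = f * c_{t-1} + i * g with fixed pre-activations."""
    # Buggy path (gate_sigmoid=False): the raw pre-activation is used as the gate.
    f = sigmoid(f_pre) if gate_sigmoid else f_pre
    i = sigmoid(i_pre) if gate_sigmoid else i_pre
    g = math.tanh(g_pre)
    c = 0.0
    for _ in range(steps):
        c = f * c + i * g
    return c

# Illustrative values only: a forget pre-activation of 3.0 is a valid gate
# after the sigmoid (about 0.95) but a per-step multiplier of 3 without it.
healthy = run_cell(3.0, 1.0, 1.0, steps=100, gate_sigmoid=True)
broken = run_cell(3.0, 1.0, 1.0, steps=100, gate_sigmoid=False)

print(f"with sigmoid:    c = {healthy:.3f}")
print(f"without sigmoid: c = {broken:.3e}")
```

With the sigmoid in place the cell state settles to a bounded value; without it the state is multiplied by the raw gate on every step and passes 1e30, matching the symptom described in the question.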
q-14-05 — Why does attention beat RNN even at short sequences? (free)
Expected to contain: path.
Even at T=5, the RNN's gradient at the first token is multiplied by r^4 ≈ 0.41 (r=0.8), while attention's gradient is O(1). The path-length argument applies at all T.
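The arithmetic for short sequences, worked out as a small sketch. It uses r = 0.8 from the explanation above and treats the attention path as a single step regardless of T.

```python
r = 0.8  # per-step factor used in the quiz explanation

for T in range(2, 6):
    rnn_factor = r ** (T - 1)  # T-1 recurrent steps between first and last token
    print(f"T={T}: RNN factor {rnn_factor:.4f}, attention path length 1")

factor_at_5 = r ** 4  # the T=5 case quoted in the explanation
```

Even over four recurrent steps the factor has already dropped below one half, while the attention path does not depend on T.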