
Phase 14 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-14-pre-transformer-sequence.yaml.

Source: data/quizzes/phase-14-pre-transformer-sequence.yaml.

q-14-01 — Vanilla RNN gradient at depth (single)

O(T)
O(r^T) ✓
O(log T)
O(T^r)

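Each backward step multiplies the gradient by W_h · diag(1 - h²), which is approximately W_h while the tanh units are unsaturated. Compounding over T-1 steps gives O(r^(T-1)), which is O(r^T), where r denotes the per-step scaling (roughly the spectral radius of W_h). With r < 1 the gradient vanishes; with r > 1 it explodes.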
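The compounding above can be checked numerically. The sketch below is a scalar simplification, not a full RNN: it assumes every backward step scales the gradient by the same factor r, and the helper name `gradient_scale` is hypothetical.

```python
def gradient_scale(r: float, T: int) -> float:
    """Gradient scale after T-1 backward steps, assuming each step multiplies by r."""
    return r ** (T - 1)


# Compare a shrinking, a neutral, and a growing per-step factor.
for r in (0.8, 1.0, 1.2):
    scales = [gradient_scale(r, T) for T in (5, 50, 100)]
    print(f"r={r}: " + ", ".join(f"{s:.3e}" for s in scales))
```

Under this assumption the r = 0.8 row shrinks toward zero as T grows, the r = 1.2 row grows by many orders of magnitude, and only r = 1.0 stays constant.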

q-14-02 — What does the LSTM cell state make controllable? (multi)

It is exactly f_t, with no weight matrix multiplied in ✓
It is bounded in [0, 1] per step ✓
It is bounded in [-1, 1] per step
It can be learned to be close to 1 (no forgetting) ✓

Along the direct cell-state path, ∂c_t/∂c_{t-1} = f_t exactly. Because f_t is a sigmoid output, f_t ∈ (0, 1). The network can learn f_t ≈ 1 when it should remember: this is the LSTM gradient highway.
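To illustrate the gradient highway, the sketch below multiplies forget-gate values along the direct c_{t-1} → c_t path only. It is a scalar toy, assuming constant forget-gate pre-activations; the helper name `cell_gradient` and the logit values 6.0 and 0.0 are chosen for illustration and do not come from the quiz source.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cell_gradient(forget_logits):
    """Product of f_t = sigmoid(logit) along the direct cell-state path."""
    grad = 1.0
    for z in forget_logits:
        grad *= sigmoid(z)
    return grad


T = 100
remember = cell_gradient([6.0] * T)  # f_t ≈ 0.9975 at every step
forget = cell_gradient([0.0] * T)    # f_t = 0.5 at every step
print(f"f≈1: {remember:.3f}   f=0.5: {forget:.3e}")
```

With gates held near 1, most of the gradient survives 100 steps; with gates at 0.5 it collapses, which is the behavior the network controls by learning f_t.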

q-14-03 — Path length from output to input (free)

Expected to contain: 1.

O(1): in self-attention, one matmul connects every input to every output. For an RNN/LSTM the path length is O(T). See Vaswani et al. 2017, Table 1.

q-14-04 — Find the bug: LSTM diverges immediately (single)

An LSTM training run produces NaN loss within 20 steps. The cell state exceeds 1e30 just before NaN.

Wrong initialization scale on h_0
Sigmoid missing on the forget/input/output gates ✓
Tanh missing on the candidate g
Wrong batch dimension in the einsum

Without σ on f, the forget gate is unbounded, so c_t can be multiplied by a factor larger than 1 at every step and grow without limit. See break/00-break-disable-lstm-gate-sigmoid.md.
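The divergence can be reproduced in a scalar toy. The sketch below iterates only the cell-state update c_t = f · c_{t-1} + i · g with constant pre-activations; the output gate is left out because it does not feed back into c_t. The function `run_cell` and the pre-activation values are hypothetical and are not taken from the referenced break exercise.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def run_cell(z_f, z_i, z_g, steps, gate=sigmoid):
    """Iterate c_t = f * c_{t-1} + i * g with constant pre-activations."""
    c = 0.0
    for _ in range(steps):
        f, i = gate(z_f), gate(z_i)  # gate nonlinearity (or the buggy identity)
        g = math.tanh(z_g)           # candidate value, bounded in (-1, 1)
        c = f * c + i * g
    return c


healthy = run_cell(3.0, 1.0, 1.0, steps=100)
# Bug: sigmoid dropped, so the raw pre-activation 3.0 acts as the forget gate.
buggy = run_cell(3.0, 1.0, 1.0, steps=100, gate=lambda z: z)
print(f"with sigmoid: {healthy:.2f}   without: {buggy:.3e}")
```

With the sigmoid in place, f stays below 1 and the cell state settles to a bounded value; with the raw pre-activation as the gate, c_t roughly triples each step and passes 1e30 well before step 100.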

q-14-05 — Why does attention beat RNN even at short sequences? (free)

Expected to contain: path.

Even at T=5, the RNN's gradient at the first token is multiplied by r^4 ≈ 0.41 (for r=0.8), while attention's gradient path is O(1). The path-length argument applies at every T.