Break 00: Disable the LSTM gate sigmoid (replace with identity)
We break the gates: the σ of the forget, input, and output gates is replaced by the identity. The gates are no longer confined to (0, 1) and can take unbounded values. Predict: the cell c_t grows without bound; the gradient explodes; the loss is NaN within a few steps.
Anchors: LYNX_CORTEX.md §4 / PHASE 14; theory §02 RNN recurrence; theory §04 gradient flow; .claude/commands/break.md.
The break
In src/minimodel/nn/lstm.py:
class LSTM(Module):
    def forward(self, x_t: Tensor, h_prev: Tensor, c_prev: Tensor) -> tuple[Tensor, Tensor]:
        z = self.input_to_hidden(x_t) + self.hidden_to_hidden(h_prev)
        i, f, g, o = z.split(4, dim=-1)
        # BUG: removed the sigmoids on the gates.
        i = i  # was: i.sigmoid()
        f = f  # was: f.sigmoid()
        o = o  # was: o.sigmoid()
        g = g.tanh()  # candidate keeps tanh (unchanged)
        c_t = f * c_prev + i * g
        h_t = o * c_t.tanh()
        return h_t, c_t
Three single-line changes. Note that g.tanh() is kept, so the candidate cell update is still bounded; only the forget, input, and output gates lose their boundedness.
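Predict, then run
The forget gate's correct range is (0, 1). With identity, f_t can be -3, 5, -100, anything. The cell update:
$$c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$$
Now f_t is unbounded, so c_t is no longer kept under control: with sigmoid gates |c_t| < |c_{t-1}| + 1, whereas any |f_t| > 1 now scales c_{t-1} up geometrically. Worse: in backward, along the direct cell-state path,
$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$
so the gradient highway through c_t is now Π_{t} f_t, a product of unbounded reals. Exploding gradient is the dominant failure.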
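To make the geometric growth concrete, here is a minimal scalar sketch in plain Python (not the minimodel API). It iterates the cell update with constant gate pre-activations, once with sigmoid gates and once with identity gates, and tracks the product of f_t along the cell-state path. The name run_cell and the pre-activation value 3.0 are assumptions chosen for illustration; in a real run the gates are vectors that change at every step.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; maps a real pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def run_cell(pre_f: float, pre_i: float, g: float, steps: int, use_sigmoid: bool):
    """Iterate c_t = f * c_{t-1} + i * g with constant pre-activations.

    Returns (c_T, grad), where grad is the product of f over all steps,
    i.e. d c_T / d c_0 along the direct cell-state path.
    """
    f = sigmoid(pre_f) if use_sigmoid else pre_f
    i = sigmoid(pre_i) if use_sigmoid else pre_i
    c, grad = 0.0, 1.0
    for _ in range(steps):
        c = f * c + i * g
        grad *= f
    return c, grad


# Hypothetical values: both gate pre-activations at 3.0, candidate g = tanh(1.0).
g = math.tanh(1.0)
c_ok, grad_ok = run_cell(3.0, 3.0, g, steps=50, use_sigmoid=True)
c_bad, grad_bad = run_cell(3.0, 3.0, g, steps=50, use_sigmoid=False)

print(f"sigmoid gates : c_50 = {c_ok:.3f}, dc_50/dc_0 = {grad_ok:.3e}")
print(f"identity gates: c_50 = {c_bad:.3e}, dc_50/dc_0 = {grad_bad:.3e}")
```

With these values the sigmoid run keeps the gradient product below 1, while the identity run multiplies it by 3 at every step.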
Predictions
- NaN in training loss within 5-20 steps as c_t overflows fp32 (>3.4e38).
- If you reduce the LR by 1000×, the model survives a bit longer but converges to a degenerate solution where f_t saturates to ~0 (effectively forgetting everything; the LSTM has become a feedforward map).
- Gradient norms shoot through 1e10 before the NaN.
Write predictions in learners/borja/phase-14/notes/breaks.md before running.
Observe
Diagnostics:
- Plot ||c_t||_∞ per step. Should grow without bound.
- Plot gradient norms; >1e6 is the danger zone.
- After NaN: print i, f, o values from the last successful step. They should be far outside (0, 1).
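The diagnostics above could be collected with a few small helpers. This is a sketch under stated assumptions: it works on plain Python lists of floats, and how to flatten a minimodel Tensor into such a list is not shown in the source, so that step is left out. The helper names (inf_norm, gates_out_of_range, check_step) are hypothetical; the 1e6 threshold is the danger zone named in the list.

```python
import math


def inf_norm(values):
    """Largest absolute entry, i.e. ||v||_inf; NaN if any entry is NaN."""
    if any(math.isnan(v) for v in values):
        return math.nan
    return max(abs(v) for v in values)


def gates_out_of_range(gate_values, lo=0.0, hi=1.0):
    """Return the entries outside the open interval (lo, hi); NaN counts as outside."""
    return [v for v in gate_values if not (lo < v < hi)]


def check_step(step, c_t, grad_norm, danger=1e6):
    """Build a one-line report for a training step plus a list of warnings."""
    c_inf = inf_norm(c_t)
    flags = []
    if not math.isfinite(c_inf):
        flags.append("cell state is non-finite")
    if not math.isfinite(grad_norm) or grad_norm > danger:
        flags.append(f"grad norm non-finite or above {danger:.0e}")
    line = f"step {step}: ||c_t||_inf={c_inf:.3e} grad_norm={grad_norm:.3e}"
    return line, flags


# Made-up values for illustration only, not output from a real run.
line, flags = check_step(7, [0.5, -2.0e7, 3.0], grad_norm=4.2e8)
print(line, flags)
print(gates_out_of_range([0.3, -3.0, 5.0, 0.99]))
```

Calling check_step once per training step and keeping the returned lines gives the per-step series to plot; gates_out_of_range is meant for the i, f, o values saved from the last finite step.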
Symptom Borja will see
- loss = NaN in <20 steps.
- c_t.max() > 1e30 in the step before NaN.
- Gradient norm >1e8 in the steps leading to NaN.
Hidden cause (one sentence)
The sigmoid was removed from the forget/input/output gates, making them unbounded — c_t = f_t * c_{t-1} + i_t * g_t overflows because f_t is no longer in (0, 1).
Hint cascade
- What is the range of each LSTM gate, and why? Print the empirical range during the run.
- Compute c_t over a few steps by hand: what happens if f_t > 1?
- Look at the LSTM forward in nn/lstm.py. Are all four pre-gate activations passed through their correct nonlinearity?
Fix diff
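Going by the "was:" comments in the break, the fix presumably restores the three .sigmoid() calls on i, f, and o. The sketch below is a scalar, standard-library stand-in for that corrected forward pass, not the actual minimodel code; lstm_step_fixed and its argument names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function; output lies in [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_step_fixed(z_i, z_f, z_g, z_o, c_prev):
    """One scalar LSTM step with the three gate sigmoids restored."""
    i = sigmoid(z_i)     # restored: the break used the raw pre-activation
    f = sigmoid(z_f)     # restored
    o = sigmoid(z_o)     # restored
    g = math.tanh(z_g)   # candidate, untouched by the break
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t, (i, f, o)


# Gate pre-activations reuse the text's examples of "anything": -3, 5, -100.
h_t, c_t, gates = lstm_step_fixed(z_i=-3.0, z_f=5.0, z_g=0.7, z_o=-100.0, c_prev=2.0)
print("gates (i, f, o):", gates)
print("c_t:", c_t, "h_t:", h_t)
```

Even with pre-activations this extreme, the gates land back inside (0, 1), so a single step can change |c_t| by less than 1.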
Why this teaches the concept
The sigmoid in the LSTM gates isn't aesthetic; it's the mathematical requirement for the gating mechanism. f_t ∈ (0, 1) is what makes c_t = f_t · c_{t-1} + ... a controllable forget operation. Without the sigmoid, you no longer have "an LSTM": you have an arbitrary unbounded linear recurrence, which is exactly the model class LSTMs were invented to avoid (Hochreiter & Schmidhuber 1997, motivation section). This break is a hands-on demonstration that the output ranges of gating functions (LSTM and GRU gates, attention's softmax, expert gates in MoE) are constrained for gradient-flow reasons, not stylistic ones.
Next: Phase 15's /break on missing sqrt(d_k) scaling.