Break 00 — Disable the LSTM gate sigmoid (replace with identity)

We break the gates: the σ on the forget, input, and output gates is replaced by the identity. The gates no longer lie in (0, 1) and can take unbounded values. Prediction: the cell state c_t grows without control; the gradient explodes; the loss is NaN within a few steps.

Anchors: LYNX_CORTEX.md §4 / PHASE 14; theory §02 RNN recurrence; theory §04 gradient flow; .claude/commands/break.md.


The break

In src/minimodel/nn/lstm.py:

class LSTM(Module):
    def forward(self, x_t: Tensor, h_prev: Tensor, c_prev: Tensor) -> tuple[Tensor, Tensor]:
        z = self.input_to_hidden(x_t) + self.hidden_to_hidden(h_prev)
        i, f, g, o = z.split(4, dim=-1)
        # BUG: removed the sigmoids on the gates.
        i = i               # was: i.sigmoid()
        f = f               # was: f.sigmoid()
        o = o               # was: o.sigmoid()
        g = g.tanh()        # candidate keeps tanh (unchanged)

        c_t = f * c_prev + i * g
        h_t = o * c_t.tanh()
        return h_t, c_t

Three single-line changes. We keep g.tanh(), so the candidate cell update is still bounded in (-1, 1); only the forget, input, and output gates lose their boundedness.

Predict, then run

The forget gate's correct range is (0, 1). With identity, f_t can be -3, 5, -100, anything. The cell update:

\[ c_t = f_t \odot c_{t-1} + i_t \odot g_t \]

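Now f_t is unbounded, so c_t is no longer guaranteed to stay bounded. Worse, in the backward pass the direct path through the cell state gives, elementwise,

\[ \frac{\partial c_t}{\partial c_{t-1}} = f_t \quad (\text{unbounded}) \]

so the gradient highway through c_t is now ∏_t f_t, a product of unbounded reals. Wherever |f_t| > 1 persists across time steps, that product grows geometrically, so exploding gradients are the dominant failure.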
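To make the product argument concrete, here is a minimal scalar sketch of the cell recurrence with and without the gate sigmoid. It is not the minimodel implementation: `run_cell` and `highway_gain` are hypothetical helpers, and the pre-activations are held constant purely for illustration (in the real layer they depend on x_t and h_{t-1}).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def run_cell(pre_f: float, pre_i: float, pre_g: float,
             steps: int, use_sigmoid: bool) -> list[float]:
    """Iterate c_t = f * c_{t-1} + i * g for one scalar cell.

    Assumption: constant pre-activations, so f, i and g do not change
    from step to step.
    """
    f = sigmoid(pre_f) if use_sigmoid else pre_f
    i = sigmoid(pre_i) if use_sigmoid else pre_i
    g = math.tanh(pre_g)  # the candidate keeps its tanh in both variants
    c = 0.0
    history = []
    for _ in range(steps):
        c = f * c + i * g
        history.append(c)
    return history


def highway_gain(pre_f: float, steps: int, use_sigmoid: bool) -> float:
    """Product of dc_t/dc_{t-1} = f over `steps` steps (direct path only)."""
    f = sigmoid(pre_f) if use_sigmoid else pre_f
    return f ** steps


if __name__ == "__main__":
    for use_sigmoid in (True, False):
        label = "sigmoid " if use_sigmoid else "identity"
        c = run_cell(3.0, 1.0, 1.0, steps=50, use_sigmoid=use_sigmoid)
        gain = highway_gain(3.0, steps=50, use_sigmoid=use_sigmoid)
        print(f"{label}: |c_50| = {abs(c[-1]):.3e}, highway gain = {gain:.3e}")
```

With the sigmoid in place the cell state settles toward the fixed point i·g / (1 − f) and the highway gain shrinks below 1. With identity gates and f = 3, both the cell state and the gain grow like 3^t.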

Predictions

  1. NaN in training loss within 5-20 steps as c_t overflows fp32 (>3.4e38).
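  2. If you reduce the LR by 1000×, the model survives a bit longer but converges to a degenerate solution where f_t is driven to ~0 (an identity gate cannot saturate; it simply learns to output zero). The cell then forgets everything at each step, so c_t carries no memory and the LSTM behaves close to a feedforward map.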
  3. Gradient norms shoot through 1e10 before the NaN.
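As a back-of-envelope check on prediction 1, the sketch below estimates how many recurrence time steps a constant per-step factor |f_t| needs to push c_t past the fp32 maximum. `steps_to_overflow` is a hypothetical helper. It assumes a constant growth factor and ignores the additive i_t · g_t term, and it counts time steps inside one forward pass; the number of training steps before NaN also depends on how fast the updates inflate the weights, which this does not model.

```python
import math

FP32_MAX = 3.4028234663852886e38  # largest finite float32 value


def steps_to_overflow(growth: float, c0: float = 1.0) -> float:
    """Time steps until |c0| * |growth|**n exceeds FP32_MAX.

    Assumptions (for illustration only): |f_t| equals `growth` at every
    step, and the additive i_t * g_t term is ignored.
    """
    if c0 == 0.0 or abs(growth) <= 1.0:
        return math.inf  # no geometric growth under these assumptions
    return math.log(FP32_MAX / abs(c0)) / math.log(abs(growth))


if __name__ == "__main__":
    for growth in (0.9, 3.0, 100.0):
        print(f"|f| = {growth}: {steps_to_overflow(growth):.1f} steps")
```

Under these assumptions a gate stuck at |f| = 3 needs roughly 80 time steps to overflow, while |f| = 100 gets there in under 20, so long sequences or large gate values are what make the failure fast.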

Write predictions in learners/borja/phase-14/notes/breaks.md before running.

Observe

just exp 14-train-lstm --tag broken-no-gate-sigmoid

Diagnostics:

  1. Plot ||c_t||_∞ per step. Should grow without bound.
  2. Plot gradient norms per step; anything above 1e6 is the danger zone.
  3. After NaN: print i, f, o values from the last successful step. They should be far outside (0, 1).
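The three diagnostics above can be sketched with a few small helpers. These are hypothetical and operate on plain Python floats; how you get flat values out of a minimodel Tensor is not shown in this page, so that conversion is left as an assumption.

```python
import math
from typing import Iterable, Sequence


def inf_norm(values: Iterable[float]) -> float:
    """||v||_inf: largest absolute entry (0.0 for an empty input)."""
    return max((abs(v) for v in values), default=0.0)


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """L2 norm over all parameter gradients, treated as one long vector."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def gate_range(values: Sequence[float]) -> tuple[float, float]:
    """(min, max) of a gate's values; a healthy gate stays inside (0, 1)."""
    return min(values), max(values)


def fraction_outside_unit_interval(values: Sequence[float]) -> float:
    """Share of gate values outside the open interval (0, 1).

    NaN entries count as outside, because every comparison with NaN is False.
    """
    outside = sum(1 for v in values if not 0.0 < v < 1.0)
    return outside / len(values)
```

Logging `inf_norm` of c_t and `global_grad_norm` once per step gives the two plots; `gate_range` on the last finite step covers diagnostic 3.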

Symptom Borja will see

  • loss = NaN in <20 steps.
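  • ||c_t||_∞ > 1e30 in the step before NaN (take the absolute value: with f_t < 0 the cell state can be hugely negative, which a plain c_t.max() would miss).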
  • Gradient norm >1e8 in the steps leading to NaN.

Hidden cause (one sentence)

The sigmoid was removed from the forget/input/output gates, making them unbounded — c_t = f_t * c_{t-1} + i_t * g_t overflows because f_t is no longer in (0, 1).

Hint cascade

  1. What is the range of each LSTM gate, and why? Print the empirical range during the run.
  2. Compute c_t over a few steps by hand — what happens if f_t > 1?
  3. Look at the LSTM forward in nn/lstm.py. Are all four pre-gate activations passed through their correct nonlinearity?

Fix diff

i = i.sigmoid()
f = f.sigmoid()
o = o.sigmoid()
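One way to confirm the fix is to check the invariant it restores. With every gate in [0, 1] and |g_t| ≤ 1, the cell state satisfies |c_t| ≤ |c_{t-1}| + 1, so it can grow at most linearly. The scalar reference step below is a sketch under that reading; `lstm_cell_step` is a hypothetical stand-in for the layer's forward pass, not the minimodel code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_cell_step(pre_i: float, pre_f: float, pre_g: float,
                   pre_o: float, c_prev: float) -> tuple[float, float]:
    """One scalar LSTM step with the gate sigmoids restored."""
    i, f, o = sigmoid(pre_i), sigmoid(pre_f), sigmoid(pre_o)
    g = math.tanh(pre_g)
    for name, gate in (("i", i), ("f", f), ("o", o)):
        # Closed interval: float rounding can return exactly 0.0 or 1.0.
        assert 0.0 <= gate <= 1.0, f"gate {name} out of range: {gate}"
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t


if __name__ == "__main__":
    c = 0.0
    for t in range(1, 101):
        _, c = lstm_cell_step(2.0, 2.0, 2.0, 2.0, c)
        assert abs(c) <= t  # at most linear growth from c_0 = 0
    print(f"|c_100| = {abs(c):.3f}")
```

The same range assertion run against the broken layer should fail on the first step where any gate pre-activation leaves [0, 1].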

Why this teaches the concept

The sigmoid in the LSTM gates isn't aesthetic; it is what makes the gating mechanism work mathematically. f_t ∈ (0, 1) is what makes c_t = f_t · c_{t-1} + ... a controllable forget operation: each component of the old cell state is scaled down by a known fraction, never amplified or sign-flipped. Without the sigmoid, you no longer have "an LSTM"; you have an arbitrary recurrence with unbounded linear gates, which is exactly the model class LSTMs were invented to avoid (Hochreiter & Schmidhuber 1997, motivation section). This break is a hands-on demonstration that the activation functions in gated architectures (LSTM, GRU, attention's softmax, expert gates in MoE) are constrained to a bounded range to keep values and gradients under control, not for stylistic reasons.


Next: Phase 15's /break on missing sqrt(d_k) scaling.