Break — In-place op on a tensor that requires grad; show the autograd graph break

An in-place operation (add_, mul_, relu_) on a tensor with requires_grad=True breaks the autograd tape. PyTorch sometimes detects this and raises a clear error; sometimes it silently produces wrong gradients. We cause it on purpose in the mini-GPT and look at both variants.


Symptom Borja will see

Two implementations of the mini-GPT's residual connection inside a transformer block:

  • Run A (control):

    h = h + self.attn(self.ln1(h))   # out-of-place
    h = h + self.ffn(self.ln2(h))    # out-of-place
    

  • Run B (break):

    h.add_(self.attn(self.ln1(h)))   # in-place
    h.add_(self.ffn(self.ln2(h)))    # in-place
    

The forward outputs are numerically identical (the in-place op writes the same values). But on loss.backward():

  • Run A: completes normally. Gradients flow. Training continues.
  • Run B (variant 1, PyTorch ≥ 1.5 with torch.autograd.set_detect_anomaly(True)): raises the error below, together with a traceback of the forward op whose backward failed:
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.FloatTensor [1, 8, 64]], which is
output 0 of NativeLayerNormBackward, is at version 2; expected version 0
instead.
  • Run B (variant 2, no anomaly detection): the version check usually still raises the same RuntimeError, just without the forward traceback. If the write bypasses the version counter (for example through .data), backward silently computes wrong gradients.

In the silent case the write reaches a saved tensor's storage without bumping its version counter, so backward computes the gradient from the post-overwrite value. Training proceeds but converges to a different (wrong) minimum. Test accuracy is worse than baseline by 3-10%, but the run itself doesn't error.
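The two runs can be reproduced at toy scale. This is a minimal sketch, not code from src/minigpt/block.py: `TinyBlock`, its `mix` layer (a stand-in for `self.attn`), and the tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical one-sublayer block; `mix` stands in for self.attn."""

    def __init__(self, d_model: int = 8, inplace: bool = False):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.mix = nn.Linear(d_model, d_model)
        self.inplace = inplace

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        if self.inplace:
            h.add_(self.mix(self.ln1(h)))   # Run B: overwrites ln1's saved input
            return h
        return h + self.mix(self.ln1(h))    # Run A: allocates a new tensor


def run(inplace: bool):
    torch.manual_seed(0)                    # same weights and input in both runs
    block = TinyBlock(inplace=inplace)
    x = torch.randn(1, 4, 8, requires_grad=True)
    h = x * 1.0                             # non-leaf, like an embedding output
    out = block(h)
    try:
        out.sum().backward()
        return out.detach().clone(), None
    except RuntimeError as err:
        return out.detach().clone(), str(err)


out_a, err_a = run(inplace=False)
out_b, err_b = run(inplace=True)
print(torch.allclose(out_a, out_b))   # forward values match
print(err_a)                          # no error in Run A
print(err_b)                          # the in-place RuntimeError in Run B
```

Both runs use the same seed, so any difference comes from the residual add alone.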

The break, mechanically

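Search-and-replace h = h + ... with h.add_(...) in src/minigpt/block.py. Or change F.relu(x) to F.relu_(x). Or change x.exp() to x.exp_(). Each of these is the break whenever the mutated tensor was saved for backward by an earlier op, as the residual stream h is by the LayerNorm that read it.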

The minimal version:

# In `src/minigpt/block.py`, replace:
h = h + self.attn(self.ln1(h))
# With:
h.add_(self.attn(self.ln1(h)))

Why this teaches the concept

PyTorch's autograd works by saving the exact forward tensor values that the backward needs. For example:

  • NativeLayerNormBackward saves the LN's input (because the LN gradient depends on it).
  • The next op (self.attn) reads this LN output and produces an attention output.
  • The residual add (h + attn_output) is followed by another LN (self.ln2), which saves its input — which is the residual sum.

If the residual sum is computed in-place as h.add_(attn_output), PyTorch updates h's underlying storage. But:

  • The previous LN's saved tensor was also pointing at h's storage (because PyTorch saves references, not copies, for memory efficiency).
  • The in-place update changes the saved value.
  • During backward, when the previous LN's backward() reads its saved input, it gets the new value, not the original.

PyTorch detects this via a version counter on each tensor's storage. Every in-place op increments the counter. When backward reads a saved tensor, it checks the counter against the value at save time. Mismatch → RuntimeError.
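The counter can be observed directly. The sketch below reads `_version`, an internal tensor attribute, so treat it as a debugging aid rather than a stable API; the tensor and variable names are illustrative.

```python
import torch

t = torch.zeros(3)
v0 = t._version          # counter before any mutation
t.add_(1.0)              # in-place: bumps the counter
v1 = t._version
u = t + 1.0              # out-of-place: new tensor, t's counter is untouched
v2 = t._version
row = t[:2]              # a view shares its base tensor's counter
row.mul_(2.0)            # mutating the view bumps the shared counter
v3 = t._version
print(v0, v1, v2, v3)
```

Autograd records the counter when it saves a tensor and compares it again when backward unpacks that tensor.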

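The version check does not depend on anomaly detection: it runs whenever backward unpacks a saved tensor, and anomaly detection only adds the traceback of the forward op. The check can still be bypassed, though rarely: a write through .data does not bump the tensor's counter, and a custom autograd Function that stashes tensors on ctx directly instead of calling ctx.save_for_backward() is not covered. If you hit one of these, you get silent wrong gradients.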
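The silent variant can be reproduced at toy scale. This sketch assumes one specific route, a write through `.data`, which shares storage with the tensor but not its version counter; the tensors and the `expected` name are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2.0                    # non-leaf intermediate
z = y * y                      # the multiply saves y for its backward
version_before = y._version
y.data.mul_(10.0)              # writes into y's storage without bumping y's counter
version_after = y._version
z.sum().backward()             # no error: the version check sees no change

expected = 8.0 * x.detach()    # d/dx (2x)^2 = 8x
print(x.grad)                  # computed from the overwritten y
print(expected)                # what the gradient should have been
```

The backward completes, and nothing in the output flags that `x.grad` was computed from the wrong value.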

The lesson:

  1. In-place ops are an optimization (saves memory by reusing storage).
  2. They are unsafe for any tensor whose forward value is needed in backward.
  3. PyTorch's safety net catches most cases but not all.
  4. The discipline: prefer out-of-place ops unless you've audited that no backward-needed tensor depends on the storage.

Diagnostic ladder Borja should walk

  1. First check: the error message. It gives the modified tensor's type and shape, the backward node that produced it, and the version expected vs. found. With anomaly detection on, it also prints a traceback to the forward op whose backward failed. This is the fastest diagnosis.
  2. Second check: enable anomaly detection if not on: torch.autograd.set_detect_anomaly(True) at the top of the training script.
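  3. Third check: search the codebase for underscore-suffixed ops (add_, mul_, relu_) on tensors with requires_grad. The grep is mechanical. Also look for augmented assignments such as h += x, which mutate a tensor in place without an underscore in sight.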
  4. Fourth check (if errors are silent): compare gradient values between Run A and Run B. They will differ.
  5. Diagnosis: an in-place op on a tensor that's saved for backward by some upstream node.
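The third check can be scripted. This is a rough sketch using only the standard library: the `find_inplace_calls` helper and its regex are hypothetical, they match any method call whose name ends in a single underscore, and they will also flag legitimate calls such as weight initializers, so each hit still needs a human look.

```python
import re

# Matches `.name_(` where name starts with a letter, e.g. `.add_(` or `.relu_(`.
# Dunder calls such as `.__init__(` do not match because `_` follows the dot.
INPLACE_CALL = re.compile(r"\.([A-Za-z]\w*?)_\(")


def find_inplace_calls(source: str):
    """Return (line_number, method_name) for each suspected in-place call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]          # crude: drop trailing comments
        for match in INPLACE_CALL.finditer(code):
            hits.append((lineno, match.group(1) + "_"))
    return hits


sample = '''\
h = h + self.attn(self.ln1(h))
h.add_(self.ffn(self.ln2(h)))
x = F.relu_(x)
super().__init__()
'''
hits = find_inplace_calls(sample)
print(hits)  # [(2, 'add_'), (3, 'relu_')]
```

The scan does not catch augmented assignments like `h += x`; those need a separate search.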

Reproducer

# Control
just phase-25-train inplace=false

# Break (with anomaly detection: clear error)
just phase-25-train inplace=true detect_anomaly=true

# Break (silent variant)
just phase-25-train inplace=true detect_anomaly=false

# Compare gradients
just phase-25-grad-compare experiments/25-A experiments/25-B

Hint cascade

  1. (Mild) "The error mentions 'version 2; expected version 0'. What changes a tensor's version?"
  2. (Medium) "Look for any *_ (underscore-suffix) method calls in the model code."
  3. (Direct) "The residual add is h.add_(...). PyTorch saves the pre-add value of h for the upstream LayerNorm's backward; your in-place add overwrites it."

Fix

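Replace h.add_(x) with h = h + x. Or, if you specifically need the in-place mutation for some downstream reason (rare), compute x first, then h = h.clone(); h.add_(x). The order matters: if x is computed from the clone, the LayerNorm saves the clone and the add_ overwrites it just the same.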

Generally, never use in-place ops on requires_grad=True tensors unless you have a very specific reason and have audited the graph.
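Both forms of the fix can be checked side by side. A minimal sketch, assuming a single LayerNorm followed by a hypothetical `proj` layer in place of the real sublayer; the shapes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(8)
proj = nn.Linear(8, 8)                  # hypothetical stand-in for the sublayer

x = torch.randn(1, 4, 8, requires_grad=True)
h = x * 1.0                             # non-leaf residual stream

# Fix 1: out-of-place add; the tensor ln saved is never touched.
out1 = h + proj(ln(h))

# Fix 2: compute the branch from h first, then mutate a clone of h.
branch = proj(ln(h))                    # ln saves h here
h2 = h.clone()                          # fresh storage and version counter
h2.add_(branch)                         # safe: h itself is unchanged
out2 = h2

same_forward = torch.allclose(out1, out2)
(out1.sum() + out2.sum()).backward()    # completes without a RuntimeError
print(same_forward, x.grad.shape)
```

The clone costs the same allocation the out-of-place add would, so Fix 2 only pays off when later code needs to keep mutating that tensor.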

When in-place IS safe

  • Optimizer step. param.data.add_(grad, alpha=-lr), or the same update written as param.add_(grad, alpha=-lr) under torch.no_grad(): the optimizer's step() runs after the gradients have been computed, outside the autograd graph, so in-place is correct there and saves memory.
  • Activations during inference. with torch.no_grad(): disables grad tracking; in-place is safe.
  • A fresh tensor with no autograd history. A torch.zeros(...) is fine to mutate.

The hazard is: in-place on a tensor that the autograd graph depends on. The fix isn't "never use add_" — it's "use add_ knowingly".
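A hand-written SGD step shows in-place used knowingly. This sketch swaps the `.data` idiom for a `torch.no_grad()` block; the parameter `w`, the loss, and `lr` are illustrative values, not taken from the mini-GPT.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)   # a leaf parameter
loss = (w ** 2).sum()
loss.backward()                                      # w.grad = 2 * w

lr = 0.1
with torch.no_grad():                                # nothing here is recorded
    w.add_(w.grad, alpha=-lr)                        # in-place SGD update
w.grad.zero_()                                       # in-place reset of the gradient buffer

print(w)                                             # each entry scaled by 0.8
```

By the time the update runs, backward has already consumed every saved tensor, so there is nothing left for the mutation to corrupt.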

What this break is NOT

  • Not a numerical-precision bug.
  • Not an architectural bug.
  • Not a hyperparameter bug.

It is an autograd graph hazard. Specific to PyTorch (and other frameworks with similar dynamic-graph autograd). Not a hazard in NumPy (no autograd) or in "compile-then-run" frameworks (gradients are derived symbolically).

Why this is the §A13 grammar-tutor-relevant break

When you implement the Phase-32 grammar-tutor agent in PyTorch, you'll be tempted to use in-place ops because the model is small and memory feels free. The temptation is especially high in the action / observation handling code that you'll write fresh. This break is the cautionary tale: it costs nothing at §A13 scale to use out-of-place ops; it costs you a debugging session every six months to use in-place ops carelessly.

Cross-refs

  • theory/04-autograd-tape-walk-mini-gpt.md — the tape the break corrupts.
  • theory/02-autograd-engine.md — how saved tensors work.
  • Phase 8 — the hand-built autograd you wrote before PyTorch; this break would also break it.