Break — In-place op on a tensor that requires grad; show the autograd graph break¶
An in-place operation (add_, mul_, relu_) on a tensor with requires_grad=True breaks the autograd tape. PyTorch sometimes detects this and raises a clear error; sometimes it silently produces wrong gradients. We trigger it on purpose in the mini-GPT and look at both variants.
Symptom Borja will see¶
Two implementations of the mini-GPT's residual connection inside a transformer block:
- Run A (control): h = h + self.attn(self.ln1(h))
- Run B (break): h.add_(self.attn(self.ln1(h)))
The forward outputs are numerically identical (the in-place op writes the same values). But on loss.backward():
- Run A: completes normally. Gradients flow. Training continues.
- Run B (variant 1, PyTorch ≥ 1.5 with torch.autograd.set_detect_anomaly(True)): raises
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.FloatTensor [1, 8, 64]], which is
output 0 of NativeLayerNormBackward, is at version 2; expected version 0
instead.
- Run B (variant 2, no anomaly detection): may silently compute wrong gradients (if a saved tensor was the in-place op's input) or crash with a less informative error.
If anomaly detection is off and the in-place op happens to overwrite a tensor whose forward value the autograd graph saved, the backward computes the wrong gradient using the post-overwrite value. Training proceeds but converges to a different (wrong) minimum. Test accuracy is worse than baseline by 3-10%, but the run itself doesn't error.
The break, mechanically¶
Search-and-replace h = h + ... with h.add_(...) in src/minigpt/block.py. Or change F.relu(x) to F.relu_(x). Or change x.exp() to x.exp_(). Any of these is the break.
The minimal version:
# In `src/minigpt/block.py`, replace:
h = h + self.attn(self.ln1(h))
# With:
h.add_(self.attn(self.ln1(h)))
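To reproduce the break outside the repo, a minimal sketch follows. TinyBlock, the nn.Linear standing in for attention, and the nn.Linear standing in for the embedding are assumptions for illustration; they are not the actual contents of src/minigpt/block.py.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for one residual sub-block of the mini-GPT."""

    def __init__(self, d_model: int = 64, inplace: bool = False):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.Linear(d_model, d_model)  # placeholder for real attention
        self.inplace = inplace

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        if self.inplace:
            h.add_(self.attn(self.ln1(h)))  # Run B: overwrites ln1's saved input
            return h
        return h + self.attn(self.ln1(h))   # Run A: allocates a new tensor


def backward_succeeds(inplace: bool) -> bool:
    torch.manual_seed(0)
    embed = nn.Linear(64, 64)               # makes h a non-leaf tensor, as in a real model
    block = TinyBlock(inplace=inplace)
    h = embed(torch.randn(1, 8, 64))
    try:
        block(h).sum().backward()
        return True
    except RuntimeError as err:
        print(f"backward failed: {str(err)[:90]}...")
        return False


run_a_ok = backward_succeeds(inplace=False)
run_b_ok = backward_succeeds(inplace=True)
```

In this sketch the failure comes from LayerNorm's saved input, which autograd version-checks, so Run B is expected to raise whether or not anomaly detection is enabled.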
Why this teaches the concept¶
PyTorch's autograd works by saving the exact forward tensor values that the backward needs. For example:
- NativeLayerNormBackward saves the LN's input (because the LN gradient depends on it).
- The next op (self.attn) reads this LN output and produces an attention output.
- The residual add (h + attn_output) is followed by another LN (self.ln2), which saves its input, which is the residual sum.
If the residual sum is computed in-place as h.add_(attn_output), PyTorch updates h's underlying storage. But:
- The previous LN's saved tensor was also pointing at h's storage (because PyTorch saves references, not copies, for memory efficiency).
- The in-place update changes the saved value.
- During backward, when the previous LN's backward() reads its saved input, it gets the new value, not the original.
PyTorch detects this via a version counter on each tensor's storage. Every in-place op increments the counter. When backward reads a saved tensor, it checks the counter against the value at save time. Mismatch → RuntimeError.
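The counter can be watched directly. The sketch below uses pow as a stand-in for any op that saves its input, and reads the counter through Tensor._version, which is an internal attribute (an assumption that it stays available in your PyTorch version).

```python
import torch

a = torch.ones(3, requires_grad=True)
h = a * 2.0               # non-leaf tensor, like an activation inside the model
out = h.pow(2)            # pow saves its input h for backward
version_at_save = h._version

h.add_(1.0)               # the in-place op bumps h's version counter
version_after = h._version

try:
    out.sum().backward()  # unpacking the saved h finds a newer version
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(version_at_save, version_after)
print(error_message)
```

The counter moves from its value at save time to a higher one, and backward refuses to use the overwritten tensor.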
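The version check does not depend on anomaly detection: it runs during backward whenever autograd reads back a tensor it saved, and anomaly detection only adds a traceback to the forward op that failed. The check does not cover every path, though. Code that bypasses it (rare), for example a mutation through .data, gets no error; if you hit one, you get silent wrong gradients.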
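One way to see the silent case in isolation is to mutate a saved tensor through .data, which writes to the same storage without touching the tensor's version counter. This is an illustrative sketch, not the mini-GPT break itself; pow again stands in for any op that saves its input.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
h = a * 1.0
out = h.pow(2)            # backward needs the original h: d(out)/dh = 2 * h

h.data.add_(1.0)          # overwrites h's storage; h's version counter is not bumped
out.sum().backward()      # no error is raised

expected = torch.tensor([2.0, 4.0, 6.0])   # 2 * original h
silent_grad = a.grad.clone()               # computed from the overwritten h instead
print(expected, silent_grad)
```

The backward pass completes, but the gradient is computed from the post-overwrite values, which is exactly the failure mode that is hard to spot during training.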
The lesson:
- In-place ops are an optimization (saves memory by reusing storage).
- They are unsafe for any tensor whose forward value is needed in backward.
- PyTorch's safety net catches most cases but not all.
- The discipline: prefer out-of-place ops unless you've audited that no backward-needed tensor depends on the storage.
Diagnostic ladder Borja should walk¶
- First check: the error message (if anomaly detection is on). It names the variable, the version expected vs got, and points to the originating op. This is the fastest diagnosis.
- Second check: enable anomaly detection if not on: torch.autograd.set_detect_anomaly(True) at the top of the training script.
- Third check: search the codebase for _ (underscore) suffixed ops on tensors with requires_grad. The grep is mechanical.
- Fourth check (if errors are silent): compare gradient values between Run A and Run B. They will differ.
- Diagnosis: an in-place op on a tensor that's saved for backward by some upstream node.
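The third check can also be done with Python's ast module instead of grep, which avoids matching comments and strings. The helper name find_inplace_calls and the sample source are hypothetical; the scan only flags candidates and cannot tell whether the tensor involved requires grad.

```python
import ast


def find_inplace_calls(source: str):
    """Return sorted (line, name) pairs for calls such as x.add_(...) or F.relu_(x)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            # Trailing underscore marks an in-place op; skip dunders like __init__.
            if name.endswith("_") and not name.startswith("_"):
                hits.append((node.lineno, name))
    return sorted(hits)


sample = '''
h = h + self.attn(self.ln1(h))
h.add_(self.mlp(self.ln2(h)))
x = F.relu_(x)
'''
hits = find_inplace_calls(sample)
print(hits)
```

Each hit still needs a manual look: initializers such as the nn.init functions also end in an underscore and are usually harmless.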
Reproducer¶
# Control
just phase-25-train inplace=false
# Break (with anomaly detection: clear error)
just phase-25-train inplace=true detect_anomaly=true
# Break (silent variant)
just phase-25-train inplace=true detect_anomaly=false
# Compare gradients
just phase-25-grad-compare experiments/25-A experiments/25-B
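The phase-25-grad-compare recipe itself is not shown here. As a rough idea of what such a comparison involves, the sketch below assumes each run's gradients are available as a dict from parameter name to tensor; the function name max_grad_diff and the toy values are hypothetical.

```python
import torch


def max_grad_diff(grads_a, grads_b):
    """Map each shared parameter name to the largest absolute gradient difference."""
    return {
        name: (grads_a[name] - grads_b[name]).abs().max().item()
        for name in grads_a
        if name in grads_b
    }


# Toy gradients standing in for two saved runs.
run_a = {"ln1.weight": torch.tensor([0.5, -0.25]), "attn.bias": torch.tensor([1.0])}
run_b = {"ln1.weight": torch.tensor([0.5, 0.75]), "attn.bias": torch.tensor([1.0])}
diffs = max_grad_diff(run_a, run_b)
print(diffs)
```

Parameters whose difference is far above floating-point noise point at the part of the graph where a saved tensor was overwritten.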
Hint cascade¶
- (Mild) "The error mentions 'version 2; expected version 0'. What changes a tensor's version?"
- (Medium) "Look for any *_ (underscore-suffix) method calls in the model code."
- (Direct) "The residual add is h.add_(...). PyTorch saves the pre-add value of h for the upstream LayerNorm's backward; your in-place add overwrites it."
Fix¶
Replace h.add_(x) with h = h + x. If you specifically need the in-place mutation for some downstream reason (rare), use h = h.clone(); h.add_(x) instead, so the mutation lands on fresh storage.
Generally, never use in-place ops on requires_grad=True tensors unless you have a very specific reason and have audited the graph.
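Both forms of the fix should give the same gradients. The sketch below checks this on a toy residual add, with a bare nn.LayerNorm as the branch; the function name input_grad and the fixed weights are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(4)
weights = torch.arange(8.0).reshape(2, 4)   # fixed weights so the loss is not flat


def input_grad(use_clone: bool) -> torch.Tensor:
    x = torch.tensor([[1.0, 2.0, 4.0, 8.0], [0.0, 1.0, 0.0, 3.0]], requires_grad=True)
    h = x * 1.0                  # non-leaf tensor, as inside a model
    update = ln(h)               # LayerNorm saves h for its backward
    if use_clone:
        h = h.clone()            # fresh storage; the saved h is untouched
        h.add_(update)           # in-place add on the clone only
    else:
        h = h + update           # out-of-place residual add
    (h * weights).sum().backward()
    return x.grad


grad_out_of_place = input_grad(use_clone=False)
grad_clone = input_grad(use_clone=True)
print(torch.allclose(grad_out_of_place, grad_clone))
```

Note that the clone allocates as much memory as the out-of-place add, so it buys nothing unless some later code really depends on mutating h.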
When in-place IS safe¶
- Optimizer step. param.data.add_(grad, alpha=-lr): the optimizer's step() is outside the autograd graph (gradients have already been computed). Inside the optimizer.step() body, in-place is correct and saves memory.
- Activations during inference. with torch.no_grad(): disables grad tracking; in-place is safe.
- A fresh tensor with no autograd history. A torch.zeros(...) is fine to mutate.
The hazard is: in-place on a tensor that the autograd graph depends on. The fix isn't "never use add_" — it's "use add_ knowingly".
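As an example of using add_ knowingly, here is a hand-written SGD step. It is a sketch of the pattern, not the code of any particular optimizer, and it uses torch.no_grad() where the list above writes param.data; both keep the update out of the graph.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                    # w.grad is 2 * w = [2.0, 4.0]

lr = 0.1
with torch.no_grad():              # the update itself is not recorded by autograd
    w.add_(w.grad, alpha=-lr)      # w <- w - lr * grad, reusing w's storage
w.grad.zero_()                     # in-place reset, also outside the graph

print(w)
```

The in-place update is safe here because backward has already finished and nothing recorded in the graph will read w's old value again.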
What this break is NOT¶
- Not a numerical-precision bug.
- Not an architectural bug.
- Not a hyperparameter bug.
It is an autograd graph hazard. Specific to PyTorch (and other frameworks with similar dynamic-graph autograd). Not a hazard in NumPy (no autograd) or in "compile-then-run" frameworks (gradients are derived symbolically).
Why this is the §A13 grammar-tutor-relevant break¶
When you implement the Phase-32 grammar-tutor agent in PyTorch, you'll be tempted to use in-place ops because the model is small and memory feels free. The temptation is especially high in the action / observation handling code that you'll write fresh. This break is the cautionary tale: it costs nothing at §A13 scale to use out-of-place ops; it costs you a debugging session every six months to use in-place ops carelessly.
Cross-refs¶
- theory/04-autograd-tape-walk-mini-gpt.md: the tape the break corrupts.
- theory/02-autograd-engine.md: how saved tensors work.
- Phase 8: the hand-built autograd you wrote before PyTorch; this break would also break it.