
01 — References, Mutation, and the GIL¶

Three Python concepts that feel like "I already know this" but become important once you work with tensors: (1) every variable in Python is a label on an object, not the object itself; (2) mutating a shared object changes what every one of its labels sees; (3) the GIL prevents pure-Python code from running in parallel, but calls into C (including NumPy) can release it, which is why threads running NumPy calls such as numpy.einsum can keep more than one core busy.

Python is reference-only¶

In Python, there are no "values" at the language level — only references to objects. Every variable is a label bound to an object somewhere in the heap. Assignment rebinds labels; it never copies.

a = [1, 2, 3]   # `a` labels a new list object
b = a           # `b` labels the SAME list object as `a`
b.append(4)
print(a)        # [1, 2, 3, 4]  ← mutation through `b` visible via `a`

The expression a is b evaluates to True. They are not two lists; they are two names for one list.
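
To see where aliasing ends and copying begins, the following minimal sketch contrasts a second label, a shallow copy, and a deep copy of a nested list. The variable names are illustrative only.

```python
import copy

a = [[1, 2], [3, 4]]
alias = a                   # same object, new label
shallow = list(a)           # new outer list, same inner lists
deep = copy.deepcopy(a)     # new outer list and new inner lists

print(alias is a)           # True
print(shallow is a)         # False
print(shallow[0] is a[0])   # True: the inner lists are still shared
print(deep[0] is a[0])      # False

a[0].append(99)             # mutate an inner list through `a`
print(shallow[0])           # [1, 2, 99]
print(deep[0])              # [1, 2]
```

Only the deep copy is isolated from later mutation of the nested lists; the shallow copy protects the outer list alone.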

This is fundamental, and most experienced programmers know it for list, dict, etc. The reason it shows up here is that NumPy arrays inherit this, and Tensor objects in minigrad will too:

import numpy as np
x = np.arange(5)        # x labels an array
y = x                   # y labels the SAME array
y[0] = 99
print(x)                # [99  1  2  3  4]

But also:

z = x[1:4]              # z is a VIEW: a different array object, but shares the same underlying buffer
z[0] = 42
print(x)                # [99 42  2  3  4]

The expression z is x evaluates to False: they are different ndarray objects. But z.base is x evaluates to True, and writes through z mutate x's buffer because the buffer is shared. This is the substrate for §2's strides-and-views material.
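
When it is unclear whether two arrays share a buffer, np.shares_memory can answer directly. The sketch below compares a basic slice (a view), integer-array indexing (a copy), and an explicit .copy(); the variable names are illustrative.

```python
import numpy as np

x = np.arange(5)
view = x[1:4]               # basic slicing returns a view
fancy = x[[1, 2, 3]]        # integer-array indexing returns a copy
snapshot = x[1:4].copy()    # explicit copy of the slice

print(np.shares_memory(x, view))      # True
print(np.shares_memory(x, fancy))     # False
print(np.shares_memory(x, snapshot))  # False

view[0] = 42                # writes into x's buffer
print(x.tolist())           # [0, 42, 2, 3, 4]
print(fancy.tolist())       # [1, 2, 3]
print(snapshot.tolist())    # [1, 2, 3]
```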

When this bites in AI code¶

A common pattern in training loops:

weights = list(model.parameters()) # a list of references to the parameter tensors
saved_weights = weights            # NOT a backup: same list, same tensors
optimizer.step()                   # mutates each parameter's data IN PLACE
# `saved_weights` now shows the post-step weights. The pre-step state is gone.

The fix is copy.deepcopy(weights) or [w.clone() for w in weights] — depending on your tensor library, and depending on whether you want autograd metadata copied too. Phase 8/9 will revisit this.
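
As a rough illustration, the sketch below uses plain NumPy arrays as a stand-in for parameters and a simple loop as a stand-in for optimizer.step(). Both stand-ins are assumptions made for the example; a real tensor library would use its own copy method (for instance a clone) instead of .copy().

```python
import copy
import numpy as np

# Hypothetical stand-in for the parameter list: two plain arrays.
weights = [np.ones(3), np.zeros(2)]

alias = weights                        # same list, same arrays
backup = [w.copy() for w in weights]   # new list, new buffers
backup_deep = copy.deepcopy(weights)   # same effect for plain arrays

# Hypothetical stand-in for optimizer.step(): update each array in place.
for w in weights:
    w -= 0.5

print(alias[0].tolist())        # [0.5, 0.5, 0.5]
print(backup[0].tolist())       # [1.0, 1.0, 1.0]
print(backup_deep[1].tolist())  # [0.0, 0.0]
```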

Mutation is action-at-a-distance¶

The previous example generalizes: any object passed into a function can be mutated by that function, and the caller can't tell from the signature whether mutation happens.

def normalize_in_place(arr):
    arr -= arr.mean()      # mutates the caller's array
    arr /= arr.std()
    # nothing returned

def normalize_pure(arr):
    return (arr - arr.mean()) / arr.std()   # new array, caller's untouched
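
Calling both versions on the same input makes the difference visible from the caller's side. This is a small sketch that repeats the two functions so it runs on its own; the sample data is arbitrary.

```python
import numpy as np

def normalize_in_place(arr):
    arr -= arr.mean()      # mutates the caller's array
    arr /= arr.std()

def normalize_pure(arr):
    return (arr - arr.mean()) / arr.std()

data = np.array([1.0, 2.0, 3.0, 4.0])

out = normalize_pure(data)
print(data.tolist())             # [1.0, 2.0, 3.0, 4.0]  caller's array untouched

result = normalize_in_place(data)
print(result)                    # None: the effect is only visible through `data`
print(np.allclose(data, out))    # True: same numbers, different ownership
```

Note that the in-place version needs a float array: applied to an integer array, the in-place subtraction of a float mean is rejected by NumPy's casting rules.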

minigrad will follow the functional convention (Phase 8 BLUEPRINT). Every op returns a new Tensor; no op mutates its inputs. This is more memory-hungry than PyTorch's mixed approach, but it makes the autograd DAG unambiguous: a Tensor is born from one forward computation and never changes.

PyTorch itself has both functional (F.relu(x)) and in-place (x.relu_()) variants. The in-place ones have a trailing underscore. When you reach Phase 25 (PyTorch internals), notice how loss.backward() works in place (it accumulates into .grad on every leaf parameter that requires gradients) while F.softmax is functional.

Identity, equality, hash¶

Three distinct concepts:

a is b — same object in memory (CPython: same id()).
a == b — values compare equal (calls __eq__).
hash(a) == hash(b) — used by dict / set membership.

For NumPy arrays, a == b compares elementwise and returns a boolean array, not a scalar. To ask whether two arrays are equal as a whole (same shape, all elements equal), use np.array_equal(a, b). For identity, use a is b.

numpy.ndarray is unhashable (mutable). You cannot put arrays in a set or use them as dict keys. Tuples of array shapes can be keys; arrays themselves cannot.

Tensor in minigrad will be unhashable by default too — same reason.
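
The three concepts can be checked side by side on two small arrays. In the sketch below, the dictionary keys (a shape tuple, and a tuple built from shape, dtype string, and raw bytes) are just one possible workaround for unhashable arrays, not a recommendation from this guide.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

print(a is b)                 # False: two distinct objects
print((a == b).tolist())      # [True, True, True]: elementwise result
print(np.array_equal(a, b))   # True: one bool for the whole array

try:
    bool(a == b)              # what `if a == b:` would do
except ValueError as exc:
    print("ambiguous truth value:", exc)

try:
    hash(a)
except TypeError as exc:
    print("unhashable:", exc)

cache = {a.shape: "keyed by shape"}          # tuples are hashable
key = (a.shape, a.dtype.str, a.tobytes())    # immutable snapshot of the contents
cache[key] = "keyed by contents"
```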

The Global Interpreter Lock, demystified¶

The GIL is the lock that ensures only one thread executes Python bytecode at a time within a CPython process. It exists because CPython's reference-counting memory management is not thread-safe without it.

Three consequences, plus a note on free-threaded builds:

1. CPU-bound pure-Python code does not scale to multiple cores¶

def square_sum(n):
    return sum(i * i for i in range(n))

# Running this on 8 threads via threading.Thread: ~no speedup.
# Running it in 8 processes via multiprocessing: up to ~8x speedup on 8 free cores.

This is the canonical "Python doesn't do threading" complaint. It's true for pure Python.
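
A quick way to observe this on your own machine is to time the same work serially and through a thread pool. The helper timed_map below is a hypothetical name introduced for this sketch, and the measured times will vary by machine, so no expected timings are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def square_sum(n):
    return sum(i * i for i in range(n))

def timed_map(executor_cls, n, workers):
    """Run square_sum(n) once per worker; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(square_sum, [n] * workers))
    return results, time.perf_counter() - start

n, workers = 200_000, 4

start = time.perf_counter()
serial = [square_sum(n) for _ in range(workers)]
serial_s = time.perf_counter() - start

threaded, threaded_s = timed_map(ThreadPoolExecutor, n, workers)
print(f"serial: {serial_s:.3f}s  threads: {threaded_s:.3f}s")
```

On a standard (GIL-enabled) build the two timings are typically close. Passing concurrent.futures.ProcessPoolExecutor to the same helper, inside an if __name__ == "__main__": guard, is one way to compare against processes.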

2. NumPy releases the GIL inside C calls¶

import numpy as np
a = np.random.randn(2_000, 2_000)
b = a @ a.T   # while NumPy runs this matrix product in compiled code, the GIL is RELEASED

While @ is executing in C, another Python thread can run. This is why thread-based data loading can be fast: the loader threads do file I/O and NumPy decoding (both release the GIL) while the training thread runs Python code concurrently.

The full rule: any function implemented in a C extension that explicitly calls Py_BEGIN_ALLOW_THREADS releases the GIL for the duration. NumPy's compute kernels do; many smaller utility functions don't (the overhead of releasing/reacquiring isn't worth it for short ops).
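
A minimal sketch of what this allows: several threads each running a matrix product at the same time. The matrix sizes and the gram helper are arbitrary choices for illustration, and whether this beats a serial loop on the wall clock depends on your NumPy build and on how many threads its BLAS library already uses internally.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
mats = [rng.standard_normal((300, 300)) for _ in range(4)]

def gram(m):
    # The product runs in compiled code, outside the interpreter loop.
    return m @ m.T

with ThreadPoolExecutor(max_workers=4) as pool:
    grams = list(pool.map(gram, mats))

print([g.shape for g in grams])
```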

3. The GIL doesn't protect your objects from race conditions¶

Two threads incrementing a shared int counter via counter += 1 can still race — that statement compiles to multiple bytecode instructions, and the GIL can switch between them. Use threading.Lock or queue.Queue.
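
One way to make the increment safe is to guard it with a threading.Lock. SafeCounter and hammer below are hypothetical names for this sketch.

```python
import threading

class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write happens while holding the lock,
        # so no other thread can interleave with it.
        with self._lock:
            self.value += 1

def hammer(counter, times):
    for _ in range(times):
        counter.increment()

counter = SafeCounter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)   # 40000
```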

For minigrad, this matters in Phase 18 when we wire up a data loader. The training thread reads tensors; the loader thread writes to a queue. The queue itself is thread-safe (it has its own lock); the tensors inside should be immutable from the producer's perspective once enqueued.
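
The producer/consumer shape described above might look roughly like this. It is a sketch with NumPy arrays standing in for tensors; the loader function, the sentinel, and the choice to freeze each batch with flags.writeable are assumptions for illustration, not minigrad's Phase 18 design.

```python
import queue
import threading
import numpy as np

batches = queue.Queue(maxsize=2)   # bounded: the loader blocks when it is full
SENTINEL = None                    # marks the end of the stream

def loader(n_batches):
    for i in range(n_batches):
        batch = np.full(3, float(i))
        batch.flags.writeable = False   # freeze the batch before handing it over
        batches.put(batch)
    batches.put(SENTINEL)

received = []
producer = threading.Thread(target=loader, args=(3,))
producer.start()

while True:
    batch = batches.get()
    if batch is SENTINEL:
        break
    received.append(float(batch.sum()))

producer.join()
print(received)   # [0.0, 3.0, 6.0]
```

With the writeable flag cleared, an accidental write to a batch raises ValueError instead of silently changing data that another thread may be reading.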

4. Python 3.13+ and "no-GIL" builds¶

Free-threaded CPython (PEP 703) ships as an experimental, optional build starting with Python 3.13. We'll touch on it only if relevant by Phase 35. The mental model stays correct: NumPy releases the GIL, your Python code doesn't (unless you opt into a free-threaded build).
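
To find out which kind of interpreter you are running, something like the following can help. gil_status is a hypothetical helper; it relies on the Py_GIL_DISABLED build variable and on sys._is_gil_enabled(), which is only present on Python 3.13 and later, so the sketch falls back to assuming the GIL is on when the function is missing.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled_now) for this interpreter."""
    # Py_GIL_DISABLED is set to 1 on free-threaded builds, absent or 0 otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Older interpreters lack this function; they always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = check() if check is not None else True
    return free_threaded_build, gil_enabled_now

print(gil_status())
```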

A worked tiny example¶

class Counter:
    def __init__(self, value=0):
        self.value = value
    def __iadd__(self, other):
        self.value += other                  # mutate this object in place
        return self
    def __add__(self, other):
        return Counter(self.value + other)   # build a new object

c1 = Counter()
c2 = c1
c1 += 5         # __iadd__ mutates the shared object and returns it
print(c2.value) # 5: c1 and c2 still label the same Counter

c3 = Counter()
c4 = c3
c3 = c3 + 5     # __add__ returns a new Counter; c3 is rebound to it
print(c4.value) # 0: c4 still labels the old, unchanged Counter

The takeaway: x += y and x = x + y behave differently for mutable objects that define __iadd__: the first mutates the shared object, the second rebinds the name to a new one. The same applies to tensor += other_tensor in autograd: if tensor is a leaf parameter, in-place addition changes its .data while the tensor keeps its identity and its place in the autograd graph; out-of-place addition creates a new tensor with a different graph node.

Phase 8 will resolve this by making Tensor.__iadd__ raise NotImplementedError — functional only. Phase 25 (PyTorch internals) will show how PyTorch handles the same question.

Pitfalls to lock in¶

Default-argument mutability. def f(x, history=[]) shares one history list across all calls. Use history=None and, inside the function, if history is None: history = []. (The shortcut history = history or [] also discards an empty list that the caller passed in on purpose.)
list(d.keys()) to mutate during iteration. Adding or removing keys while iterating over a dict raises RuntimeError. Iterate over list(d.keys()) to work from a snapshot.
copy.deepcopy cost. Deepcopy follows references and copies everything it reaches; for a Tensor whose data is a 100 MB array, deepcopy allocates a new 100 MB array. This comes up again in Phase 18.
np.array(some_list_of_tensors). NumPy will try to make an object array (slow, broken). Stack with np.stack([t.data for t in tensors]) instead.
threading vs multiprocessing. For NumPy-heavy code: threading is fine (GIL released in C). For pure-Python compute: multiprocessing. For most ML data loaders: multiprocessing (per-sample Python work is GIL-bound, and tensors can cross process boundaries via pickling, shared memory, or Arrow).
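
The first two pitfalls are easy to reproduce. The function names log_bad and log_good below are hypothetical, chosen only for this sketch.

```python
def log_bad(x, history=[]):        # one list, created once at definition time
    history.append(x)
    return history

def log_good(x, history=None):
    if history is None:            # fresh list per call unless one is passed in
        history = []
    history.append(x)
    return history

print(log_bad(1))    # [1]
print(log_bad(2))    # [1, 2]  state leaked from the previous call
print(log_good(1))   # [1]
print(log_good(2))   # [2]

d = {"a": 1, "b": 2, "c": 3}
for key in list(d.keys()):         # iterate over a snapshot of the keys
    if d[key] % 2 == 1:
        del d[key]                 # safe: the dict itself is not being iterated
print(d)             # {'b': 2}
```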

One-paragraph recap¶

Python variables are labels, not values; assignment rebinds, never copies. Mutation through one label is visible through all labels pointing at the same object — this is the substrate for NumPy views and for the in-place-vs-functional design choices later. The GIL serializes Python bytecode but is released by NumPy's C kernels, which is what makes multithreaded data loading viable. Internalizing these three points eliminates a whole class of "but I copied it!" bugs that would otherwise surface in Phase 8 when Tensor objects start sharing underlying arrays through views.

Next: 02-strides-and-broadcasting.md