00 - Motivation: why we need Module and Parameter
Phase 8 left us with Tensors and a working autograd. But if you try to train a network, even a simple one, you quickly discover that you need (1) a standard way to mark "this tensor is a learnable weight" and (2) a way to find all of a model's weights recursively so you can hand them to the optimizer. Parameter solves the first; Module (with its automatic registration via __setattr__) solves the second. It is pure ergonomics, but good ergonomics is the difference between writing a network in 10 lines or in 100.
The problem Module and Parameter solve
Suppose you have Phase 8's Tensor and want to train a 2-layer MLP. The naive approach:
import numpy as np

rng = np.random.default_rng(0)

# Tensor and cross_entropy come from Phase 8; x and y are the training batch.
W1 = Tensor(rng.standard_normal((23, 16)), requires_grad=True)
b1 = Tensor(np.zeros(16), requires_grad=True)
W2 = Tensor(rng.standard_normal((16, 5)), requires_grad=True)
b2 = Tensor(np.zeros(5), requires_grad=True)

def forward(x):
    h = (x @ W1 + b1).relu()
    return h @ W2 + b2

# Training loop
params = [W1, b1, W2, b2]  # ← maintain this list by hand
for _ in range(100):
    loss = cross_entropy(forward(x), y)
    loss.backward()
    for p in params:
        p.data -= 0.01 * p.grad  # plain SGD step
        p.grad = None            # reset before the next backward pass
This works. It even scales — for two layers. But notice:
- params is maintained by hand. Add a third layer, forget to extend params, and that layer never updates. The bug is silent: loss decreases (because the other layers still train), just less than it should.
- No encapsulation. "What is this model?" has no answer: it's the union of W1, b1, W2, b2, forward. There's no object you can pass around, save, load, or print.
- No reuse. If you want a 3-layer MLP, you copy-paste the layer code. No Linear abstraction.
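The silent-bug claim in the first bullet can be reproduced without any autograd. The sketch below is a plain-NumPy stand-in, not the Phase 8 Tensor: it fits a linear model with hand-written gradients, and the hand-kept params dict deliberately omits the bias b. All names and numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 3.0   # the target needs a bias of 3

w = np.zeros(3)
b = np.zeros(1)
params = {"w": w}   # bug: b was never added to the hand-kept collection

losses = []
for _ in range(200):
    err = x @ w + b - y
    losses.append(float(np.mean(err ** 2)))
    # Hand-written gradients of the mean squared error.
    grads = {"w": 2 * x.T @ err / len(x), "b": np.array([2 * err.mean()])}
    for name, p in params.items():
        p -= 0.05 * grads[name]   # in-place SGD step; b is never visited

print(losses[0], losses[-1], b)
```

The loss goes down because w still trains, yet b stays at its initial value and the loss plateaus well above zero. Nothing raises an error, which is what makes this class of bug hard to spot.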
The fix is two ideas:
- A Parameter class: a Tensor with requires_grad=True and a marker that says "I'm an owned weight". The marker is just a subclass (class Parameter(Tensor): pass); the magic is that Module knows how to find Parameters by reflection.
- A Module base class with two responsibilities:
  - When you do self.W1 = Parameter(...) inside a Module subclass's __init__, the base class intercepts the assignment (via __setattr__) and registers W1 in an internal _parameters dict.
  - module.parameters() walks the _parameters dict and recursively visits the submodules in _modules to yield every Parameter in the tree.
With these two pieces, the MLP becomes:
class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)

    def forward(self, x):
        return self.fc2(self.fc1(x).relu())

model = TenseMLP()
optim = SGD(model.parameters(), lr=0.01)
for _ in range(100):
    optim.zero_grad()
    loss = cross_entropy(model(x), y)
    loss.backward()
    optim.step()
Five lines for the training loop, one definition for the model. No parameter list to maintain. Add a third layer (self.fc3 = Linear(5, 5)), call it in forward, and its parameters are picked up automatically. That is what Module gives you.
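Linear is used in the MLP but not defined on this page. The following is one possible sketch of it, under stated assumptions: the Tensor class is a forward-only stand-in for the Phase 8 Tensor (no autograd), the weight is stored as (in_features, out_features) to match the naive example, and the seed argument is a hypothetical convenience. The real layer may differ.

```python
import numpy as np

class Tensor:
    """Forward-only stand-in for the Phase 8 Tensor (no autograd here)."""
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data, dtype=float)
        self.requires_grad = requires_grad
        self.grad = None
    def __matmul__(self, other):
        return Tensor(self.data @ other.data)
    def __add__(self, other):
        return Tensor(self.data + other.data)
    def relu(self):
        return Tensor(np.maximum(self.data, 0.0))

class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)
    def parameters(self):
        yield from self._parameters.values()
        for m in self._modules.values():
            yield from m.parameters()
    def __call__(self, *args):
        return self.forward(*args)

class Linear(Module):
    """Hypothetical layer computing x @ weight + bias."""
    def __init__(self, in_features, out_features, seed=0):
        super().__init__()
        rng = np.random.default_rng(seed)
        # Assumption: weight is (in, out), as in the naive W1/W2 example.
        self.weight = Parameter(rng.standard_normal((in_features, out_features)))
        self.bias = Parameter(np.zeros(out_features))
    def forward(self, x):
        return x @ self.weight + self.bias

class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)
    def forward(self, x):
        return self.fc2(self.fc1(x).relu())

model = TenseMLP()
out = model(Tensor(np.zeros((4, 23))))
shapes = [p.data.shape for p in model.parameters()]
print(out.data.shape, shapes)
```

Because dicts preserve insertion order, parameters() yields weight before bias within each layer and fc1 before fc2 across layers, with no list kept by hand.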
Why __setattr__ and not __init__ magic?
Two alternatives to __setattr__ registration:
- Alternative A: explicit register_parameter("W1", p). This works (PyTorch supports it as the underlying API). But it's verbose: every parameter needs two lines, p = Parameter(...) then self.register_parameter("W1", p). Most code wants the self.W1 = Parameter(...) shorthand.
- Alternative B: __init_subclass__ introspection. Pythonically clever but hard to debug. You scan class attributes at definition time. It doesn't handle parameters created at instance time (e.g., embedding tables whose size depends on __init__ args).
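To make Alternative A concrete, here is a minimal sketch in which the explicit call and the assignment shorthand coexist, with __setattr__ delegating to register_parameter. The method name mirrors PyTorch's, but the body is an illustrative assumption, and Parameter is a bare stand-in rather than the real Tensor subclass.

```python
class Parameter:
    """Stand-in marker class; the real one subclasses the Phase 8 Tensor."""
    def __init__(self, data):
        self.data = data

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})

    def register_parameter(self, name, param):
        # Alternative A: the explicit API. Validate, record, expose as attribute.
        if not isinstance(param, Parameter):
            raise TypeError(f"{name!r} must be a Parameter")
        self._parameters[name] = param
        object.__setattr__(self, name, param)

    def __setattr__(self, name, value):
        # The shorthand: plain assignment routes Parameters to the explicit API.
        if isinstance(value, Parameter):
            self.register_parameter(name, value)
        else:
            object.__setattr__(self, name, value)

class Explicit(Module):
    def __init__(self):
        super().__init__()
        p = Parameter([0.0])
        self.register_parameter("W1", p)    # two lines per parameter

class Shorthand(Module):
    def __init__(self, size):
        super().__init__()
        self.W1 = Parameter([0.0] * size)   # one line; size known only at instance time

a, b = Explicit(), Shorthand(size=3)
print(list(a._parameters), list(b._parameters))
```

Both routes end in the same dict. The Shorthand case also shows what Alternative B cannot handle: the parameter's size depends on a constructor argument, so it does not exist at class-definition time.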
PyTorch picked __setattr__ and the design has aged well. We copy it.
The mechanism in under 20 lines (Lab 00 builds this):
class Module:
    def __init__(self):
        # Use object.__setattr__ to bypass our own __setattr__ (chicken-and-egg).
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for m in self._modules.values():
            yield from m.parameters()
That's the whole framework. Everything else (Linear, Sequential, Adam) sits on top of this.
The subtlest point: you must call super().__init__() before assigning the first Parameter. If you forget, _parameters does not exist yet, and the first assignment fails. PyTorch has this same gotcha; it is the first error every beginner runs into.
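The gotcha is easy to trigger. This sketch reuses the Module mechanism from this page with a bare stand-in Parameter (an assumption, so the snippet runs on its own) and a hypothetical subclass, Forgetful, that skips super().__init__().

```python
class Parameter:
    """Stand-in marker; the real class subclasses the Phase 8 Tensor."""
    def __init__(self, data):
        self.data = data

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

class Forgetful(Module):
    def __init__(self):
        # Missing super().__init__(): _parameters was never created.
        self.weight = Parameter([1.0, 2.0])

try:
    Forgetful()
    error = None
except AttributeError as exc:
    error = exc

print(type(error).__name__, error)
```

The failure happens inside __setattr__, when it looks up self._parameters on an instance that never ran the base-class constructor.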
Parameter vs Tensor: the only difference
Parameter is not a richer class than Tensor. It is a Tensor with two things:
- requires_grad=True by default (because parameters are always learnable).
- It's an instance of the Parameter subclass, so isinstance(x, Parameter) distinguishes it from a regular Tensor, which is what __setattr__ needs.
That's it. No new ops, no new methods, no new state. The whole point of Parameter is to be a marker.
class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)
(A three-line class whose body only forwards to Tensor. The work is entirely in Module.)
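A short check of the marker idea, assuming a stripped-down stand-in for the Phase 8 Tensor (just data and a requires_grad flag). A Tensor that tracks gradients is still not a Parameter, so an isinstance filter would skip it.

```python
import numpy as np

class Tensor:
    """Stand-in for the Phase 8 Tensor: data plus a requires_grad flag."""
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data, dtype=float)
        self.requires_grad = requires_grad

class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)

activation = Tensor(np.zeros(3), requires_grad=True)   # tracked, but not a weight
weight = Parameter(np.zeros(3))

candidates = {"activation": activation, "weight": weight}
# The same test __setattr__ applies when deciding what to register.
registered = [name for name, v in candidates.items() if isinstance(v, Parameter)]
print(registered)
```

Both objects carry the same attributes; only the class differs, and that difference is the whole mechanism.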
What Module gives you for free
Once registration works, several conveniences fall out:
- Recursive parameter enumeration. model.parameters() yields every Parameter in the model tree.
- Recursive zero_grad(). Walks the tree, sets p.grad = None on every Parameter.
- Recursive state_dict(). Serializable representation: {"fc1.weight": ndarray, "fc1.bias": ndarray, ...}.
- Recursive load_state_dict(state). Inverse of the above.
- print(model) that walks the tree and emits something readable.
- Forward as __call__. model(x) calls model.forward(x). Convention from PyTorch.
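Most of the conveniences listed above reduce to one recursive walk that yields dotted names. The sketch below is one way they might be written, not the lab's reference solution: named_parameters is a helper introduced here for illustration, Parameter is a NumPy-backed stand-in, and the layers compute on raw arrays to keep the example short.

```python
import numpy as np

class Parameter:
    """Stand-in marker holding an ndarray; the real class subclasses Tensor."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None

class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def named_parameters(self, prefix=""):
        # Helper for this sketch: yields ("fc1.weight", param) style pairs.
        for name, p in self._parameters.items():
            yield prefix + name, p
        for name, m in self._modules.items():
            yield from m.named_parameters(prefix + name + ".")

    def zero_grad(self):
        for _, p in self.named_parameters():
            p.grad = None

    def state_dict(self):
        return {name: p.data.copy() for name, p in self.named_parameters()}

    def load_state_dict(self, state):
        for name, p in self.named_parameters():
            p.data[...] = state[name]   # copy values in place

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter(np.ones((n_in, n_out)))
        self.bias = Parameter(np.zeros(n_out))
    def forward(self, x):
        return x @ self.weight.data + self.bias.data

class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)
    def forward(self, x):
        return self.fc2(np.maximum(self.fc1(x), 0.0))

model = TenseMLP()
state = model.state_dict()
print(list(state))
```

state_dict() copies the arrays so that a saved snapshot does not change when training continues, and load_state_dict() writes in place so that existing references to the parameters stay valid.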
PyTorch's nn.Module adds more (hooks, train/eval mode, buffers, device transfer) — Phase 9 implements the minimum. Phase 10 adds train/eval; Phase 17 adds buffers (for positional embeddings); Phase 18 adds hooks (for gradient logging).
Why use the PyTorch convention?
Two reasons:
- Ergonomic transfer. The whole point of Phase 9's port drill (experiments/09-pytorch-port-drill/) is to confirm that the API is close enough to PyTorch that you can port a small script in 30 minutes. The closer the API, the smoother the transfer.
- Future bridges. Phase 24 imports PyTorch and uses real nn.Module. The mental model you build in Phase 9 must be the same one Phase 24 uses. If minimodel's Module worked totally differently, Phase 24's onboarding would cost extra cognitive effort.
We are not slavishly copying PyTorch — we omit much (device, dtype, hooks, buffers). But what we do implement matches PyTorch's API down to method names and parameter ordering.
Topic anchor (§A13)
The MLP we'll train at the end of this phase has input space (one-hot verb ⊕ one-hot person) — 23 dimensions — and output space (logits over 5 tenses). The model class is TenseMLP. Its submodules are fc1: Linear(23, 16) and fc2: Linear(16, 5). Its parameters() method must yield fc1.weight, fc1.bias, fc2.weight, fc2.bias in that order. The grammar grid (20 verbs × 3 persons × 5 tenses = 300 triples) is the source of training data.
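The dimensions quoted above can be checked with a few lines of NumPy. The encode helper and the index layout (verb slots first, then person slots) are assumptions made for illustration; the text only fixes the sizes.

```python
import numpy as np

N_VERBS, N_PERSONS, N_TENSES = 20, 3, 5

def encode(verb_idx, person_idx):
    """Concatenate a one-hot verb and a one-hot person into one 23-dim vector."""
    x = np.zeros(N_VERBS + N_PERSONS)
    x[verb_idx] = 1.0
    x[N_VERBS + person_idx] = 1.0
    return x

# Every (verb, person, tense) triple in the grammar grid.
grid = [(v, p, t)
        for v in range(N_VERBS)
        for p in range(N_PERSONS)
        for t in range(N_TENSES)]

# Weights plus biases for fc1: Linear(23, 16) and fc2: Linear(16, 5).
n_params = (23 * 16 + 16) + (16 * 5 + 5)

print(encode(0, 2).shape, len(grid), n_params)   # (23,) 300 469
```

So the four tensors that parameters() yields hold 469 scalars in total, trained on at most 300 labelled triples.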
By the end of Phase 9, Borja owns enough framework to write any small MLP — including the one that drives the §A13 conjugation tutor in Phase 32.
What this page does NOT cover
- Module.train() / eval() modes. Stubbed as a no-op in Phase 9; Phase 10's BatchNorm activates them.
- Buffers (non-learnable persistent state). Phase 11 (embeddings, positional encodings) needs them.
- nn.functional vs nn.Module duality. PyTorch has both. We have a thin minitorch.functional (Phase 8) and minimodel.nn.* (Phase 9). The duality is real: nn.CrossEntropyLoss()(logits, targets) calls minitorch.cross_entropy(logits, targets) under the hood. Document at phase open.
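The functional/module duality in the last bullet can be sketched in a few lines. Both pieces below are plain-NumPy stand-ins written for this example: the real cross_entropy operates on Phase 8 Tensors, and the real CrossEntropyLoss would presumably subclass Module.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Functional form: mean negative log-likelihood over a batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

class CrossEntropyLoss:
    """Module form: holds no weights, just forwards to the function."""
    def forward(self, logits, targets):
        return cross_entropy(logits, targets)
    def __call__(self, logits, targets):
        return self.forward(logits, targets)

logits = np.zeros((4, 5))          # uniform prediction over 5 tenses
targets = np.array([0, 1, 2, 3])
loss_fn = CrossEntropyLoss()
print(loss_fn(logits, targets), cross_entropy(logits, targets))
```

With uniform logits over 5 classes, both calls return log(5), which makes a convenient sanity check that the module form adds nothing beyond the call syntax.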
One-paragraph recap
Module and Parameter are an ergonomics layer over Phase 8's Tensor. Parameter is a Tensor subclass with requires_grad=True and an isinstance marker. Module uses __setattr__ to intercept attribute assignment, register Parameters and submodules in dicts, and expose a recursive parameters() method. ~20 lines of cleverness; the rest is straightforward composition. PyTorch's API is copied closely because it has aged well and because the Phase 24 onboarding to real PyTorch will be smoother for it.