00 — Motivation: why we need Module and Parameter

Phase 8 left us with Tensors and a working autograd. But if you try to train a network, even a simple one, you quickly discover that you need (1) a standard way to mark "this tensor is a learnable weight" and (2) a way to find all of a model's weights recursively so you can hand them to the optimizer. Parameter solves the first; Module (with its automatic registration via __setattr__) solves the second. It is pure ergonomics, but good ergonomics is the difference between writing a network in 10 lines and writing it in 100.


The problem Module and Parameter solve

Suppose you have Phase 8's Tensor and want to train a 2-layer MLP. The naive approach:

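import numpy as np

rng = np.random.default_rng(0)   # seeded generator, so the initial weights are reproducible

# Tensor and cross_entropy come from Phase 8; x and y are one batch of inputs and labels.
W1 = Tensor(rng.standard_normal((23, 16)), requires_grad=True)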
b1 = Tensor(np.zeros(16), requires_grad=True)
W2 = Tensor(rng.standard_normal((16, 5)), requires_grad=True)
b2 = Tensor(np.zeros(5), requires_grad=True)

def forward(x):
    h = (x @ W1 + b1).relu()
    return h @ W2 + b2

# Training loop
params = [W1, b1, W2, b2]   # ← maintain this list by hand
for _ in range(100):
    loss = cross_entropy(forward(x), y)
    loss.backward()
    for p in params:
        p.data -= 0.01 * p.grad
        p.grad = None

This works. It even scales — for two layers. But notice:

  1. params is maintained by hand. Add a third layer, forget to extend params, and that layer never updates. The bug is silent — loss decreases (because the other layers still train), just less than it should.
  2. No encapsulation. "What is this model?" has no answer — it's the union of W1, b1, W2, b2, forward. There's no object you can pass around, save, load, or print.
  3. No reuse. If you want a 3-layer MLP, you copy-paste the layer code. No Linear abstraction.
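The first failure mode can be reproduced without any autograd machinery. The sketch below is a hypothetical toy, not Phase 8 code: three scalar weights chained as a product, gradients written out by hand, and a params list that forgets the third weight. All names in it are illustrative.

```python
# Toy model: pred = w3 * w2 * w1 * x, trained toward a target with squared error.
# Gradients are derived by hand here; no Tensor or autograd is involved.
weights = {"w1": 1.0, "w2": 1.0, "w3": 1.0}
params = ["w1", "w2"]          # "w3" was added to the model but not to this list

x, target, lr = 1.0, 2.0, 0.01


def loss_and_grads(w):
    """Return the squared error and its gradient with respect to each weight."""
    pred = w["w3"] * w["w2"] * w["w1"] * x
    err = pred - target
    grads = {
        "w1": 2 * err * w["w3"] * w["w2"] * x,
        "w2": 2 * err * w["w3"] * w["w1"] * x,
        "w3": 2 * err * w["w2"] * w["w1"] * x,
    }
    return err ** 2, grads


initial_loss, _ = loss_and_grads(weights)
for _ in range(100):
    _, grads = loss_and_grads(weights)
    for name in params:            # only the listed weights are ever updated
        weights[name] -= lr * grads[name]
final_loss, _ = loss_and_grads(weights)

print(initial_loss, final_loss, weights["w3"])
```

The loss goes down because w1 and w2 still train, while w3 keeps its initial value. Nothing raises, which is what makes the bug hard to spot.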

The fix is two ideas:

  1. A Parameter class: a Tensor with requires_grad=True and a marker that says "I'm an owned weight". The marker is just a subclass (class Parameter(Tensor): pass); the magic is that Module can recognize Parameters with an isinstance check.
  2. A Module base class with two responsibilities:
       • Registration. When you do self.W1 = Parameter(...) inside a subclass's __init__, the base class intercepts the assignment (via __setattr__) and registers the Parameter under the name W1 in an internal _parameters dict.
       • Enumeration. module.parameters() walks the _parameters dict, then recursively visits the submodules in _modules, to yield every Parameter in the tree.

With these two pieces, the MLP becomes:

class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)
    def forward(self, x):
        return self.fc2(self.fc1(x).relu())

model = TenseMLP()
optim = SGD(model.parameters(), lr=0.01)
for _ in range(100):
    optim.zero_grad()
    loss = cross_entropy(model(x), y)
    loss.backward()
    optim.step()

Five lines for the training loop, one definition for the model. No parameter list to maintain. Add a third layer — self.fc3 = Linear(5, 5) — and it's automatically picked up. That is what Module gives you.

Why __setattr__ and not __init__ magic?

Two alternatives to __setattr__ registration:

  • Alternative A: explicit register_parameter("W1", p). This works (PyTorch supports it as the underlying API), but it is verbose: every parameter needs two lines, p = Parameter(...) then self.register_parameter("W1", p). Most code wants the self.W1 = Parameter(...) shorthand.
  • Alternative B: __init_subclass__ introspection. Clever, but hard to debug. You scan class attributes at class-definition time, so parameters created at instance time (e.g., embedding tables whose size depends on __init__ args) are missed.
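To make Alternative A concrete, here is a minimal sketch of how an explicit register_parameter and the __setattr__ shorthand can coexist, with the shorthand delegating to the explicit call. The Tensor stand-in and the exact body of register_parameter are assumptions made for illustration; this is not PyTorch's implementation, and Lab 00 may structure it differently.

```python
class Tensor:
    """Bare stand-in for Phase 8's Tensor (no autograd), for illustration only."""
    def __init__(self, data, requires_grad=False):
        self.data = data
        self.requires_grad = requires_grad


class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})

    def register_parameter(self, name, param):
        # Explicit API: record the parameter and expose it as an attribute.
        if not isinstance(param, Parameter):
            raise TypeError(f"{name!r} must be a Parameter")
        self._parameters[name] = param
        object.__setattr__(self, name, param)

    def __setattr__(self, name, value):
        # Shorthand: plain assignment of a Parameter delegates to the explicit API.
        if isinstance(value, Parameter):
            self.register_parameter(name, value)
        else:
            object.__setattr__(self, name, value)


class Explicit(Module):
    def __init__(self):
        super().__init__()
        p = Parameter([0.0, 0.0])
        self.register_parameter("W1", p)      # two lines per parameter


class Shorthand(Module):
    def __init__(self):
        super().__init__()
        self.W1 = Parameter([0.0, 0.0])       # one line, same registration


explicit_names = list(Explicit()._parameters)
shorthand_names = list(Shorthand()._parameters)
print(explicit_names, shorthand_names)
```

Both classes end up with the same registry; the only difference is how much the author has to type.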

PyTorch picked __setattr__ and the design has aged well. We copy it.

The mechanic in 20 lines (Lab 00 builds this):

class Module:
    def __init__(self):
        # Use object.__setattr__ to bypass our own __setattr__ — chicken-and-egg.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for m in self._modules.values():
            yield from m.parameters()

That's the whole framework. Everything else (Linear, Sequential, Adam) sits on top of this.

The subtlest point: you must call super().__init__() before assigning the first Parameter. If you forget, _parameters does not exist yet and the first assignment fails with an AttributeError. PyTorch has the same gotcha; it is the first error every beginner runs into.
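The registration mechanic and the super().__init__() gotcha can both be exercised in a few lines. The sketch below repeats the Module from this page so that it runs on its own; Tensor is a bare stand-in without autograd, and Linear's storage layout (nested lists, shape in_features by out_features) is an assumption.

```python
class Tensor:
    """Bare container standing in for Phase 8's Tensor."""
    def __init__(self, data, requires_grad=False):
        self.data = data
        self.requires_grad = requires_grad
        self.grad = None


class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for m in self._modules.values():
            yield from m.parameters()


class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Parameter([[0.0] * out_features for _ in range(in_features)])
        self.bias = Parameter([0.0] * out_features)


class ThreeLayer(Module):
    def __init__(self):
        super().__init__()          # must run before the first assignment below
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)
        self.fc3 = Linear(5, 5)     # picked up with no extra bookkeeping


class Forgetful(Module):
    def __init__(self):
        # super().__init__() is missing, so _parameters does not exist yet
        self.weight = Parameter([0.0])


n_params = len(list(ThreeLayer().parameters()))   # 3 layers x (weight, bias)

try:
    Forgetful()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(n_params, error_name)
```

ThreeLayer yields six Parameters even though no list was ever maintained by hand, and Forgetful fails on its very first assignment.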

Parameter vs Tensor: the only difference

Parameter is not a richer class than Tensor. It is a Tensor with two things:

  1. requires_grad=True by default (parameters are meant to be learned, although the constructor still accepts requires_grad=False).
  2. It's an instance of the Parameter subclass, so isinstance(x, Parameter) distinguishes it from a regular Tensor — which is what __setattr__ needs.

That's it. No new ops, no new methods, no new state. The whole point of Parameter is to be a marker.

class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)

(A three-line class. The work is entirely in Module.)
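A short check of the marker behaviour, as a sketch. The Tensor here is a stand-in, and its default of requires_grad=False is an assumption about Phase 8's Tensor rather than something this page states.

```python
class Tensor:
    """Stand-in for Phase 8's Tensor; the requires_grad=False default is assumed."""
    def __init__(self, data, requires_grad=False):
        self.data = data
        self.requires_grad = requires_grad


class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)


w = Parameter([0.1, 0.2])                       # learnable weight
t = Tensor([0.1, 0.2])                          # ordinary intermediate value
frozen = Parameter([0.1], requires_grad=False)  # still a Parameter, just not trained

checks = {
    "w is a Parameter": isinstance(w, Parameter),
    "w is also a Tensor": isinstance(w, Tensor),
    "t is a Parameter": isinstance(t, Parameter),
    "w requires grad": w.requires_grad,
    "frozen is a Parameter": isinstance(frozen, Parameter),
}
for label, value in checks.items():
    print(f"{label}: {value}")
```

The isinstance result, not the requires_grad flag, is what decides whether Module registers the object.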

What Module gives you for free

Once registration works, several conveniences fall out:

  • Recursive parameter enumeration. model.parameters() yields every Parameter in the model tree.
  • Recursive zero_grad(). Walks the tree, sets p.grad = None on every Parameter.
  • Recursive state_dict(). Serializable representation: {"fc1.weight": ndarray, "fc1.bias": ndarray, ...}.
  • Recursive load_state_dict(state). Inverse of the above.
  • print(model) that walks the tree and emits something readable.
  • Forward as __call__. model(x) calls model.forward(x). Convention from PyTorch.
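Several of the conveniences listed above can be sketched on top of the two registries. Everything beyond parameters() is an assumption about how Phase 9 might implement it: named_parameters is a hypothetical helper, data is held in plain Python lists rather than ndarrays, and Tensor is a bare stand-in without autograd.

```python
import copy


class Tensor:
    def __init__(self, data, requires_grad=False):
        self.data, self.requires_grad, self.grad = data, requires_grad, None


class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def named_parameters(self, prefix=""):
        # Own parameters first, then each submodule's, with dotted names.
        for name, p in self._parameters.items():
            yield prefix + name, p
        for name, m in self._modules.items():
            yield from m.named_parameters(prefix + name + ".")

    def parameters(self):
        for _, p in self.named_parameters():
            yield p

    def zero_grad(self):
        for p in self.parameters():
            p.grad = None

    def state_dict(self):
        return {name: copy.deepcopy(p.data) for name, p in self.named_parameters()}

    def load_state_dict(self, state):
        for name, p in self.named_parameters():
            p.data = copy.deepcopy(state[name])

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Parameter([[0.0] * out_features for _ in range(in_features)])
        self.bias = Parameter([0.0] * out_features)

    def forward(self, x):
        # x @ W + b for a single input vector, in plain Python
        n_out = len(self.bias.data)
        return [
            sum(xi * row[j] for xi, row in zip(x, self.weight.data)) + self.bias.data[j]
            for j in range(n_out)
        ]


class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)

    def forward(self, x):
        hidden = [max(0.0, v) for v in self.fc1(x)]   # relu
        return self.fc2(hidden)


model = TenseMLP()
names = [name for name, _ in model.named_parameters()]

state = model.state_dict()
state["fc2.bias"] = [1.0] * 5          # edit the saved state, then load it back
model.load_state_dict(state)
logits = model([0.0] * 23)             # model(x) dispatches to model.forward(x)

for p in model.parameters():
    p.grad = 1.0                       # pretend a backward pass filled the grads
model.zero_grad()
grads_cleared = all(p.grad is None for p in model.parameters())

print(names, logits, grads_cleared)
```

Because every method is built on the same recursive walk, the dotted names in state_dict() follow the registration order of the attributes.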

PyTorch's nn.Module adds more (hooks, train/eval mode, buffers, device transfer) — Phase 9 implements the minimum. Phase 10 adds train/eval; Phase 17 adds buffers (for positional embeddings); Phase 18 adds hooks (for gradient logging).

Why use the PyTorch convention?

Two reasons:

  1. Ergonomic transfer. The whole point of Phase 9's port drill (experiments/09-pytorch-port-drill/) is to confirm that the API is close enough to PyTorch that you can port a small script in 30 minutes. The closer the API, the smoother the transfer.
  2. Future bridges. Phase 24 imports PyTorch and uses real nn.Module. The mental model you build in Phase 9 must be the same one Phase 24 uses. If minimodel's Module worked totally differently, Phase 24's onboarding would cost extra cognitive effort.

We are not slavishly copying PyTorch — we omit much (device, dtype, hooks, buffers). But what we do implement matches PyTorch's API down to method names and parameter ordering.

Topic anchor (§A13)

The MLP we'll train at the end of this phase has input space (one-hot verb ⊕ one-hot person) — 23 dimensions — and output space (logits over 5 tenses). The model class is TenseMLP. Its submodules are fc1: Linear(23, 16) and fc2: Linear(16, 5). Its parameters() method must yield fc1.weight, fc1.bias, fc2.weight, fc2.bias in that order. The grammar grid (20 verbs × 3 persons × 5 tenses = 300 triples) is the source of training data.
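The 23-dimensional input can be sketched as a concatenation of two one-hot vectors. The integer indices, the helper names, and the choice to put the verb block before the person block are assumptions for illustration; the real encoding lives in the course's data pipeline.

```python
NUM_VERBS, NUM_PERSONS, NUM_TENSES = 20, 3, 5


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_input(verb_idx, person_idx):
    # one-hot verb (20 dims) concatenated with one-hot person (3 dims)
    return one_hot(verb_idx, NUM_VERBS) + one_hot(person_idx, NUM_PERSONS)


# Every (verb, person, tense) triple in the grammar grid.
grid = [
    (v, p, t)
    for v in range(NUM_VERBS)
    for p in range(NUM_PERSONS)
    for t in range(NUM_TENSES)
]

x = encode_input(verb_idx=7, person_idx=2)
print(len(x), len(grid))
```

The vector length matches the in_features of fc1, and the grid size matches the 300 triples quoted above.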

By the end of Phase 9, Borja owns enough framework to write any small MLP — including the one that drives the §A13 conjugation tutor in Phase 32.

What this page does NOT cover

  • Module.train() / eval() modes. Stubbed as a no-op in Phase 9; Phase 10's BatchNorm activates them.
  • Buffers (non-learnable persistent state). Phase 11 (embeddings, positional encodings) needs them.
  • nn.functional vs nn.Module duality. PyTorch has both. We have a thin minitorch.functional (Phase 8) and minimodel.nn.* (Phase 9). The duality is real: nn.CrossEntropyLoss()(logits, targets) calls minitorch.cross_entropy(logits, targets) under the hood. This is to be documented when the phase opens.

One-paragraph recap

Module and Parameter are an ergonomics layer over Phase 8's Tensor. Parameter is a Tensor subclass with requires_grad=True and an isinstance marker. Module uses __setattr__ to intercept attribute assignment, register Parameters and submodules in dicts, and expose a recursive parameters() method. ~20 lines of cleverness; the rest is straightforward composition. PyTorch's API is copied closely because it has aged well and because the Phase 24 onboarding to real PyTorch will be smoother for it.


Next: 01-parameter-and-module.md