01 — Parameter and Module: the registration mechanic

The whole trick of this phase fits in thirty lines. Parameter is a marker (a Tensor with requires_grad=True and nothing more); Module uses __setattr__ to detect when you assign it a Parameter or a sub-Module and stores them in internal dictionaries. The parameters() method walks those dictionaries recursively. Once you have this skeleton, Linear, Sequential, Adam, and any deep network are trivial extensions.


The four-line Parameter class

class Parameter(Tensor):
    """A Tensor that is an owned, learnable weight of a Module."""
    def __init__(self, data, requires_grad: bool = True) -> None:
        super().__init__(data, requires_grad=requires_grad)

Why bother with a subclass instead of just using Tensor(..., requires_grad=True)?

  • isinstance distinguishes them. Module.__setattr__ uses isinstance(value, Parameter) to decide what to register. If we used plain Tensors, we'd have to check requires_grad=True, but a forward intermediate (out = self.fc1(x)) has requires_grad=True too, because it was computed from a parameter. We need a way to say "this specific tensor is a root learnable weight". Subclassing is the simplest discriminator.
  • API clarity. When a user writes self.W = Parameter(rng.standard_normal(shape)), it's immediately clear this is a weight, not an activation.

PyTorch uses the same pattern: torch.nn.Parameter is a thin torch.Tensor subclass.
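A minimal sketch of why the flag alone cannot tell weights from activations. The Tensor below is a hypothetical stand-in for the Phase 8 class; only data, requires_grad, and a matmul that propagates the flag are assumed.

```python
import numpy as np


class Tensor:
    """Hypothetical stand-in for the Phase 8 Tensor."""

    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data, dtype=float)
        self.requires_grad = requires_grad

    def __matmul__(self, other):
        # The result requires grad if either operand does.
        return Tensor(self.data @ other.data,
                      requires_grad=self.requires_grad or other.requires_grad)


class Parameter(Tensor):
    def __init__(self, data, requires_grad=True):
        super().__init__(data, requires_grad=requires_grad)


rng = np.random.default_rng(0)
W = Parameter(rng.standard_normal((3, 2)))   # root learnable weight
x = Tensor(rng.standard_normal((4, 3)))      # input, no grad
out = x @ W                                  # forward intermediate

flag_check = [t.requires_grad for t in (W, out)]          # both True
type_check = [isinstance(t, Parameter) for t in (W, out)]  # only W is True
```

Both tensors carry requires_grad=True, but only W passes the isinstance test, which is the property the registration logic relies on.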

The Module base class

from typing import Any, Iterator


class Module:
    """Base class for all neural network modules."""

    def __init__(self) -> None:
        # Initialize the registration dicts via object.__setattr__ to avoid
        # infinite recursion in our own __setattr__.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
        object.__setattr__(self, "training", True)

    def __setattr__(self, name: str, value: Any) -> None:
        # If the new value is a Parameter, register it.
        if isinstance(value, Parameter):
            # If we're overwriting an existing Parameter or submodule, drop the old registration.
            self._parameters.pop(name, None)
            self._modules.pop(name, None)
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._parameters.pop(name, None)
            self._modules.pop(name, None)
            self._modules[name] = value
        else:
            # If we're overwriting something previously registered, drop the registration.
            self._parameters.pop(name, None)
            self._modules.pop(name, None)
        # In all cases, still set the attribute on the instance.
        object.__setattr__(self, name, value)

    def parameters(self) -> Iterator[Parameter]:
        """Yield all Parameters in this module and its submodules, depth-first."""
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()

    def zero_grad(self) -> None:
        """Set the gradient of every parameter to None (lazy convention)."""
        for p in self.parameters():
            p.grad = None

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

About thirty lines of code, not counting comments and docstrings. Walk through what each part buys you.

Why object.__setattr__ in __init__?

If __init__ did self._parameters = {}, that would call our own __setattr__. But at that moment self._parameters doesn't exist yet, so self._parameters.pop(name, None) (inside __setattr__) would crash. Chicken-and-egg. Bypass our own override by calling object.__setattr__ directly.

This is the single bit of Python magic in the whole module. Once past __init__, ordinary attribute syntax works.
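The chicken-and-egg problem can be reproduced in isolation. This is a sketch with hypothetical class names (NaiveModule, SafeModule) that keeps only the _parameters dict:

```python
class NaiveModule:
    def __init__(self):
        # Plain assignment is routed through __setattr__ below.
        self._parameters = {}

    def __setattr__(self, name, value):
        # Fails on the very first assignment: _parameters is not set yet.
        self._parameters.pop(name, None)
        object.__setattr__(self, name, value)


class SafeModule(NaiveModule):
    def __init__(self):
        # Bypass the override only while creating the registry itself.
        object.__setattr__(self, "_parameters", {})


try:
    NaiveModule()
    naive_error = None
except AttributeError as exc:
    naive_error = exc

safe = SafeModule()
safe.x = 1  # ordinary attribute syntax works once the dict exists
```

The same AttributeError is what a subclass sees if it assigns a Parameter before calling super().__init__().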

Why pop both _parameters and _modules before assigning?

Suppose a user does:

self.layer = Linear(2, 3)  # registers in _modules
self.layer = Parameter(rng.standard_normal((2, 3)))  # now we want it in _parameters, not _modules

If we don't pop the old registration, both dicts contain "layer", and parameters() walks the (now-stale) submodule. Always pop both to keep the dicts consistent.
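A small contrast experiment, assuming a condensed but equivalent __setattr__ and a hypothetical NoPopModule that skips the pops. Parameter is a bare stand-in here; the real class subclasses Tensor.

```python
class Parameter:
    """Hypothetical stand-in; the real Parameter subclasses Tensor."""


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Drop any stale registration first, then register by type.
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)


class NoPopModule(Module):
    """Same registration logic without the pops, for contrast only."""

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)


def reassign(cls):
    m = cls()
    m.layer = Module()      # registered as a submodule
    m.layer = Parameter()   # overwritten by a Parameter
    return sorted(m._parameters), sorted(m._modules)


good = reassign(Module)        # "layer" lives in _parameters only
stale = reassign(NoPopModule)  # "layer" lingers in _modules as well
```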

Why does parameters() recurse via submodule.parameters()?

To handle arbitrarily deep nesting:

class Block(Module):
    def __init__(self):
        super().__init__()
        self.fc = Linear(10, 10)

class Net(Module):
    def __init__(self):
        super().__init__()
        self.block1 = Block()
        self.block2 = Block()

Net().parameters() should yield 4 tensors (2 weights, 2 biases). The recursion is Net → block1.parameters() → fc.parameters() → [W, b], and same for block2.
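The count can be checked directly. The sketch below assumes a minimal Linear (weight of shape (out, in) plus a bias, zero-initialized for brevity) and bare stand-ins for Parameter and Module; the real Linear is the subject of the next page.

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in; the real Parameter subclasses Tensor."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        # Own parameters first, then each submodule depth-first.
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


class Linear(Module):
    """Assumed minimal layer: only the registration matters here."""

    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter(np.zeros((n_out, n_in)))
        self.bias = Parameter(np.zeros(n_out))


class Block(Module):
    def __init__(self):
        super().__init__()
        self.fc = Linear(10, 10)


class Net(Module):
    def __init__(self):
        super().__init__()
        self.block1 = Block()
        self.block2 = Block()


shapes = [p.data.shape for p in Net().parameters()]
```

Net itself owns no parameters; all four arrive through two levels of recursion.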

Why is zero_grad() p.grad = None?

Lazy convention from Phase 8. Setting grad to None makes the next _backward allocate a fresh gradient array instead of accumulating into the old one. The alternative (np.zeros_like) is a memory-eager mode useful in some long-running training loops; we'll switch to it in Phase 18 if profiling shows allocation overhead matters.
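To make the lazy convention concrete, here is a sketch of the accumulate-or-allocate step. The accumulate helper is hypothetical: it stands in for whatever a Phase 8 _backward closure does when it writes into p.grad.

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in with just data and grad."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None


def accumulate(p, g):
    """Hypothetical helper mirroring a _backward write into p.grad."""
    if p.grad is None:
        p.grad = np.array(g, dtype=float)  # first contribution: allocate
    else:
        p.grad += g                        # later contributions: add in place


p = Parameter([1.0, 2.0])
accumulate(p, np.array([0.5, 0.5]))
accumulate(p, np.array([0.5, 0.5]))
after_two = p.grad.copy()   # gradients from both calls are summed

p.grad = None               # what zero_grad() does for each parameter
accumulate(p, np.array([3.0, 4.0]))
after_reset = p.grad.copy() # only the new gradient, nothing carried over
```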

Why does forward() raise NotImplementedError?

It forces subclasses to define a forward. The pattern is:

class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)

    def forward(self, x):
        return self.fc2(self.fc1(x).relu())

model(x) calls model.__call__(x) which calls model.forward(x). PyTorch does this so it can intercept __call__ for hooks (Phase 18). Phase 9's Module doesn't need hooks, but we keep the indirection for API parity.

state_dict and load_state_dict — the serialization contract

The Phase 9 spec doesn't mandate full checkpointing, but the API surface should anticipate it. Sketch:

import numpy as np


class Module:
    # Sketch: these two methods extend the Module class defined earlier.
    def state_dict(self, prefix: str = "") -> dict[str, np.ndarray]:
        """Return a flat dict of named parameter arrays."""
        out = {}
        for name, p in self._parameters.items():
            out[prefix + name] = p.data
        for name, m in self._modules.items():
            out.update(m.state_dict(prefix + name + "."))
        return out

    def load_state_dict(self, state: dict[str, np.ndarray], prefix: str = "") -> None:
        """Load arrays into our parameters (by name)."""
        for name, p in self._parameters.items():
            if prefix + name in state:
                p.data[...] = state[prefix + name]  # in-place copy into the Parameter's array
        for name, m in self._modules.items():
            m.load_state_dict(state, prefix + name + ".")

Key choices:

  • Flat dot-separated keys ("fc1.weight", not nested dicts). Matches PyTorch.
  • In-place data copy in load_state_dict. Preserves the Parameter object identity — the optimizer's reference to that Parameter still works after loading.
  • Serialization format. Pickle is the spec antigoal (§5.1 of LYNX_CORTEX.md). Use safetensors from day one for actual disk persistence. The state_dict() method returns a plain dict of NumPy arrays, so the library's NumPy bindings do the I/O: safetensors.numpy.save_file(model.state_dict(), "path.safetensors"). Loading: model.load_state_dict(safetensors.numpy.load_file("path.safetensors")).
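An in-memory round trip illustrates the first two choices without touching disk. This is a sketch under assumptions: Parameter and Linear are minimal stand-ins, and the state_dict and load_state_dict bodies follow the sketch above.

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in; the real Parameter subclasses Tensor."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def state_dict(self, prefix=""):
        out = {prefix + n: p.data for n, p in self._parameters.items()}
        for n, m in self._modules.items():
            out.update(m.state_dict(prefix + n + "."))
        return out

    def load_state_dict(self, state, prefix=""):
        for n, p in self._parameters.items():
            if prefix + n in state:
                p.data[...] = state[prefix + n]  # in-place copy
        for n, m in self._modules.items():
            m.load_state_dict(state, prefix + n + ".")


class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter(np.zeros((n_out, n_in)))
        self.bias = Parameter(np.zeros(n_out))


class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)


src, dst = TenseMLP(), TenseMLP()
src.fc1.weight.data[...] = 1.0

held = dst.fc1.weight                  # reference an optimizer might hold
dst.load_state_dict(src.state_dict())

keys = sorted(src.state_dict())        # flat, dot-separated names
same_object = dst.fc1.weight is held   # identity survives the load
```

Note that state_dict() as sketched returns the live arrays rather than copies; copy them first if a snapshot must survive further training.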

The __repr__ for debugging

def __repr__(self) -> str:
    lines = [self.__class__.__name__ + "("]
    for name, m in self._modules.items():
        sub_repr = repr(m).replace("\n", "\n  ")
        lines.append(f"  ({name}): {sub_repr}")
    for name, p in self._parameters.items():
        lines.append(f"  ({name}): Parameter(shape={p.shape})")
    lines.append(")")
    return "\n".join(lines)

Calling print(model) should produce:

TenseMLP(
  (fc1): Linear(
    (weight): Parameter(shape=(16, 23))
    (bias): Parameter(shape=(16,))
  )
  (fc2): Linear(
    (weight): Parameter(shape=(5, 16))
    (bias): Parameter(shape=(5,))
  )
)

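Useful for sanity-checking parameter shapes before training. PyTorch's print(model) uses the same nested (name): Child layout, though it shows each layer's constructor arguments (for example in_features and out_features) rather than listing parameter shapes.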

Edge cases to test

  1. A Module with no parameters. Module().parameters() yields nothing. Doesn't raise.
  2. A Module whose forward never calls any submodule (degenerate). parameters() still yields the registered ones.
  3. A Module that registers the same Parameter twice under different names (shared parameters — tied embeddings).
    shared = Parameter(...)
    self.in_embedding = shared
    self.out_embedding = shared
    
    parameters() yields the same object twice, so an optimizer that iterates the list applies the update twice. That's a footgun. PyTorch avoids it: its parameters() deduplicates shared parameters by identity. We document the difference; Phase 17 may add a deduplication step.
  4. A list of Parameters. self.weights = [Parameter(...), Parameter(...)] does not register either parameter (the list isn't a Parameter or Module). To fix, PyTorch has ParameterList. Phase 9 documents the gotcha; defers ParameterList to Phase 14 (if it's needed for multi-head attention).
  5. Reassigning to None. self.fc1 = None does not raise, but it un-registers the previous module (the pop happens). Subtle behavior — test it.
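Cases 3 to 5 can be pinned down with a few assertions. The sketch below uses bare stand-ins for Parameter and Module, and unique_parameters is a hypothetical helper showing one way a deduplication step could look; it is not part of the Phase 9 API.

```python
class Parameter:
    """Hypothetical stand-in; the real Parameter subclasses Tensor."""

    def __init__(self, data):
        self.data = data


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


def unique_parameters(module):
    """Hypothetical dedup step: yield each Parameter object once."""
    seen = set()
    for p in module.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            yield p


class Tied(Module):
    def __init__(self):
        super().__init__()
        shared = Parameter([1.0])
        self.in_embedding = shared                # case 3: same object,
        self.out_embedding = shared               # two names
        self.weights = [Parameter([2.0]), Parameter([3.0])]  # case 4
        self.fc = Module()


m = Tied()
params = list(m.parameters())
unique = list(unique_parameters(m))

had_fc = "fc" in m._modules
m.fc = None                                       # case 5: un-registers
has_fc = "fc" in m._modules
```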

The hashable-by-identity invariant (from Phase 8)

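Parameter inherits __hash__ from Tensor (default identity). The optimizer relies on this implicitly whenever it keys a dict or set by parameter. Do not override Parameter.__eq__ for elementwise comparison: in Python, a class that defines __eq__ without also defining __hash__ gets __hash__ set to None, so Parameters would become unhashable and any set or dict keyed by them would raise TypeError.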
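The invariant is easy to demonstrate with two hypothetical subclasses of a toy Tensor, one that leaves equality alone and one that overrides __eq__ elementwise:

```python
class Tensor:
    """Toy stand-in: default identity-based __eq__ and __hash__."""

    def __init__(self, data):
        self.data = list(data)


class IdentityParam(Tensor):
    pass


class ElementwiseParam(Tensor):
    def __eq__(self, other):
        # Elementwise comparison; defining __eq__ here without __hash__
        # makes Python set ElementwiseParam.__hash__ to None.
        return [a == b for a, b in zip(self.data, other.data)]


p = IdentityParam([1.0])
state = {p: "adam moments"}   # works: hashed by identity

try:
    {ElementwiseParam([1.0]): "adam moments"}
    hash_error = None
except TypeError as exc:      # unhashable type
    hash_error = exc
```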

Pitfalls (will bite in lab)

  1. Forgetting super().__init__(). It must be the first line of every Module subclass __init__. Without it, _parameters doesn't exist, and the first self.W = Parameter(...) raises AttributeError.
  2. Assigning a plain Tensor thinking it'll be a parameter. It won't — __setattr__ checks isinstance(value, Parameter). Test: model = MyModel(); list(model.parameters()) should be exactly the expected count.
  3. Capturing the iterator instead of the list. params = model.parameters() is a generator; once exhausted, it's empty. The optimizer accepts a list. Cast: optim = SGD(list(model.parameters()), lr=0.01).
  4. Re-creating a layer in forward(). def forward(self, x): self.fc = Linear(...); return self.fc(x) creates a fresh Linear (with random weights) on every forward pass. Each new layer's Parameters are registered, but they replace the previous ones, so the optimizer keeps updating weights the model no longer uses. Training never works. It's a hard bug because the loss does change between forward passes, just to random values.
  5. Calling model.forward(x) instead of model(x). Works in Phase 9 (no hooks). Will silently bypass hooks in Phase 18.
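Pitfalls 3 and 4 both show up in a few lines. This sketch uses bare stand-ins for Parameter and Module and a hypothetical Recreates class that rebuilds its weight inside forward():

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in; the real Parameter subclasses Tensor."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


class Recreates(Module):
    def __init__(self):
        super().__init__()
        self.w = Parameter(np.ones(1))

    def forward(self, x):
        self.w = Parameter(np.ones(1))  # pitfall 4: fresh weight each call
        return x * self.w.data


model = Recreates()

# Pitfall 3: a generator can be consumed only once.
gen = model.parameters()
first_pass = len(list(gen))
second_pass = len(list(gen))

# Pitfall 4: the list an optimizer holds goes stale after one forward.
tracked = list(model.parameters())
model.forward(np.ones(1))
still_live = tracked[0] is model.w
```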

Topic anchor (§A13)

The Phase 9 capstone TenseMLP(Module) has exactly two submodules (fc1, fc2) and four parameters total. model.parameters() must yield them in this order: fc1.weight, fc1.bias, fc2.weight, fc2.bias. Verify with a test before training. If the order is wrong, the optimizer state (m, v for Adam) is associated with the wrong parameter — Adam still "trains" but the moment estimates are misaligned. Hard-to-diagnose silent bug.
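One way to write that ordering test, assuming minimal stand-ins for Parameter, Module, and Linear. The named_parameters function is a hypothetical helper introduced only to attach names to the walk; it relies on Python dicts preserving insertion order.

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in; the real Parameter subclasses Tensor."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)


class Module:
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


def named_parameters(module, prefix=""):
    """Hypothetical helper: same walk as parameters(), with dotted names."""
    for name, p in module._parameters.items():
        yield prefix + name, p
    for name, m in module._modules.items():
        yield from named_parameters(m, prefix + name + ".")


class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = Parameter(np.zeros((n_out, n_in)))
        self.bias = Parameter(np.zeros(n_out))


class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)


model = TenseMLP()
named = list(named_parameters(model))
names = [n for n, _ in named]
# The named walk and parameters() must agree object by object.
same_order = all(a is b for (_, a), b in zip(named, model.parameters()))
```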

What this page does NOT cover

  • Module.train() / eval() modes. Stubbed as a no-op in Lab 00; activated in Phase 10.
  • Buffers (non-learnable persistent state). Phase 11.
  • Hooks (register_forward_hook). Phase 18.

One-paragraph recap

Parameter is a marker subclass of Tensor. Module.__init__ initializes two dicts (_parameters and _modules) via object.__setattr__ to avoid recursion. Module.__setattr__ inspects every assigned value: Parameters land in _parameters, Modules in _modules, and anything else is stored as a plain attribute. parameters() walks the two dicts recursively. zero_grad, state_dict, load_state_dict, and __repr__ all use the same walk. About 30 lines of code; the rest of the framework (Linear, Sequential, optimizers) sits cleanly on top.


Next: 02-linear-and-sequential.md