01 — Parameter and Module: the registration mechanic¶
The trick of the whole phase fits in thirty lines.
Parameter is a marker (a Tensor with requires_grad=True and nothing more); Module uses __setattr__ to detect when you assign it a Parameter or a sub-Module, and stores them in internal dictionaries. The parameters() method walks those dictionaries recursively. Once you have this skeleton, Linear, Sequential, Adam, and any deep network are trivial extensions.
The five-line Parameter class¶
class Parameter(Tensor):
    """A Tensor that is an owned, learnable weight of a Module."""

    def __init__(self, data, requires_grad: bool = True) -> None:
        super().__init__(data, requires_grad=requires_grad)
Why bother with a subclass instead of just using Tensor(..., requires_grad=True)?
- isinstance distinguishes them. Module.__setattr__ uses isinstance(value, Parameter) to decide what to register. If we used plain Tensors, we'd have to check requires_grad=True, but a forward intermediate (out = self.fc1(x)) has requires_grad=True too, because the flag propagates from its parameter parent. We need a way to say "this specific tensor is a root learnable weight". Subclassing is the simplest discriminator.
- API clarity. When a user writes self.W = Parameter(rng.standard_normal(shape)), it's immediately clear this is a weight, not an activation.
PyTorch uses the same pattern: torch.nn.Parameter is a thin torch.Tensor subclass.
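A minimal sketch of that distinction, assuming a stand-in Tensor class (the real Phase 8 Tensor carries a backward graph; the `scale` method here is hypothetical and exists only to produce an intermediate):

```python
class Tensor:
    """Stand-in for the Phase 8 Tensor (assumption: only the fields needed here)."""

    def __init__(self, data, requires_grad: bool = False) -> None:
        self.data = data
        self.requires_grad = requires_grad
        self.grad = None

    def scale(self, k: float) -> "Tensor":
        # Hypothetical op: the result picks up requires_grad from its parent,
        # just like a forward intermediate would.
        return Tensor([k * v for v in self.data], requires_grad=self.requires_grad)


class Parameter(Tensor):
    """A Tensor that is an owned, learnable weight of a Module."""

    def __init__(self, data, requires_grad: bool = True) -> None:
        super().__init__(data, requires_grad=requires_grad)


w = Parameter([1.0, 2.0])
out = w.scale(3.0)  # intermediate, not a root weight

# requires_grad cannot tell them apart...
print(w.requires_grad, out.requires_grad)
# ...but isinstance can.
print(isinstance(w, Parameter), isinstance(out, Parameter))
```

Both tensors report requires_grad=True, yet only `w` passes the isinstance check, which is the property the registration logic relies on.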
The Module base class¶
from typing import Any, Iterator


class Module:
    """Base class for all neural network modules."""

    def __init__(self) -> None:
        # Initialize the registration dicts via object.__setattr__ to bypass
        # our own __setattr__, which expects these dicts to exist already.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
        object.__setattr__(self, "training", True)

    def __setattr__(self, name: str, value: Any) -> None:
        # If the new value is a Parameter, register it.
        if isinstance(value, Parameter):
            # If we're overwriting an existing Parameter or submodule, drop the old registration.
            self._parameters.pop(name, None)
            self._modules.pop(name, None)
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._parameters.pop(name, None)
            self._modules.pop(name, None)
            self._modules[name] = value
        else:
            # If we're overwriting something previously registered, drop the registration.
            self._parameters.pop(name, None)
            self._modules.pop(name, None)
        # In all cases, still set the attribute on the instance.
        object.__setattr__(self, name, value)

    def parameters(self) -> Iterator[Parameter]:
        """Yield all Parameters in this module and its submodules, depth-first."""
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()

    def zero_grad(self) -> None:
        """Set the gradient of every parameter to None (lazy convention)."""
        for p in self.parameters():
            p.grad = None

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
Twenty-five lines. Walk through what each part buys you.
Why object.__setattr__ in __init__?¶
If __init__ did self._parameters = {}, that would call our own __setattr__. But at that moment self._parameters doesn't exist yet, so self._parameters.pop(name, None) (inside __setattr__) would crash. Chicken-and-egg. Bypass our own override by calling object.__setattr__ directly.
This is the single bit of Python magic in the whole module. Once past __init__, ordinary attribute syntax works.
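The chicken-and-egg failure can be reproduced in a few lines. This is a sketch with two hypothetical classes (NaiveModule and SafeModule are illustrative names, not part of the framework), stripped down to a single registration dict:

```python
class NaiveModule:
    """Sketch of the bug: __init__ uses ordinary assignment."""

    def __init__(self) -> None:
        self._parameters = {}  # routed through our __setattr__ below

    def __setattr__(self, name, value) -> None:
        # Reads self._parameters before it has ever been set.
        self._parameters.pop(name, None)
        object.__setattr__(self, name, value)


class SafeModule(NaiveModule):
    """Same __setattr__, but __init__ bypasses it for the bootstrap dict."""

    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})


try:
    NaiveModule()
    naive_error = None
except AttributeError as exc:
    naive_error = exc

safe = SafeModule()
safe.x = 1  # ordinary attribute syntax works once the dict exists

print(type(naive_error).__name__)
print(safe.x)
```

The naive version fails with an AttributeError on its very first assignment, while the bypassed version constructs cleanly and accepts normal assignments afterwards.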
Why pop both _parameters and _modules before assigning?¶
Suppose a user does:
self.layer = Linear(2, 3)                             # registers in _modules
self.layer = Parameter(rng.standard_normal((2, 3)))  # now we want it in _parameters, not _modules
If we don't pop the old registration, both dicts contain "layer", and parameters() walks the (now-stale) submodule. Always pop both to keep the dicts consistent.
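To make the stale-registration problem concrete, here is a sketch comparing a hypothetical variant that never clears the other dict (SloppyModule) against one that pops both first (CarefulModule). Both class names and the bare Parameter stand-in are assumptions for illustration only:

```python
class Parameter:
    """Stand-in marker class (assumption: the real one subclasses Tensor)."""


class SloppyModule:
    """Registers values but never clears the other dict (hypothetical variant)."""

    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value) -> None:
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, SloppyModule):
            self._modules[name] = value
        object.__setattr__(self, name, value)


class CarefulModule(SloppyModule):
    """Pops both dicts first, as the Module on this page does."""

    def __setattr__(self, name, value) -> None:
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        super().__setattr__(name, value)


sloppy, careful = SloppyModule(), CarefulModule()
for m in (sloppy, careful):
    m.layer = SloppyModule()  # first a submodule...
    m.layer = Parameter()     # ...then overwritten by a Parameter

print(sorted(sloppy._modules), sorted(sloppy._parameters))
print(sorted(careful._modules), sorted(careful._parameters))
```

In the sloppy variant "layer" ends up in both dicts; in the careful variant it appears only in _parameters.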
Why does parameters() recurse via submodule.parameters()?¶
To handle arbitrarily deep nesting:
class Block(Module):
    def __init__(self):
        super().__init__()
        self.fc = Linear(10, 10)


class Net(Module):
    def __init__(self):
        super().__init__()
        self.block1 = Block()
        self.block2 = Block()
Net().parameters() should yield 4 tensors (2 weights, 2 biases). The recursion is Net → block1.parameters() → fc.parameters() → [W, b], and same for block2.
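A runnable version of that walk, as a sketch: the Parameter here only stores a shape, and Linear is a hypothetical minimal layer that registers a weight and a bias without implementing forward.

```python
from typing import Iterator


class Parameter:
    """Stand-in for the real Parameter (assumption: shape only, no data)."""

    def __init__(self, shape: tuple) -> None:
        self.shape = shape


class Module:
    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value) -> None:
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self) -> Iterator[Parameter]:
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


class Linear(Module):
    """Hypothetical minimal Linear: registers a weight and a bias, no forward."""

    def __init__(self, n_in: int, n_out: int) -> None:
        super().__init__()
        self.weight = Parameter((n_out, n_in))
        self.bias = Parameter((n_out,))


class Block(Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc = Linear(10, 10)


class Net(Module):
    def __init__(self) -> None:
        super().__init__()
        self.block1 = Block()
        self.block2 = Block()


shapes = [p.shape for p in Net().parameters()]
print(shapes)  # [(10, 10), (10,), (10, 10), (10,)]
```

Net itself owns no parameters directly; all four come from two levels down, which only works because each level delegates to its children.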
Why is zero_grad() p.grad = None?¶
Lazy convention from Phase 8. Setting to None causes the next _backward to allocate fresh. The alternative (np.zeros_like) is a memory-eager mode useful in some long-running training loops; we'll switch to it in Phase 18 if profiling shows allocation overhead matters.
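The lazy convention can be sketched as follows. This assumes gradients accumulate by allocate-on-first-write; the `accumulate_grad` helper is hypothetical and stands in for whatever the Phase 8 `_backward` closures actually do, and plain lists stand in for arrays:

```python
class Tensor:
    """Stand-in Tensor (assumption: grad is a plain list or None)."""

    def __init__(self, data) -> None:
        self.data = list(data)
        self.grad = None

    def accumulate_grad(self, g) -> None:
        # Hypothetical helper mirroring what a _backward closure would do.
        if self.grad is None:
            self.grad = list(g)  # first contribution: allocate fresh
        else:
            self.grad = [a + b for a, b in zip(self.grad, g)]


p = Tensor([1.0, 2.0])
p.accumulate_grad([0.5, 0.5])
p.accumulate_grad([0.5, 0.5])
first_step = p.grad  # [1.0, 1.0]

p.grad = None  # what zero_grad() does per parameter
p.accumulate_grad([0.25, 0.25])
second_step = p.grad  # [0.25, 0.25], no leftover from the first step

print(first_step, second_step)
```

Under this scheme zero_grad() never touches memory itself; the cost of a new gradient buffer is paid only when a backward pass actually writes to it.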
Why does forward() raise NotImplementedError?¶
It forces subclasses to define a forward. The pattern is:
class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)

    def forward(self, x):
        return self.fc2(self.fc1(x).relu())
model(x) calls model.__call__(x) which calls model.forward(x). PyTorch does this so it can intercept __call__ for hooks (Phase 18). Phase 9's Module doesn't need hooks, but we keep the indirection for API parity.
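The dispatch can be checked in isolation. This sketch keeps only forward and __call__ from Module; Double and Forgot are hypothetical subclasses used purely for illustration:

```python
class Module:
    """Only the two methods relevant to dispatch; registration is omitted."""

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class Double(Module):
    """Hypothetical subclass that defines forward."""

    def forward(self, x):
        return 2 * x


class Forgot(Module):
    """Hypothetical subclass that forgot to define forward."""


result = Double()(21)  # __call__ -> forward

try:
    Forgot()(21)
    raised = False
except NotImplementedError:
    raised = True

print(result, raised)
```

A subclass without forward fails loudly on the first call rather than silently returning None.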
state_dict and load_state_dict — the serialization contract¶
The Phase 9 spec doesn't mandate full checkpointing, but the API surface should anticipate it. Sketch:
import numpy as np


class Module:
    def state_dict(self, prefix: str = "") -> dict[str, np.ndarray]:
        """Return a flat dict of named parameter arrays."""
        out = {}
        for name, p in self._parameters.items():
            out[prefix + name] = p.data
        for name, m in self._modules.items():
            out.update(m.state_dict(prefix + name + "."))
        return out

    def load_state_dict(self, state: dict[str, np.ndarray], prefix: str = "") -> None:
        """Load arrays into our parameters (by name)."""
        for name, p in self._parameters.items():
            if prefix + name in state:
                p.data[...] = state[prefix + name]  # in-place copy into the Parameter's array
        for name, m in self._modules.items():
            m.load_state_dict(state, prefix + name + ".")
Key choices:
- Flat dot-separated keys ("fc1.weight", not nested dicts). Matches PyTorch.
- In-place data copy in load_state_dict. Preserves the Parameter object identity, so the optimizer's reference to that Parameter still works after loading.
- Serialization format. Pickle is the spec antigoal (§5.1 of LYNX_CORTEX.md). Use safetensors from day one for actual disk persistence. The state_dict() method returns a plain dict of NumPy arrays; safetensors.numpy.save_file(model.state_dict(), "path.safetensors") does the I/O. Loading: model.load_state_dict(safetensors.numpy.load_file("path.safetensors")).
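The first two choices can be exercised without touching disk. This is a sketch under simplifying assumptions: Parameter data is a flat Python list rather than an ndarray (so the in-place copy is written `data[:] = ...` instead of `data[...] = ...`), and Layer and Model are hypothetical classes.

```python
class Parameter:
    """Stand-in Parameter (assumption: data is a flat list, not an ndarray)."""

    def __init__(self, data) -> None:
        self.data = list(data)


class Module:
    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value) -> None:
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def state_dict(self, prefix: str = "") -> dict:
        out = {prefix + n: p.data for n, p in self._parameters.items()}
        for n, m in self._modules.items():
            out.update(m.state_dict(prefix + n + "."))
        return out

    def load_state_dict(self, state: dict, prefix: str = "") -> None:
        for n, p in self._parameters.items():
            if prefix + n in state:
                p.data[:] = state[prefix + n]  # in-place: same list object
        for n, m in self._modules.items():
            m.load_state_dict(state, prefix + n + ".")


class Layer(Module):  # hypothetical leaf module
    def __init__(self) -> None:
        super().__init__()
        self.weight = Parameter([0.0, 0.0])


class Model(Module):  # hypothetical container
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = Layer()


model = Model()
held_by_optimizer = model.fc1.weight  # reference taken before loading
model.load_state_dict({"fc1.weight": [1.5, -2.0]})

print(list(model.state_dict()))  # ['fc1.weight']
print(held_by_optimizer.data)    # [1.5, -2.0]
```

The reference captured before loading sees the new values, which is what an optimizer holding a list of Parameters depends on.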
The __repr__ for debugging¶
def __repr__(self) -> str:
    lines = [self.__class__.__name__ + "("]
    for name, m in self._modules.items():
        # Indent the nested repr by two spaces per level.
        sub_repr = repr(m).replace("\n", "\n  ")
        lines.append(f"  ({name}): {sub_repr}")
    for name, p in self._parameters.items():
        lines.append(f"  ({name}): Parameter(shape={p.shape})")
    lines.append(")")
    return "\n".join(lines)
Calling print(model) should produce:
TenseMLP(
  (fc1): Linear(
    (weight): Parameter(shape=(16, 23))
    (bias): Parameter(shape=(16,))
  )
  (fc2): Linear(
    (weight): Parameter(shape=(5, 16))
    (bias): Parameter(shape=(5,))
  )
)
Useful for sanity-checking parameter shapes before training. PyTorch's print(model) looks essentially identical.
Edge cases to test¶
- A Module with no parameters. Module().parameters() yields nothing. Doesn't raise.
- A Module whose forward never calls any submodule (degenerate). parameters() still yields the registered ones.
- A Module that registers the same Parameter twice under different names (shared parameters, e.g. tied embeddings). parameters() yields the same object twice. This differs from PyTorch, whose parameters() deduplicates shared parameters by identity, and it's a footgun: an optimizer that iterates the result applies the update twice. We document this; Phase 17 may add a deduplication step.
- A list of Parameters. self.weights = [Parameter(...), Parameter(...)] does not register either parameter (the list isn't a Parameter or Module). To fix this, PyTorch has ParameterList. Phase 9 documents the gotcha and defers ParameterList to Phase 14 (if it's needed for multi-head attention).
- Reassigning to None. self.fc1 = None does not raise, but it un-registers the previous module (the pop happens). Subtle behavior; test it.
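Several of these cases fit in one small test. The sketch below uses a bare Parameter stand-in and a hypothetical Tied module; `unique_parameters` is an illustrative helper showing one way a later deduplication step could work, not an existing part of the framework.

```python
from typing import Iterator


class Parameter:
    """Stand-in marker class (assumption: the real one subclasses Tensor)."""


class Module:
    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value) -> None:
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self) -> Iterator[Parameter]:
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


def unique_parameters(module: Module) -> list:
    """Hypothetical helper: drop repeats by identity, keep first-seen order."""
    seen, out = set(), []
    for p in module.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            out.append(p)
    return out


class Tied(Module):
    def __init__(self) -> None:
        super().__init__()
        self.embed = Parameter()
        self.unembed = self.embed    # same object, second name
        self.extras = [Parameter()]  # plain list: not registered
        self.head = Module()
        self.head = None             # un-registers the submodule


m = Tied()
print(len(list(m.parameters())))  # 2: the shared Parameter appears twice
print(len(unique_parameters(m)))  # 1
print(m._modules)                 # {}
```

The Parameter inside the list is invisible to parameters(), so an optimizer built from it would never update that weight.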
The hashable-by-identity invariant (from Phase 8)¶
Parameter inherits __hash__ from Tensor (default identity). The optimizer uses id(p) implicitly via Python's set/dict semantics. Do not override Parameter.__eq__ for elementwise comparison: in Python, a class that defines __eq__ without also defining __hash__ becomes unhashable, which would break parameters() collection if you ever store them in a set.
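The Python rule behind this invariant can be demonstrated directly. Both classes below are hypothetical stand-ins, assuming the Phase 8 Tensor leaves __eq__ and __hash__ at their object defaults:

```python
class IdentityTensor:
    """Default object semantics: hash and equality by identity."""


class ElementwiseEqTensor:
    """Hypothetical variant that overrides __eq__ without defining __hash__."""

    def __init__(self, data) -> None:
        self.data = list(data)

    def __eq__(self, other):
        # Elementwise comparison returns a list, not a bool.
        return [a == b for a, b in zip(self.data, other.data)]


a, b = IdentityTensor(), IdentityTensor()
distinct = len({a, b, a})  # 2: distinct objects stay distinct, repeats collapse

try:
    {ElementwiseEqTensor([1, 2])}
    hashable = True
except TypeError:
    hashable = False

print(distinct, hashable)  # 2 False
```

Defining __eq__ makes Python set __hash__ to None on that class, so the second set construction raises TypeError before any comparison happens.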
Pitfalls (will bite in lab)¶
- Forgetting super().__init__(). It must be the first line of every Module subclass __init__. Without it, _parameters doesn't exist, and the first self.W = Parameter(...) crashes.
- Assigning a plain Tensor thinking it'll be a parameter. It won't: __setattr__ checks isinstance(value, Parameter). Test: model = MyModel(); list(model.parameters()) should have exactly the expected count.
- Capturing the iterator instead of the list. params = model.parameters() is a generator; once exhausted, it's empty. The optimizer accepts a list. Cast: optim = SGD(list(model.parameters()), lr=0.01).
- Re-creating a layer in forward(). def forward(self, x): self.fc = Linear(...); return self.fc(x) creates a fresh Linear (with random weights) on every forward pass. A Parameter is registered, but it is a new one each time. Training never works. It is a hard bug to spot because the loss does change between forward passes, just to random values.
- Calling model.forward(x) instead of model(x). Works in Phase 9 (no hooks). Will silently bypass hooks in Phase 18.
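Two of these pitfalls, the forgotten super().__init__() and the exhausted generator, are easy to reproduce. The sketch uses a bare Parameter stand-in and two hypothetical subclasses, Good and Forgetful:

```python
from typing import Iterator


class Parameter:
    """Stand-in marker class (assumption: the real one subclasses Tensor)."""


class Module:
    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value) -> None:
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self) -> Iterator[Parameter]:
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


class Good(Module):
    def __init__(self) -> None:
        super().__init__()
        self.W = Parameter()


class Forgetful(Module):
    def __init__(self) -> None:
        self.W = Parameter()  # no super().__init__(): _parameters is missing


params = Good().parameters()
first_pass = list(params)   # one Parameter
second_pass = list(params)  # empty: the generator is already exhausted

try:
    Forgetful()
    crashed = False
except AttributeError:
    crashed = True

print(len(first_pass), len(second_pass), crashed)  # 1 0 True
```

An optimizer handed the raw generator would see the parameters once and then an empty sequence on any later iteration, which is why the list() cast matters.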
Topic anchor (§A13)¶
The Phase 9 capstone TenseMLP(Module) has exactly two submodules (fc1, fc2) and four parameters total. model.parameters() must yield them in this order: fc1.weight, fc1.bias, fc2.weight, fc2.bias. Verify with a test before training. If the order is wrong, the optimizer state (m, v for Adam) is associated with the wrong parameter — Adam still "trains" but the moment estimates are misaligned. Hard-to-diagnose silent bug.
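One way to write that ordering test, as a sketch: the Parameter stand-in stores only a shape, Linear is a hypothetical minimal layer, TenseMLP omits forward, and `check_parameter_order` is an illustrative helper name.

```python
from typing import Iterator


class Parameter:
    """Stand-in for the real Parameter (assumption: shape only, no data)."""

    def __init__(self, shape: tuple) -> None:
        self.shape = shape


class Module:
    def __init__(self) -> None:
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value) -> None:
        self._parameters.pop(name, None)
        self._modules.pop(name, None)
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self) -> Iterator[Parameter]:
        yield from self._parameters.values()
        for submodule in self._modules.values():
            yield from submodule.parameters()


class Linear(Module):
    """Hypothetical minimal Linear: weight is registered before bias."""

    def __init__(self, n_in: int, n_out: int) -> None:
        super().__init__()
        self.weight = Parameter((n_out, n_in))
        self.bias = Parameter((n_out,))


class TenseMLP(Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = Linear(23, 16)
        self.fc2 = Linear(16, 5)


def check_parameter_order(model: TenseMLP) -> None:
    """Compare by identity so equal-shaped tensors cannot mask a swap."""
    expected = [model.fc1.weight, model.fc1.bias, model.fc2.weight, model.fc2.bias]
    actual = list(model.parameters())
    assert len(actual) == len(expected), "wrong parameter count"
    for got, want in zip(actual, expected):
        assert got is want, "parameters() order differs from registration order"


model = TenseMLP()
check_parameter_order(model)
shapes = [p.shape for p in model.parameters()]
print(shapes)  # [(16, 23), (16,), (5, 16), (5,)]
```

The order falls out of dict insertion order: each module yields its own parameters in assignment order, then visits submodules in assignment order.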
What this page does NOT cover¶
- Module.train() / eval() modes. Stubbed as a no-op in Lab 00; activated in Phase 10.
- Buffers (non-learnable persistent state). Phase 11.
- Hooks (register_forward_hook). Phase 18.
One-paragraph recap¶
Parameter is a marker subclass of Tensor. Module.__init__ initializes two dicts (_parameters and _modules) via object.__setattr__, bypassing the overridden __setattr__ that expects those dicts to exist. Module.__setattr__ inspects every assigned value: Parameters land in _parameters, Modules in _modules, anything else is plain. parameters() walks the two dicts recursively. zero_grad, state_dict, load_state_dict, and __repr__ all use the same walk. ~25 lines of code; the rest of the framework (Linear, Sequential, optimizers) sits cleanly on top.