02 - Linear and Sequential: the simplest possible layers
Linear fits in about 20 lines and is the only "layer" the curriculum needs until Phase 13. Sequential fits in 10 more lines and gives us ergonomic composition. The only real decision in Linear is the weight initialization: Kaiming/He for pre-activations that will pass through ReLU; Xavier for tanh/sigmoid. Phase 9 implements a minimal version; Phase 10 derives it in detail.
Linear(in_features, out_features): the affine layer
The math is one line: y = x @ W.T + b (or y = x @ W + b — choose a convention and stick to it).
The class:
import math

import numpy as np

# Module, Parameter, and Tensor are the course's own classes, defined outside this page.

class Linear(Module):
    """Affine layer: y = x @ W.T + b."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super().__init__()
        # Uniform init in ±1/sqrt(in_features). This equals Kaiming-uniform with
        # a = sqrt(5), the bound PyTorch's nn.Linear uses (Phase 10 derives the magnitude).
        bound = 1.0 / math.sqrt(in_features)
        self.weight = Parameter(
            np.random.uniform(-bound, bound, size=(out_features, in_features))
        )
        if bias:
            self.bias = Parameter(np.zeros(out_features))
        else:
            self.bias = None  # plain attribute, not a Parameter, so it stays out of self._parameters

    def forward(self, x: Tensor) -> Tensor:
        # x: (B, in_features) or (..., in_features)
        # Returns: (B, out_features) or (..., out_features)
        y = x @ self.weight.transpose((1, 0))  # (B, out)
        if self.bias is not None:
            y = y + self.bias  # broadcast (out,) + (B, out) → (B, out)
        return y
Twenty lines. Decisions baked in:
- PyTorch convention (out, in) for weight shape, transposed in forward. Reason: it makes Linear's state_dict entries ("weight", "bias") match PyTorch's in name and shape, so checkpoints transfer (Phase 18). The alternative is (in, out) storage with no transpose, which saves an op but breaks PyTorch interop.
- Bias defaults to zero, not random. Standard practice: biases initialized to zero are fine because the symmetry is broken by the random weights. Phase 10 confirms.
- bias: bool flag. Optional bias is useful where the bias adds little, for example the last layer of a softmax classifier (softmax is invariant to adding the same constant to every logit, so only the differences between bias entries matter). Phase 17's transformer has biased and bias-less linears.
- Init magnitude 1/sqrt(in_features). A Kaiming-style uniform bound that scales with fan-in. It matches PyTorch's nn.Linear default (Kaiming-uniform with a = sqrt(5)); the pure ReLU bound with a = 0 is the larger sqrt(6 / in_features). The full derivation lives in Phase 10's theory/01-initialization.md; here we use the result as a sensible default.
Why does Linear accept (..., in_features)?
The forward uses @ which is NumPy's matmul. matmul broadcasts the leading dims: (B, T, D) @ (D, H) → (B, T, H). So Linear works on any rank-≥1 input, treating the last axis as the feature axis. This is critical for transformers (Phase 17 where input is (B, T, D)) — we get the right behavior for free.
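The broadcasting claim is easy to check on plain NumPy arrays. The sketch below is not the course's Linear class; it assumes a bare weight array W of shape (out, in) and a helper named affine (both hypothetical) and applies the same formula to rank-1, rank-2, and rank-3 inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch, sequence length, input features, output features.
B, T, D, H = 4, 7, 23, 16
W = rng.uniform(-0.2, 0.2, size=(H, D))  # stored as (out, in), like Linear.weight
b = np.zeros(H)

def affine(x):
    """y = x @ W.T + b on the last axis; leading axes pass through unchanged."""
    return x @ W.T + b

x1 = rng.normal(size=(D,))       # rank 1: a single example
x2 = rng.normal(size=(B, D))     # rank 2: a batch
x3 = rng.normal(size=(B, T, D))  # rank 3: a batch of sequences

shapes = [affine(x).shape for x in (x1, x2, x3)]
print(shapes)  # [(16,), (4, 16), (4, 7, 16)]

# The rank-3 result equals applying the rank-2 computation to each sequence.
per_sequence = np.stack([affine(x3[i]) for i in range(B)])
same = np.allclose(affine(x3), per_sequence)
```

Only the last axis has to equal in_features; every leading axis is treated as a batch axis.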
Parameter count check
Linear(23, 16) has 23 · 16 + 16 = 384 parameters. Linear(16, 5) has 16 · 5 + 5 = 85. Total for TenseMLP: 469. At FP32 that is 469 · 4 = 1,876 bytes, under 2 KB. Sanity-checking parameter counts becomes a reflex from Phase 9 onward.
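The same arithmetic as a small helper, in case you want to run the check for other layer sizes. The function name linear_param_count is hypothetical and not part of the course code.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights (out x in) plus one bias per output unit when bias is enabled."""
    return in_features * out_features + (out_features if bias else 0)

layers = [(23, 16), (16, 5)]  # the two Linear layers of TenseMLP
counts = [linear_param_count(i, o) for i, o in layers]
total = sum(counts)
fp32_bytes = total * 4        # 4 bytes per FP32 parameter
print(counts, total, fp32_bytes)  # [384, 85] 469 1876
```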
Sequential(m1, m2, ...): composition
class Sequential(Module):
    """Apply modules in order. Auto-registers each module."""

    def __init__(self, *modules: Module) -> None:
        super().__init__()
        for i, m in enumerate(modules):
            # Register each module under its index as a string.
            setattr(self, str(i), m)

    def forward(self, x):
        for i in range(len(self._modules)):
            x = self._modules[str(i)](x)
        return x
Ten lines. Notes:
- Modules are registered under string indices ("0", "1", ...). PyTorch does this too: the first submodule's weight appears under the key "0.weight". State dicts look like "0.weight", "0.bias", "2.weight", ... (with activations at indices 1 and 3 contributing no parameters).
- Forward iterates over _modules in registration order. Python 3.7+ dicts preserve insertion order, so this works without an explicit OrderedDict.
- No forward(self, *args). Sequential chains a single tensor through. Multi-input layers don't fit; use a custom Module subclass for those.
Composition example:
mlp = Sequential(
    Linear(23, 16),
    ReLU(),
    Linear(16, 5),
)
# mlp.parameters() yields: 0.weight, 0.bias, 2.weight, 2.bias (4 tensors)
# mlp(x) computes Linear(ReLU(Linear(x)))
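To see where the string-index names come from, here is a self-contained sketch of the registration mechanism. The Module and Parameter classes below are hypothetical stand-ins written for this example (the course's real base class is defined elsewhere and may differ), and the weights are zero only because their values are irrelevant to naming.

```python
import numpy as np

class Parameter:
    """Hypothetical stand-in: just wraps an array."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

class Module:
    """Hypothetical minimal base class that registers children on assignment."""
    def __init__(self):
        # Create the registries without going through our own __setattr__.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def named_parameters(self, prefix=""):
        # Own parameters first, then children in registration order.
        for name, p in self._parameters.items():
            yield prefix + name, p
        for name, m in self._modules.items():
            yield from m.named_parameters(prefix + name + ".")

class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Parameter(np.zeros((out_features, in_features)))
        self.bias = Parameter(np.zeros(out_features))

class ReLU(Module):
    pass  # no parameters

class Sequential(Module):
    def __init__(self, *modules):
        super().__init__()
        for i, m in enumerate(modules):
            setattr(self, str(i), m)  # registered under "0", "1", "2", ...

mlp = Sequential(Linear(23, 16), ReLU(), Linear(16, 5))
names = [name for name, _ in mlp.named_parameters()]
print(names)  # ['0.weight', '0.bias', '2.weight', '2.bias']
```

Index 1 is missing from the names because ReLU registers no parameters.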
Activation modules
Thin wrappers around Tensor methods. Each has no parameters.
class ReLU(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.relu()

class Tanh(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.tanh()

class Sigmoid(Module):
    def forward(self, x: Tensor) -> Tensor:
        return Tensor(1.0) / (Tensor(1.0) + (-x).exp())

class GELU(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.gelu()
Three lines each. Why modules at all (not just functions)?
- Sequential requires Modules. Functions don't have .parameters() and don't fit into _modules.
- PyTorch parity. Users coming from PyTorch expect nn.ReLU() to work.
For non-Sequential use, x.relu() directly is equivalent and shorter. Both styles coexist.
Initialization: the minimum
Phase 9 implements two init helpers in nn/init.py:
import math

import numpy as np

def kaiming_uniform_(tensor: Tensor, a: float = 0.0) -> None:
    """Kaiming-uniform initialization for ReLU networks."""
    fan_in = tensor.data.shape[1]  # weights are stored as (out, in)
    gain = math.sqrt(2.0 / (1 + a ** 2))
    bound = gain * math.sqrt(3.0 / fan_in)
    tensor.data[...] = np.random.uniform(-bound, bound, size=tensor.shape)

def xavier_normal_(tensor: Tensor, gain: float = 1.0) -> None:
    """Xavier-normal initialization for tanh/sigmoid networks."""
    fan_in, fan_out = tensor.data.shape[1], tensor.data.shape[0]
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    tensor.data[...] = np.random.normal(0.0, std, size=tensor.shape)
In-place (tensor.data[...] = ...) by convention — these helpers mutate the passed-in tensor, matching PyTorch's _ suffix. The lab implements Linear with kaiming_uniform_ as default.
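A quick numerical check of what these two formulas produce, using plain NumPy arrays rather than the course's Tensor. The sizes fan_in = 256 and fan_out = 512 are arbitrary choices for illustration. A uniform distribution on (-b, b) has variance b² / 3, so the Kaiming bound with a = 0 should give weight variance 2 / fan_in.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 512  # arbitrary illustrative sizes

# Kaiming-uniform bound with a = 0 (plain ReLU), same formula as kaiming_uniform_.
gain = math.sqrt(2.0)
bound = gain * math.sqrt(3.0 / fan_in)
w = rng.uniform(-bound, bound, size=(fan_out, fan_in))

target_var = 2.0 / fan_in        # what the bound is designed to hit
analytic_var = bound ** 2 / 3.0  # variance of U(-bound, bound)
empirical_var = float(w.var())

# Xavier-normal std with gain = 1, same formula as xavier_normal_.
std = math.sqrt(2.0 / (fan_in + fan_out))
v = rng.normal(0.0, std, size=(fan_out, fan_in))
empirical_std = float(v.std())

print(target_var, analytic_var, empirical_var)
print(std, empirical_std)
```

The analytic variance matches 2 / fan_in exactly, and the sampled values land close to it.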
Why does init magnitude matter?
If every weight is too large: pre-activations have variance ≫ 1. ReLU does not saturate on the positive side, so activations grow layer by layer and the gradient explodes through many layers. The loss typically goes to NaN within the first few steps.
If every weight is too small: pre-activations have variance ≪ 1, ReLU produces tiny outputs, and gradients vanish through the layers. The loss stays nearly constant for many steps.
The Kaiming derivation (Phase 10) is: "for a ReLU layer, weight variance 2 / fan_in keeps preactivation variance equal to input variance". The factor of 2 corrects for ReLU killing half the outputs in expectation.
Phase 9 uses Kaiming as the default because every layer in the §A13 MLP is followed by a ReLU (except the output). Phase 10 expands; Phase 17 uses Xavier for tanh-internal transformer layers.
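The three regimes can be observed without any training. The sketch below is an illustration on plain NumPy arrays, not course code: it pushes a random batch through a stack of bias-free ReLU layers and reports the standard deviation of the final activations. The depth, width, and scale factors are arbitrary choices, and the helper name final_activation_std is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_std(scale: float, depth: int = 20, width: int = 256) -> float:
    """Return the std of activations after `depth` ReLU layers.

    Weights are drawn from N(0, scale**2 * 2 / width), so scale = 1.0 is the
    Kaiming variance 2 / fan_in.
    """
    h = rng.normal(size=(512, width))
    std = scale * np.sqrt(2.0 / width)
    for _ in range(depth):
        w = rng.normal(0.0, std, size=(width, width))
        h = np.maximum(h @ w.T, 0.0)  # Linear (no bias) followed by ReLU
    return float(h.std())

too_small = final_activation_std(0.5)  # weight std halved
kaiming = final_activation_std(1.0)    # variance 2 / fan_in
too_large = final_activation_std(2.0)  # weight std doubled
print(too_small, kaiming, too_large)
```

Halving or doubling the weight standard deviation changes the per-layer variance by a factor of 4, which compounds over 20 layers into many orders of magnitude, while the Kaiming scale stays near order one.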
The training loop using Module
Putting it together:
model = Sequential(
    Linear(23, 16),
    ReLU(),
    Linear(16, 5),
)
optim = SGD(list(model.parameters()), lr=0.05)

for epoch in range(n_epochs):
    for x_batch, y_batch in train_batches:
        optim.zero_grad()
        logits = model(x_batch)
        loss = cross_entropy(logits, y_batch)
        loss.backward()
        optim.step()
The inner loop is five lines (zero_grad, forward, loss, backward, step), the same shape that PyTorch-style training code uses everywhere. This is what Phase 9 delivers.
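For comparison, here is a sketch of what those steps do for the same 23 → 16 → 5 architecture when the gradients are written out by hand in NumPy instead of coming from autograd. Everything in it is an assumption made for illustration: the synthetic data, the normal-distributed init, and the helper name forward_backward. There is no zero_grad step because the gradients are recomputed from scratch on every iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 examples, 23 features, 5 classes, labelled by a
# random linear rule so that the task is learnable.
X = rng.normal(size=(200, 23))
y = np.argmax(X @ rng.normal(size=(23, 5)), axis=1)

# Parameters of a 23 -> 16 -> 5 MLP, weights stored as (out, in).
W1 = rng.normal(0.0, np.sqrt(2.0 / 23), size=(16, 23))
b1 = np.zeros(16)
W2 = rng.normal(0.0, np.sqrt(2.0 / 16), size=(5, 16))
b2 = np.zeros(5)
params = [W1, b1, W2, b2]
lr = 0.05

def forward_backward(X, y):
    """Return (loss, grads) for softmax cross-entropy, with hand-derived gradients."""
    n = X.shape[0]
    z1 = X @ W1.T + b1
    h = np.maximum(z1, 0.0)
    logits = h @ W2.T + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean()

    dlogits = probs.copy()
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    dW2 = dlogits.T @ h
    db2 = dlogits.sum(axis=0)
    dz1 = (dlogits @ W2) * (z1 > 0)  # ReLU passes gradient only where z1 > 0
    dW1 = dz1.T @ X
    db1 = dz1.sum(axis=0)
    return float(loss), [dW1, db1, dW2, db2]

losses = []
for step in range(300):
    loss, grads = forward_backward(X, y)  # forward, loss, and backward in one call
    for p, g in zip(params, grads):
        p -= lr * g                       # plain SGD update, in place
    losses.append(loss)
print(losses[0], losses[-1])
```

The Module and optimizer abstractions replace the hand-written gradient code and the manual parameter list; the update rule itself is unchanged.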
Pitfalls (will bite in lab)
- Sequential with a non-Module item. Sequential(Linear(2, 3), lambda x: x.relu()) must raise in __init__ (a lambda isn't a Module). The error message should be loud and clear.
- Linear.weight.shape wrong. It is (out_features, in_features), not (in, out). Swapping the two in __init__ produces a matmul dimension mismatch at the first forward. Test: Linear(3, 5).weight.shape == (5, 3).
- Forgetting super().__init__() in a custom Module. It must be the first line of every __init__. Without it, the first self.X = Parameter(...) crashes because _parameters doesn't exist yet.
- Initializing weights as np.zeros. All neurons compute the same thing; gradients are identical; training never breaks symmetry. The loss decreases briefly (the bias still trains) then plateaus. This is a hard bug to spot because nothing crashes; training is just stuck at high loss.
- Sigmoid numerical stability. (-x).exp() for very negative x (say -1000) overflows. PyTorch uses a stable form. Phase 9's Sigmoid uses the naive form because we don't need extreme inputs in the grammar task; Phase 18 (training tricks) revisits.
- In-place mutation of tensor.data. The init helpers' tensor.data[...] = ... assignments are fine because they happen before the tensor enters any graph. Mutating data after the tensor has been used in a forward pass is forbidden (it breaks the DAG, Phase 8's anti-goal).
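The sigmoid overflow is easy to reproduce in NumPy. The sketch below compares the naive formula with one common stable formulation that only ever exponentiates a non-positive number. It is an illustration on plain arrays, not the course's Sigmoid module, and it is not a claim about how PyTorch implements sigmoid internally.

```python
import numpy as np

def sigmoid_naive(x: np.ndarray) -> np.ndarray:
    """1 / (1 + exp(-x)); exp(-x) overflows for very negative x."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_stable(x: np.ndarray) -> np.ndarray:
    """Piecewise form: exp is only applied to -|x|, which never overflows."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in [0, 1]
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

x = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])

# Turn floating-point overflow into an exception so it cannot pass silently.
with np.errstate(over="raise"):
    try:
        sigmoid_naive(x)
        naive_overflowed = False
    except FloatingPointError:
        naive_overflowed = True
    stable = sigmoid_stable(x)  # no overflow here
print(naive_overflowed, stable)
```

With NumPy's default settings the naive form only emits a warning and still returns 0 for x = -1000, which is why the problem tends to surface later, in the gradient, rather than in the forward value.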
Topic anchor (§A13)
Phase 9's TenseMLP is:
class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)  # one-hot(verb) ⊕ one-hot(person)
        self.fc2 = Linear(16, 5)   # logits over 5 tenses

    def forward(self, x):
        h = self.fc1(x).relu()
        return self.fc2(h)
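Equivalent in Sequential form (the same composition used in the training loop on this page):
model = Sequential(
    Linear(23, 16),
    ReLU(),
    Linear(16, 5),
)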
Both train identically. Borja picks one in Lab 03; the lab's reflection question asks which feels more natural and why.
What this page does NOT cover
- nn.Embedding. Phase 11 (embeddings). For now, we represent verbs/persons as one-hot vectors.
- nn.BatchNorm, nn.LayerNorm. Phase 10.
- nn.Dropout. Phase 18.
- Multi-input forward (e.g., forward(x, mask)). Phase 14 (transformer attention).
- The Kaiming derivation in detail. Phase 10's theory/01-initialization.md.
One-paragraph recap
Linear(in, out) is ~20 lines: it stores weight: (out, in) and bias: (out,) as Parameters and computes x @ weight.T + bias in forward. Sequential(*modules) is ~10 lines: it registers each module under a string index and chains them in forward. Activations are 3-line Module wrappers around Tensor methods. Initialization defaults to Kaiming-uniform (Phase 10 derives). With these pieces, a complete MLP fits in 7 lines and the training loop fits in 5 lines. Phase 9's value-add is the ergonomics, not new math.
Next: 03-optimizers.md