02 — Linear and Sequential: the simplest possible layers

Linear fits in 15 lines and is the only "layer" the curriculum needs until Phase 13. Sequential fits in 10 more lines and gives us ergonomic composition. The only real decision in Linear is weight initialization: Kaiming/He for pre-activations that will pass through ReLU; Xavier for tanh/sigmoid. Phase 9 implements a minimal version; Phase 10 derives it in detail.


Linear(in_features, out_features): the affine layer

The math is one line: y = x @ W.T + b (or y = x @ W + b — choose a convention and stick to it).

The class:

import math
import numpy as np

class Linear(Module):
    """Affine layer: y = x @ W.T + b."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super().__init__()
        # Uniform init, bound 1/sqrt(fan_in): PyTorch's nn.Linear default (Phase 10 derives the magnitude).
        bound = 1.0 / math.sqrt(in_features)
        self.weight = Parameter(
            np.random.uniform(-bound, bound, size=(out_features, in_features))
        )
        if bias:
            self.bias = Parameter(np.zeros(out_features))
        else:
            self.bias = None  # registered as a non-Parameter, so not in self._parameters

    def forward(self, x: Tensor) -> Tensor:
        # x: (B, in_features) or (..., in_features)
        # Returns: (B, out_features) or (..., out_features)
        y = x @ self.weight.transpose((1, 0))   # (B, out)
        if self.bias is not None:
            y = y + self.bias                     # broadcast (B, out) + (out,) → (B, out)
        return y

Twenty lines. Decisions baked in:

  1. PyTorch convention (out, in) for weight shape, transposed in forward. Reason: makes Linear's state_dict keys ("weight", "bias") match PyTorch's so checkpoints transfer (Phase 18). Alternative is (in, out) storage with no transpose — saves an op but breaks PyTorch interop.
  2. Bias defaults to zero, not random. Standard practice: biases initialized to zero are fine because the symmetry is broken by the random weights. Phase 10 confirms.
  3. bias: bool flag. Dropping the bias is useful where it is partly or wholly redundant. Before a softmax, any offset common to all classes cancels (softmax's shift invariance), so only the differences between bias entries matter. Phase 17's transformer has biased and bias-less linears.
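  4. Init magnitude 1/sqrt(in_features). This is the uniform bound PyTorch's nn.Linear uses by default (kaiming_uniform_ with a = sqrt(5)); the ReLU-gain Kaiming bound (a = 0) is sqrt(6 / in_features). The full derivation lives in Phase 10's theory/01-initialization.md; here we use the result as a sensible default.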

Why does Linear accept (..., in_features)?

The forward uses @, which is NumPy's matmul. matmul broadcasts the leading dims: (B, T, D) @ (D, H) → (B, T, H). So Linear works on any rank-≥1 input, treating the last axis as the feature axis. This is critical for transformers (Phase 17, where the input is (B, T, D)): we get the right behavior for free.
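
The broadcasting claim can be checked in plain NumPy. This is a minimal sketch, not the course's Tensor class: `affine` is a hypothetical stand-in for Linear.forward, and the sizes B, T, D, H are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D, H = 4, 7, 23, 16            # hypothetical batch, time, in, out sizes
W = rng.standard_normal((H, D))      # stored (out, in), as Linear does
b = np.zeros(H)

def affine(x):
    """NumPy stand-in for Linear.forward: the last axis is the feature axis."""
    return x @ W.T + b

out_1d = affine(rng.standard_normal(D))        # (D,)      -> (H,)
out_2d = affine(rng.standard_normal((B, D)))   # (B, D)    -> (B, H)
x3 = rng.standard_normal((B, T, D))
out_3d = affine(x3)                            # (B, T, D) -> (B, T, H)

# Broadcasting over leading dims equals flattening them, applying, and reshaping back.
flat = affine(x3.reshape(B * T, D)).reshape(B, T, H)
print(out_1d.shape, out_2d.shape, out_3d.shape, np.allclose(out_3d, flat))
```

The last line confirms that the rank-3 case is just the rank-2 case applied to every (batch, time) position.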

Parameter count check

Linear(23, 16) has 23 · 16 + 16 = 384 parameters. Linear(16, 5) has 16 · 5 + 5 = 85. Total for TenseMLP: 469, which is 469 · 4 = 1,876 bytes at FP32 and fits comfortably in 4 KB. Sanity-checking parameter counts becomes a reflex from Phase 9 onward.
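
The same arithmetic as a reusable check. `linear_param_count` is a hypothetical helper written for this illustration, not part of the course code.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights (out * in) plus one bias per output feature, if enabled."""
    return in_features * out_features + (out_features if bias else 0)

fc1 = linear_param_count(23, 16)
fc2 = linear_param_count(16, 5)
total = fc1 + fc2
fp32_bytes = total * 4               # 4 bytes per FP32 value
print(fc1, fc2, total, fp32_bytes)   # 384 85 469 1876
```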

Sequential(m1, m2, ...): composition

class Sequential(Module):
    """Apply modules in order. Auto-registers each module."""

    def __init__(self, *modules: Module) -> None:
        super().__init__()
        for i, m in enumerate(modules):
            # Register each module under its index as a string.
            setattr(self, str(i), m)

    def forward(self, x):
        for i in range(len(self._modules)):
            x = self._modules[str(i)](x)
        return x

Ten lines. Notes:

  • Modules are registered under string indices ("0", "1", ...). PyTorch does this too: the state-dict key "0.weight" is the first submodule's weight (model[0].weight in code, since model.0 is not valid Python). State dicts look like "0.weight", "0.bias", "2.weight", ... (with activations at indices 1 and 3 contributing no parameters).
  • Forward walks the indices 0..n-1, which is registration order. Iterating over _modules directly would work too: Python 3.7+ dicts preserve insertion order, so no explicit OrderedDict is needed.
  • No forward(self, *args). Sequential chains a single tensor through. Multi-input layers don't fit; use a custom Module subclass for those.
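
To see the string-index naming end to end, here is a self-contained sketch. The `Parameter` and `Module` classes below are hypothetical minimal stand-ins (the real ones come from earlier in Phase 9 and may differ), and the `isinstance` check in `Sequential` is an addition for illustration.

```python
import numpy as np

class Parameter:
    """Hypothetical stand-in: just wraps an array."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)

class Module:
    """Hypothetical minimal Module: registers Parameters and sub-Modules on assignment."""
    def __init__(self):
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def named_parameters(self, prefix=""):
        for name, p in self._parameters.items():
            yield prefix + name, p
        for name, m in self._modules.items():
            yield from m.named_parameters(prefix + name + ".")

class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Zeros are placeholders here; only the names matter for this demo.
        self.weight = Parameter(np.zeros((out_features, in_features)))
        self.bias = Parameter(np.zeros(out_features))

class ReLU(Module):
    pass  # no parameters

class Sequential(Module):
    def __init__(self, *modules):
        super().__init__()
        for i, m in enumerate(modules):
            if not isinstance(m, Module):  # assumption: fail loudly at construction
                raise TypeError(f"Sequential item {i} is {type(m).__name__}, not a Module")
            setattr(self, str(i), m)

mlp = Sequential(Linear(23, 16), ReLU(), Linear(16, 5))
keys = [name for name, _ in mlp.named_parameters()]
print(keys)  # ['0.weight', '0.bias', '2.weight', '2.bias']
```

Index 1 is missing from the keys because ReLU registers no parameters.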

Composition example:

mlp = Sequential(
    Linear(23, 16),
    ReLU(),
    Linear(16, 5),
)
# mlp.parameters() yields: 0.weight, 0.bias, 2.weight, 2.bias  (4 tensors; ReLU at index 1 has none)
# mlp(x) computes Linear(ReLU(Linear(x)))

Activation modules

Thin wrappers around Tensor methods. Each has no parameters.

class ReLU(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.relu()

class Tanh(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.tanh()

class Sigmoid(Module):
    def forward(self, x: Tensor) -> Tensor:
        return Tensor(1.0) / (Tensor(1.0) + (-x).exp())

class GELU(Module):
    def forward(self, x: Tensor) -> Tensor:
        return x.gelu()

Three lines each. Why modules at all (not just functions)?

  • Sequential requires Modules. Functions don't have .parameters() and don't fit into _modules.
  • PyTorch parity. Users coming from PyTorch expect nn.ReLU() to work.

For non-Sequential use, x.relu() directly is equivalent and shorter. Both styles coexist.

Initialization: the minimum

Phase 9 implements two init helpers in nn/init.py:

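import math
import numpy as np

def kaiming_uniform_(tensor: Tensor, a: float = 0.0) -> None:
    """Kaiming-uniform initialization for ReLU networks (a = leaky-ReLU negative slope; 0 for plain ReLU)."""
    fan_in = tensor.data.shape[1]  # weights are stored (out, in)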
    gain = math.sqrt(2.0 / (1 + a ** 2))
    bound = gain * math.sqrt(3.0 / fan_in)
    tensor.data[...] = np.random.uniform(-bound, bound, size=tensor.shape)

def xavier_normal_(tensor: Tensor, gain: float = 1.0) -> None:
    """Xavier-normal initialization for tanh/sigmoid networks."""
    fan_in, fan_out = tensor.data.shape[1], tensor.data.shape[0]
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    tensor.data[...] = np.random.normal(0.0, std, size=tensor.shape)

In-place (tensor.data[...] = ...) by convention — these helpers mutate the passed-in tensor, matching PyTorch's _ suffix. The lab implements Linear with kaiming_uniform_ as default.
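
A quick empirical check of the two formulas, as a sketch on raw NumPy arrays rather than the course's Tensor. The layer size is made up. A uniform distribution on (-bound, bound) has variance bound² / 3, so the Kaiming bound should give weight variance 2 / fan_in, while the 1/sqrt(in_features) bound used in the Linear class above gives 1 / (3 · fan_in).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
fan_out, fan_in = 512, 1024          # hypothetical layer, weight shape (out, in)

# Kaiming-uniform with a = 0: gain = sqrt(2), bound = gain * sqrt(3 / fan_in)
bound = math.sqrt(2.0) * math.sqrt(3.0 / fan_in)
w_kaiming = rng.uniform(-bound, bound, size=(fan_out, fan_in))

# Xavier-normal with gain = 1: std = sqrt(2 / (fan_in + fan_out))
std = math.sqrt(2.0 / (fan_in + fan_out))
w_xavier = rng.normal(0.0, std, size=(fan_out, fan_in))

kaiming_var_target = 2.0 / fan_in
xavier_var_target = 2.0 / (fan_in + fan_out)

# Analytic variance of uniform(-1/sqrt(fan_in), 1/sqrt(fan_in))
default_var = (1.0 / math.sqrt(fan_in)) ** 2 / 3.0

print(w_kaiming.var() / kaiming_var_target)   # close to 1
print(w_xavier.var() / xavier_var_target)     # close to 1
print(default_var / kaiming_var_target)       # 1/6
```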

Why does init magnitude matter?

If every weight is too large: pre-activations have variance ≫ 1. ReLU does not saturate on the positive side, so activations grow layer by layer and gradients explode through many layers. The loss is typically NaN within about 10 steps.

If every weight is too small: pre-activations have variance ≪ 1, ReLU produces tiny outputs, gradients vanish through the layers. Loss is constant for many steps.

The Kaiming derivation (Phase 10) is: "for a ReLU layer, weight variance 2 / fan_in keeps preactivation variance equal to input variance". The factor of 2 corrects for ReLU killing half the outputs in expectation.

Phase 9 uses Kaiming as the default because every layer in the §A13 MLP is followed by a ReLU (except the output). Phase 10 expands; Phase 17 uses Xavier for tanh-internal transformer layers.
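
The three regimes (too small, Kaiming, too large) can be seen in a small simulation. This is a sketch under assumptions: a bias-free stack of equal-width ReLU layers with normally distributed weights, standard-normal inputs, and made-up depth and width. `variance_ratio` is a hypothetical helper, not course code.

```python
import numpy as np

def variance_ratio(scale, depth=10, width=512, batch=256, seed=0):
    """Ratio of last-layer to first-layer pre-activation variance.

    Weights are N(0, (scale * sqrt(2 / fan_in))^2); scale = 1.0 is Kaiming.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    std = scale * np.sqrt(2.0 / width)
    first = last = None
    for layer in range(depth):
        W = rng.normal(0.0, std, size=(width, width))
        z = x @ W.T                  # pre-activations
        if layer == 0:
            first = z.var()
        last = z.var()
        x = np.maximum(z, 0.0)       # ReLU
    return last / first

ratio_kaiming = variance_ratio(1.0)  # stays near 1
ratio_small = variance_ratio(0.5)    # shrinks by about 0.25 per layer
ratio_large = variance_ratio(2.0)    # grows by about 4 per layer
print(ratio_kaiming, ratio_small, ratio_large)
```

Halving or doubling the weight standard deviation changes the per-layer variance factor by 4, which compounds to several orders of magnitude after only ten layers.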

The training loop using Module

Putting it together:

model = Sequential(
    Linear(23, 16),
    ReLU(),
    Linear(16, 5),
)
optim = SGD(list(model.parameters()), lr=0.05)

for epoch in range(n_epochs):
    for x_batch, y_batch in train_batches:
        optim.zero_grad()
        logits = model(x_batch)
        loss = cross_entropy(logits, y_batch)
        loss.backward()
        optim.step()

The loop body is five lines (zero the gradients, forward, loss, backward, step), the same pattern nearly all PyTorch-style training code follows. This is what Phase 9 delivers.
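
To make concrete what those five calls do, here is the same 23 → 16 → 5 MLP trained in plain NumPy with the backward pass written by hand. This is a sketch under assumptions, not the course's autograd: the data is synthetic (random inputs, labels from a random linear rule), the init bound sqrt(6 / fan_in) is the ReLU-gain Kaiming bound, and the batch size and epoch count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 random inputs, labels from a random linear rule.
X = rng.standard_normal((200, 23))
y = (X @ rng.standard_normal((5, 23)).T).argmax(axis=1)

def init(out_f, in_f):
    bound = np.sqrt(6.0 / in_f)      # Kaiming-uniform bound for ReLU
    return rng.uniform(-bound, bound, size=(out_f, in_f)), np.zeros(out_f)

W1, b1 = init(16, 23)
W2, b2 = init(5, 16)
lr = 0.05

def forward(x):
    z1 = x @ W1.T + b1
    h = np.maximum(z1, 0.0)
    return z1, h, h @ W2.T + b2

def cross_entropy(logits, targets):
    shifted = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(targets)), targets].mean()
    return loss, np.exp(log_probs)

initial_loss = cross_entropy(forward(X)[2], y)[0]
losses = []
for epoch in range(50):
    for start in range(0, len(X), 20):
        xb, yb = X[start:start + 20], y[start:start + 20]
        # zero_grad(): nothing to do, gradients are recomputed from scratch below
        z1, h, logits = forward(xb)                  # logits = model(x_batch)
        loss, probs = cross_entropy(logits, yb)      # loss = cross_entropy(...)
        # loss.backward(), by hand
        dlogits = probs.copy()
        dlogits[np.arange(len(yb)), yb] -= 1.0
        dlogits /= len(yb)
        dW2, db2 = dlogits.T @ h, dlogits.sum(axis=0)
        dz1 = (dlogits @ W2) * (z1 > 0)
        dW1, db1 = dz1.T @ xb, dz1.sum(axis=0)
        # optim.step(): plain SGD, in place
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    losses.append(cross_entropy(forward(X)[2], y)[0])

print(initial_loss, losses[-1])      # the loss should have gone down
```

Everything between the forward call and the parameter updates is what loss.backward() automates; Module and the optimizer remove the need to name each weight by hand.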

Pitfalls (will bite in lab)

  1. Sequential with a non-Module item. Sequential(Linear(2, 3), lambda x: x.relu()) should raise in __init__ (a lambda isn't a Module), with a loud and clear error message. The ten-line version above has no such check, so add an explicit isinstance test.
  2. Linear.weight.shape wrong. It is (out_features, in_features), not (in, out). Swapping the two in __init__ produces a matmul dimension mismatch at the first forward. Test: Linear(3, 5).weight.shape == (5, 3).
  3. Forgetting super().__init__() in a custom Module. It must be the first statement of every __init__; without it, the first self.X = Parameter(...) crashes because _parameters doesn't exist yet.
  4. Initializing weights as np.zeros. All neurons compute the same thing; gradients are identical; training never breaks symmetry. The loss decreases briefly (the bias still trains) then plateaus. Hard bug to spot because nothing crashes — just stuck at high loss.
  5. Sigmoid numerical stability. (-x).exp() for very negative x (say -1000) overflows. PyTorch uses a stable form. Phase 9's Sigmoid uses the naive form because we don't need extreme inputs in the grammar task; Phase 18 (training tricks) revisits.
  6. In-place mutation of tensor.data. The init helpers' tensor.data[...] = ... writes are fine because they happen before the tensor enters any graph. Mutating data after the tensor has been used in a forward pass is forbidden (it breaks the DAG, Phase 8's anti-goal).
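
For pitfall 5, one common stable formulation picks the branch by sign so that exp is only ever called on non-positive values. This is a NumPy sketch for illustration; it is not Phase 9's Sigmoid and not necessarily how PyTorch implements it.

```python
import numpy as np

def sigmoid_naive(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_stable(x):
    """exp is only evaluated on non-positive arguments, so it cannot overflow."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # x >= 0: exp(-x) <= 1
    ex = np.exp(x[~pos])                        # x < 0:  exp(x) < 1
    out[~pos] = ex / (1.0 + ex)
    return out

x = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])

# Turn overflow warnings into errors to make the difference visible.
with np.errstate(over="raise"):
    try:
        sigmoid_naive(x)
        naive_overflowed = False
    except FloatingPointError:
        naive_overflowed = True
    stable = sigmoid_stable(x)                  # raises nothing

print(naive_overflowed, stable)
```

On moderate inputs the two forms agree; they differ only in whether exp sees a large positive argument.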

Topic anchor (§A13)

Phase 9's TenseMLP is:

class TenseMLP(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear(23, 16)   # one-hot(verb) ⊕ one-hot(person)
        self.fc2 = Linear(16, 5)    # logits over 5 tenses

    def forward(self, x):
        h = self.fc1(x).relu()
        return self.fc2(h)

Equivalent in Sequential form:

mlp = Sequential(
    Linear(23, 16),
    ReLU(),
    Linear(16, 5),
)

Both train identically. Borja picks one in Lab 03; the lab's reflection question asks which feels more natural and why.

What this page does NOT cover

  • nn.Embedding. Phase 11 (embeddings). For now, we represent verbs/persons as one-hot vectors.
  • nn.BatchNorm, nn.LayerNorm. Phase 10.
  • nn.Dropout. Phase 18.
  • Multi-input forward (e.g., forward(x, mask)). Phase 14 (transformer attention).
  • The Kaiming derivation in detail. Phase 10's theory/01-initialization.md.

One-paragraph recap

Linear(in, out) is ~20 lines: it stores weight: (out, in) and bias: (out,) as Parameters and computes x @ weight.T + bias in forward. Sequential(*modules) is ~10 lines: it registers each module under a string index and chains them in forward. Activations are 3-line Module wrappers around Tensor methods. Initialization defaults to Kaiming-uniform (Phase 10 derives). With these pieces, a complete MLP fits in 7 lines and the training loop fits in 5 lines. Phase 9's value-add is the ergonomics, not new math.


Next: 03-optimizers.md