Lab 01 — Linear and activation modules

Goal: implement Linear, ReLU, Tanh, Sigmoid, and Softmax as subclasses of the Module you built in Lab 00. Each is a thin wrapper around minitorch.Tensor ops; together they give you everything needed for the MLP in Lab 03. ~50 LOC by Borja.

Estimated time: 90–120 minutes.

Prereqs: Lab 00 closed (Parameter and Module green). Theory 02 read.


Once Module discovers parameters, a Linear fits in fifteen lines and each activation in four. The only interesting decisions are the initialization (Kaiming-uniform by default; Phase 10 will derive it) and the numerical stability of softmax (subtract the max before exp). Everything else is plumbing on top of Tensor.

What you produce

A new set of files in src/minimodel/nn/:

  • linear.py: Linear(in_features, out_features, bias=True).
  • activations.py: ReLU, Tanh, Sigmoid, Softmax(dim=-1).
  • __init__.py re-exports updated to include the new symbols.

And tests/test_linear_and_activations.py with shape, gradcheck, and state-dict tests.

TODOs

Block A — Linear

In src/minimodel/nn/linear.py:

  • Import math, numpy as np, Tensor from minitorch, and Parameter, Module from .module.
  • Define Linear(Module):
class Linear(Module):
    """Affine layer: y = x @ W.T + b. Stores weight as (out, in) per PyTorch."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Kaiming-uniform default; Phase 10 will derive the magnitude.
        bound = 1.0 / math.sqrt(in_features)
        # TODO: initialize self.weight as a Parameter of shape (out_features, in_features),
        #       uniform in [-bound, +bound].
        # TODO: if bias, set self.bias = Parameter(np.zeros(out_features));
        #       else set self.bias = None (do NOT register as Parameter).
        raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        # Input x has shape (..., in_features). Output has shape (..., out_features).
        # TODO: y = x @ self.weight.transpose((1, 0))
        # TODO: if self.bias is not None, y = y + self.bias
        raise NotImplementedError
  • Note the bias-less case: assigning self.bias = None must not register anything. None is neither a Parameter nor a Module, so Module.__setattr__ should fall through to plain attribute storage.
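
The fall-through rule can be sketched with stand-in classes. ToyParameter, ToyModule, and ToyLinear below are hypothetical and only mimic the registration behaviour described above; the real Module from Lab 00 may store attributes differently.

```python
class ToyParameter:
    """Hypothetical stand-in for Parameter: just wraps a value."""

    def __init__(self, data):
        self.data = data


class ToyModule:
    """Hypothetical stand-in for Module: registers ToyParameter attributes only."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the registry itself.
        object.__setattr__(self, "_parameters", {})

    def __setattr__(self, name, value):
        if isinstance(value, ToyParameter):
            self._parameters[name] = value
        # Every value, including None, is also stored as a plain attribute.
        object.__setattr__(self, name, value)


class ToyLinear(ToyModule):
    def __init__(self, bias=True):
        super().__init__()
        self.weight = ToyParameter([[0.0]])
        self.bias = ToyParameter([0.0]) if bias else None


with_bias = ToyLinear(bias=True)
no_bias = ToyLinear(bias=False)

registered_with_bias = sorted(with_bias._parameters)
registered_no_bias = sorted(no_bias._parameters)
```

In this sketch, dropping the super().__init__() call makes the first assignment fail with AttributeError, because the registry does not exist yet.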

Block B — ReLU, Tanh, Sigmoid

In src/minimodel/nn/activations.py:

  • Each is a one-method Module. No parameters; parameters() yields nothing.
class ReLU(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.relu() (or x.maximum(0) if relu() lives elsewhere).
        raise NotImplementedError


class Tanh(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.tanh()
        raise NotImplementedError


class Sigmoid(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.sigmoid()
        raise NotImplementedError

If minitorch.Tensor lacks any of relu, tanh, sigmoid, note the gap in your journal and add a TODO in minitorch — do NOT implement the op inside activations.py. Activations are modules; the math lives in the tensor library.

Block C — Softmax

  • Softmax(dim: int = -1) needs the subtract-max trick for numerical stability:
class Softmax(Module):
    def __init__(self, dim: int = -1) -> None:
        super().__init__()
        self.dim = dim

    def forward(self, x: Tensor) -> Tensor:
        # Numerical-stability trick: subtract the per-row max BEFORE exp.
        # x_shifted = x - x.max(dim=self.dim, keepdim=True)
        # exp_x    = x_shifted.exp()
        # return exp_x / exp_x.sum(dim=self.dim, keepdim=True)
        raise NotImplementedError
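  • The "subtract max" step is not an optimization: it prevents exp(large_number) from overflowing FP32, where exp overflows for inputs above roughly 88.7. Without it, a logit of 100 gives exp(100) = inf, the normalization becomes inf / inf = nan, and the gradient is nan as well. Softmax is invariant to adding a constant to every logit, so the shift changes nothing mathematically.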
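
The overflow is easy to reproduce in plain NumPy. This is a minimal sketch, assuming FP32 storage; naive_softmax and stable_softmax are illustrative names, not part of minitorch, and the Module version must use Tensor ops instead.

```python
import numpy as np


def naive_softmax(x):
    # exp is applied to the raw logits, so large values overflow.
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def stable_softmax(x):
    # Shift so the largest logit in each row is 0; exp then stays in (0, 1].
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


logits = np.array([[100.0, 100.5, 101.0]], dtype=np.float32)

# Silence the overflow / invalid-value warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)

stable = stable_softmax(logits)

# FP64 reference: exp(101) still fits in FP64, so no shift is needed here.
reference = naive_softmax(logits.astype(np.float64))
```

Here naive contains only nan, while stable agrees with the FP64 reference.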

Block D — __init__.py re-exports

In src/minimodel/nn/__init__.py:

  • Add from .linear import Linear.
  • Add from .activations import ReLU, Tanh, Sigmoid, Softmax.
  • Keep __all__ sorted and explicit.

Tests

In tests/test_linear_and_activations.py:

Block E — Linear shape and parameters

  • test_linear_shapes_batched:

    layer = Linear(4, 3)
    x = Tensor(np.random.randn(2, 4))   # (B=2, in=4)
    y = layer(x)
    assert y.data.shape == (2, 3)
    

  • test_linear_parameters_count:

    layer = Linear(4, 3)
    params = list(layer.parameters())
    assert len(params) == 2                     # weight, bias
    assert layer.weight.data.shape == (3, 4)    # (out, in)
    assert layer.bias.data.shape == (3,)
    

  • test_linear_no_bias:

    layer = Linear(4, 3, bias=False)
    params = list(layer.parameters())
    assert len(params) == 1
    assert layer.bias is None
    

  • test_linear_higher_rank_input: Input shape (B, T, in) → output (B, T, out). Verifies that the transpose + matmul contracts over the trailing dim and broadcasts over the leading dims.
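
The shape contract these tests pin down can be previewed in plain NumPy. This is a sketch only: it assumes minitorch's matmul follows NumPy's rule of contracting the last axis and broadcasting over leading axes, which this lab does not confirm.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 4, 3

# Same bound as the Linear skeleton: uniform in [-1/sqrt(in), +1/sqrt(in)].
bound = 1.0 / math.sqrt(in_features)
weight = rng.uniform(-bound, bound, size=(out_features, in_features))  # (out, in)
bias = np.zeros(out_features)                                          # (out,)

x2 = rng.standard_normal((2, in_features))       # (B, in)
x3 = rng.standard_normal((2, 5, in_features))    # (B, T, in)

# weight.T has shape (in, out); the bias broadcasts over all leading dims.
y2 = x2 @ weight.T + bias    # (B, out)
y3 = x3 @ weight.T + bias    # (B, T, out)
```

Each output row is the same affine map applied to one input vector, so y3[b, t] equals weight @ x3[b, t] + bias.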

Block F — Gradcheck

  • test_relu_gradcheck: Numeric finite-difference gradient vs autograd for ReLU. Sample 10 random inputs in [-1, 1]; finite-diff (f(x+ε) - f(x-ε)) / (2ε) with ε = 1e-4; compare to x.grad from y.sum().backward(). Tolerance 1e-4. Skip points where |x| < ε — ReLU is non-differentiable at 0.

  • test_tanh_gradcheck, test_sigmoid_gradcheck: same shape, no skip.

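  • test_softmax_gradcheck: Softmax is differentiable everywhere, so no skip. Use a 4-dim input. Because softmax outputs sum to 1, the gradient of y.sum() is identically zero and checking it alone proves little; if Tensor supports elementwise multiply, also check a weighted sum such as (y * w).sum() for a fixed random w. Tolerance 1e-4.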

  • test_linear_gradcheck: Gradcheck on both weight and bias for a Linear(2, 3) with a (2, 2) input batch.
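
The finite-difference side of these checks can be sketched in plain NumPy. numeric_grad is a hypothetical helper, and the closed-form derivatives below stand in for the x.grad values that autograd produces in the real tests.

```python
import numpy as np


def numeric_grad(f, x, eps=1e-4):
    """Central difference (f(x+eps) - f(x-eps)) / (2*eps), one coordinate at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


rng = np.random.default_rng(0)
eps = 1e-4
x = rng.uniform(-1.0, 1.0, size=10)

# Tanh: d/dx tanh(x) = 1 - tanh(x)^2.
num_tanh = numeric_grad(lambda v: np.tanh(v).sum(), x, eps)
ana_tanh = 1.0 - np.tanh(x) ** 2

# ReLU: compare only where |x| >= eps, away from the kink at 0.
keep = np.abs(x) >= eps
num_relu = numeric_grad(lambda v: np.maximum(v, 0.0).sum(), x, eps)
ana_relu = (x > 0).astype(x.dtype)

# Softmax: the plain sum is constant (always 1), so its gradient is ~0.
# A weighted sum L = sum(w * s) has gradient s * (w - sum(w * s)).
z = rng.standard_normal(4)
w = rng.standard_normal(4)
s = softmax(z)
num_plain = numeric_grad(lambda v: softmax(v).sum(), z, eps)
num_weighted = numeric_grad(lambda v: (softmax(v) * w).sum(), z, eps)
ana_weighted = s * (w - (w * s).sum())
```

The weighted softmax case is the informative one: a softmax backward that returned zeros would still match num_plain.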

Block G — Softmax numerical stability

  • test_softmax_large_logits_no_nan:

    x = Tensor(np.array([[100.0, 100.5, 101.0]]))
    y = Softmax(dim=-1)(x)
    assert not np.isnan(y.data).any()
    assert np.allclose(y.data.sum(axis=-1), 1.0)
    
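    Without the subtract-max trick this test fails with nan, provided Tensor stores FP32. If your Tensor keeps NumPy's default FP64, exp only overflows above about 709, so raise the logits (for example to 1000.0) to exercise the same failure.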

  • test_softmax_sums_to_one: For a random (4, 7) input, every row of the output sums to 1 within 1e-6.

Block H — Serialization round-trip

  • test_linear_state_dict_keys:

    state = Linear(4, 3).state_dict()
    assert set(state.keys()) == {"weight", "bias"}
    

  • test_linear_load_state_dict_roundtrip: Build two Linear(4, 3) with different random seeds; copy state from one to the other; assert the second's weight.data and bias.data are element-wise equal to the first's.

  • test_sequential_of_linear_activation: Construct (via a tiny ad-hoc Sequential-like list, or with the real Sequential if Block I is done) a stack Linear(4, 8) → ReLU → Linear(8, 3). Verify it has 4 parameters in order: weight, bias, weight, bias. Forward pass on (2, 4) returns (2, 3).

Constraints

  • No PyTorch in src/minimodel/. PyTorch reference values (if any) live in test fixtures only, and only for cross-checks — Lab 02 uses one such cross-check, not this lab.
  • No new ops in Tensor. If minitorch is missing relu/tanh/sigmoid/max/exp/sum, fix minitorch first and write the missing-op note in your journal.
  • A13 scope unchanged. No verb-specific code in this lab; activations and Linear are domain-agnostic.
  • No nn.Module from PyTorch sneaking in. Our Module is from minimodel.nn.module.

Pitfalls

  • In-place ops break autograd. x.data -= ... inside a forward pass corrupts the saved tensor and the backward pass returns wrong gradients. Always produce a new Tensor from ops.
  • Missing requires_grad propagation. If you wrap intermediate values in fresh Tensor(np.array(...)) calls (e.g. for the subtract-max trick), you can sever the autograd graph. Use tensor ops; do not construct new leaf tensors mid-forward.
  • Softmax without the max trick. exp(100.0) overflows FP32. Subtract the per-row max before exp.
  • Linear weight shape. PyTorch stores (out, in) and transposes in forward. We match that — checkpoints transfer to PyTorch in Phase 18.
  • Bias-less Linear registering None as a Parameter. Test: bias=False must leave _parameters without a "bias" key. Module.__setattr__ already handles this if you set self.bias = None; verify.
  • Activation __init__ forgetting super().__init__(). The Module base needs _parameters and _modules initialized, even for parameter-free activations. Without super().__init__(), list(ReLU().parameters()) crashes.
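
The in-place pitfall can be seen without any autograd library. This is a sketch with a hypothetical Square op that saves a reference to its input, as an autograd node typically does; it is not minitorch code.

```python
import numpy as np


class Square:
    """Hypothetical op y = x * x that keeps its input for the backward pass."""

    def forward(self, x):
        self.saved = x          # a reference to the caller's array, not a copy
        return x * x

    def backward(self, grad_out):
        # dy/dx = 2 * x, evaluated on whatever the saved array holds now.
        return grad_out * 2.0 * self.saved


x = np.array([1.0, 2.0, 3.0])
op = Square()
y = op.forward(x)

good = op.backward(np.ones_like(y))   # uses the original x

x -= 1.0                              # in-place edit after the forward pass
bad = op.backward(np.ones_like(y))    # silently uses the mutated x
```

The second backward call raises no error; it just returns gradients for an input that was never used in the forward pass.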

Stop conditions

Done when:

  1. Linear is ≤ 25 lines including type hints.
  2. Each activation is ≤ 6 lines.
  3. All tests in Blocks E–H green.
  4. mypy --strict src/minimodel/nn/ clean.
  5. ruff check src/minimodel/nn/ clean.
  6. You can explain why weight is shape (out, in) and not (in, out).
  7. You can explain why Softmax subtracts the max before exp (numerical, not mathematical).

When to consult solutions/

After all tests pass. solutions/01-linear-and-activations-ref.md (available at phase open) compares your modules against the canonical implementation, with notes on alternative initializations (Xavier vs Kaiming) and the bias-vs-no-bias debate.


Next lab: lab/02-optimizers.md.