Lab 01 — `Linear` and activation modules¶

Goal: implement Linear, ReLU, Tanh, Sigmoid, and Softmax as subclasses of the Module you built in Lab 00. Each is a thin wrapper around minitorch.Tensor ops; together they give you everything needed for the MLP in Lab 03. ~50 LOC by Borja.

Estimated time: 90–120 minutes.

Prereqs: Lab 00 closed (Parameter and Module green). Theory 02 read.

Once Module can discover parameters, a Linear fits in fifteen lines and each activation in four. The only interesting decisions are the initialization (Kaiming-uniform by default; Phase 10 will derive it) and the numerical stability of softmax (subtract the max before exp). Everything else is plumbing on top of Tensor.

What you produce¶

A new set of files in src/minimodel/nn/:

linear.py — Linear(in_features, out_features, bias=True).
activations.py — ReLU, Tanh, Sigmoid, Softmax(dim=-1).
__init__.py re-exports updated to include the new symbols.

And tests/test_linear_and_activations.py with shape, gradcheck, and state-dict tests.

TODOs¶

Block A — `Linear`¶

In src/minimodel/nn/linear.py:

Import math, numpy as np, Tensor from minitorch, and Parameter, Module from .module.
Define Linear(Module):

class Linear(Module):
    """Affine layer: y = x @ W.T + b. Stores weight as (out, in) per PyTorch."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Kaiming-uniform default; Phase 10 will derive the magnitude.
        bound = 1.0 / math.sqrt(in_features)
        # TODO: initialize self.weight as Parameter of shape (out_features, in_features)
        #       uniform in [-bound, +bound].
        # TODO: if bias, set self.bias = Parameter(np.zeros(out_features));
        #       else set self.bias = None (do NOT register as Parameter).
        raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        # Input x has shape (..., in_features). Output has shape (..., out_features).
        # TODO: y = x @ self.weight.transpose((1, 0))
        # TODO: if self.bias is not None, y = y + self.bias
        raise NotImplementedError

Note the bias-less case: assigning self.bias = None must not register anything. None is neither a Parameter nor a Module, so Module.__setattr__ should fall through to plain attribute storage.
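As a shape and initialization reference, the sketch below reproduces the affine map in plain NumPy, outside autograd. It assumes nothing about the minitorch API; `init_linear` and `linear_forward` are hypothetical helper names used only for illustration, not part of the lab's code.

```python
import math

import numpy as np


def init_linear(in_features, out_features, rng):
    """Draw weight uniformly in [-bound, +bound], bound = 1 / sqrt(in_features)."""
    bound = 1.0 / math.sqrt(in_features)
    weight = rng.uniform(-bound, bound, size=(out_features, in_features))
    bias = np.zeros(out_features)
    return weight, bias


def linear_forward(x, weight, bias=None):
    """y = x @ W.T + b, with W stored as (out, in)."""
    y = x @ weight.T
    if bias is not None:
        y = y + bias
    return y


rng = np.random.default_rng(0)
weight, bias = init_linear(4, 3, rng)
x = rng.standard_normal((2, 4))        # (B=2, in=4)
y = linear_forward(x, weight, bias)
print(weight.shape, y.shape)           # (3, 4) (2, 3)
```

Your Linear.forward should produce the same values once the arrays are wrapped in Parameter and Tensor, which makes this a convenient cross-check when a shape test fails.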

Block B — `ReLU`, `Tanh`, `Sigmoid`¶

In src/minimodel/nn/activations.py:

Each is a one-method Module. No parameters; parameters() yields nothing.

class ReLU(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.relu() (or x.maximum(0) if relu() lives elsewhere).
        raise NotImplementedError


class Tanh(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.tanh()
        raise NotImplementedError


class Sigmoid(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.sigmoid()
        raise NotImplementedError

If minitorch.Tensor lacks any of relu, tanh, sigmoid, note the gap in your journal and add a TODO in minitorch — do NOT implement the op inside activations.py. Activations are modules; the math lives in the tensor library.

Block C — `Softmax`¶

Softmax(dim: int = -1) needs the subtract-max trick for numerical stability:

class Softmax(Module):
    def __init__(self, dim: int = -1) -> None:
        super().__init__()
        self.dim = dim

    def forward(self, x: Tensor) -> Tensor:
        # Numerical-stability trick: subtract the per-row max BEFORE exp.
        # x_shifted = x - x.max(dim=self.dim, keepdim=True)
        # exp_x    = x_shifted.exp()
        # return exp_x / exp_x.sum(dim=self.dim, keepdim=True)
        raise NotImplementedError

The subtract-max step is not an optimization: it keeps exp from overflowing. In FP32, exp(x) becomes inf once x exceeds about 88.7, so without the shift a logit of 100 produces inf / inf = nan in the output and nan in the gradient.
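The overflow is easy to reproduce in plain NumPy. The sketch below is an illustration only: `softmax_naive` and `softmax_stable` are hypothetical names, and the arrays are forced to float32 so the overflow appears at a logit of 100.

```python
import numpy as np


def softmax_naive(x, axis=-1):
    """Softmax without the max shift; overflows for large logits."""
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def softmax_stable(x, axis=-1):
    """Softmax with the per-row max subtracted before exp."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


large = np.array([[100.0, 100.5, 101.0]], dtype=np.float32)
small = np.array([[0.1, -0.3, 0.7]], dtype=np.float32)

# Silence the expected overflow / invalid-value warnings for the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large)       # exp -> inf, then inf / inf -> nan
stable = softmax_stable(large)

print(np.isnan(naive).all())           # True
print(np.isfinite(stable).all())       # True
# On inputs that do not overflow, both versions agree: the shift cancels out.
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True
```

The last line is the reason the shift is safe: subtracting a per-row constant multiplies numerator and denominator by the same factor, so the result is unchanged.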

Block D — `__init__.py` re-exports¶

In src/minimodel/nn/__init__.py:

Add from .linear import Linear.
Add from .activations import ReLU, Tanh, Sigmoid, Softmax.
Keep __all__ sorted and explicit.

Tests¶

In tests/test_linear_and_activations.py:

Block E — `Linear` shape and parameters¶

test_linear_shapes_batched:

layer = Linear(4, 3)
x = Tensor(np.random.randn(2, 4))   # (B=2, in=4)
y = layer(x)
assert y.data.shape == (2, 3)

test_linear_parameters_count:

layer = Linear(4, 3)
params = list(layer.parameters())
assert len(params) == 2                     # weight, bias
assert layer.weight.data.shape == (3, 4)    # (out, in)
assert layer.bias.data.shape == (3,)

test_linear_no_bias:

layer = Linear(4, 3, bias=False)
params = list(layer.parameters())
assert len(params) == 1
assert layer.bias is None

test_linear_higher_rank_input: Input shape (B, T, in) → output (B, T, out). Verifies that the transpose + matmul contracts the trailing dim and broadcasts over the leading dims.
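The higher-rank case can be checked against NumPy's matmul semantics. This is a sketch under the assumption that minitorch's `@` follows the same broadcasting rule as NumPy; if it does not, the test above is where you will find out.

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.standard_normal((3, 4))       # (out, in)
x = rng.standard_normal((2, 5, 4))         # (B, T, in)

# matmul treats the leading dims (B, T) as batch dims and contracts the last one.
y = x @ weight.T

# Equivalent formulation: flatten the batch dims, multiply, restore the shape.
y_flat = (x.reshape(-1, 4) @ weight.T).reshape(2, 5, 3)

print(y.shape)                             # (2, 5, 3)
print(np.allclose(y, y_flat))              # True
```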

Block F — Gradcheck¶

test_relu_gradcheck: Numeric finite-difference gradient vs autograd for ReLU. Sample 10 random inputs in [-1, 1]; finite-diff (f(x+ε) - f(x-ε)) / (2ε) with ε = 1e-4; compare to x.grad from y.sum().backward(). Tolerance 1e-4. Skip points where |x| < ε — ReLU is non-differentiable at 0.
test_tanh_gradcheck, test_sigmoid_gradcheck: same procedure as the ReLU check, with no skipped points.
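test_softmax_gradcheck: Softmax is differentiable everywhere, so no points are skipped. Because every softmax output sums to 1, y.sum() is constant and its gradient is identically zero, which makes the check pass trivially; reduce with fixed random weights instead, e.g. (y * w).sum(), assuming Tensor supports elementwise multiplication. Use a 4-dim input. Tolerance 1e-4.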
test_linear_gradcheck: Gradcheck on both weight and bias for a Linear(2, 3) with a (2, 2) input batch.
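The finite-difference side of these checks can be sketched in plain NumPy. Here the analytic derivatives stand in for x.grad, which in the real tests comes from backward(); `numeric_grad` is a hypothetical helper name. The softmax part uses a weighted sum because sum(softmax(x)) is always 1 and therefore has zero gradient.

```python
import numpy as np


def numeric_grad(f, x, eps=1e-4):
    """Central-difference gradient of a scalar-valued f, one element at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += eps
        x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


rng = np.random.default_rng(0)
eps = 1e-4
x = rng.uniform(-1.0, 1.0, size=10)

# tanh: d/dx sum(tanh(x)) = 1 - tanh(x)^2
tanh_num = numeric_grad(lambda v: np.tanh(v).sum(), x, eps)
tanh_ana = 1.0 - np.tanh(x) ** 2

# ReLU: derivative is 1 for x > 0 and 0 for x < 0; skip points near the kink.
keep = np.abs(x) >= eps
relu_num = numeric_grad(lambda v: np.maximum(v, 0.0).sum(), x, eps)
relu_ana = (x > 0).astype(x.dtype)

# Softmax with loss L = sum(w * s): dL/dz_j = s_j * (w_j - sum(s * w)).
z = rng.standard_normal(4)
w = rng.standard_normal(4)
s = softmax(z)
soft_num = numeric_grad(lambda v: (softmax(v) * w).sum(), z, eps)
soft_ana = s * (w - (s * w).sum())

assert np.allclose(tanh_num, tanh_ana, atol=1e-4)
assert np.allclose(relu_num[keep], relu_ana[keep], atol=1e-4)
assert np.allclose(soft_num, soft_ana, atol=1e-4)
```

In the lab's tests, replace each analytic array with the corresponding x.grad and keep the numeric side unchanged.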

Block G — Softmax numerical stability¶

test_softmax_large_logits_no_nan:

x = Tensor(np.array([[100.0, 100.5, 101.0]]))
y = Softmax(dim=-1)(x)
assert not np.isnan(y.data).any()
assert np.allclose(y.data.sum(axis=-1), 1.0)

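Without the subtract-max trick this test fails with nan when Tensor stores float32 data: exp(100) overflows to inf and inf / inf is nan. np.array of Python floats defaults to float64, where exp only overflows above roughly 709, so if your Tensor keeps float64, raise the logits (e.g. to 1000) to exercise the same failure.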

test_softmax_sums_to_one: For a random (4, 7) input, every row of the output sums to 1 within 1e-6.

Block H — Serialization round-trip¶

test_linear_state_dict_keys:

state = Linear(4, 3).state_dict()
assert set(state.keys()) == {"weight", "bias"}

test_linear_load_state_dict_roundtrip: Build two Linear(4, 3) with different random seeds; copy state from one to the other; assert the second's weight.data and bias.data are element-wise equal to the first's.
test_sequential_of_linear_activation: Construct (via a tiny ad-hoc Sequential-like list, or with the real Sequential if Block I is done) a stack Linear(4, 8) → ReLU → Linear(8, 3). Verify it has 4 parameters in order: weight, bias, weight, bias. Forward pass on (2, 4) returns (2, 3).
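The round-trip test has a simple shape that can be sketched without the real classes. `TinyLinear` below is a hypothetical stand-in holding raw NumPy arrays; it is not the Lab 00 Module API, whose state_dict and load_state_dict may differ in detail.

```python
import numpy as np


class TinyLinear:
    """Hypothetical stand-in for Linear; stores NumPy arrays, not Parameters."""

    def __init__(self, in_features, out_features, seed):
        rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_features)
        self.weight = rng.uniform(-bound, bound, size=(out_features, in_features))
        self.bias = np.zeros(out_features)

    def state_dict(self):
        # Copy so later edits to the dict cannot alias the layer's arrays.
        return {"weight": self.weight.copy(), "bias": self.bias.copy()}

    def load_state_dict(self, state):
        if set(state) != {"weight", "bias"}:
            raise KeyError(f"unexpected keys: {sorted(state)}")
        self.weight = state["weight"].copy()
        self.bias = state["bias"].copy()


src = TinyLinear(4, 3, seed=0)
dst = TinyLinear(4, 3, seed=1)
assert not np.array_equal(src.weight, dst.weight)   # different seeds differ

dst.load_state_dict(src.state_dict())
assert np.array_equal(dst.weight, src.weight)
assert np.array_equal(dst.bias, src.bias)
```

Asserting that the two layers differ before the copy guards against a test that passes only because both layers happened to start from the same values.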

Constraints¶

No PyTorch in src/minimodel/. PyTorch reference values (if any) live in test fixtures only, and only for cross-checks — Lab 02 uses one such cross-check, not this lab.
No tensor ops implemented inside minimodel. If minitorch is missing relu/tanh/sigmoid/max/exp/sum, fix minitorch first and write the missing-op note in your journal.
A13 scope unchanged. No verb-specific code in this lab; activations and Linear are domain-agnostic.
No nn.Module from PyTorch sneaking in. Our Module is from minimodel.nn.module.

Pitfalls¶

In-place ops break autograd. x.data -= ... inside a forward pass corrupts the saved tensor and the backward pass returns wrong gradients. Always produce a new Tensor from ops.
Missing requires_grad propagation. If you wrap intermediate values in fresh Tensor(np.array(...)) calls (e.g. for the subtract-max trick), you can sever the autograd graph. Use tensor ops; do not construct new leaf tensors mid-forward.
Softmax without the max trick. exp(100.0) overflows FP32. Subtract the per-row max before exp.
Linear weight shape. PyTorch stores (out, in) and transposes in forward. We match that — checkpoints transfer to PyTorch in Phase 18.
Bias-less Linear registering None as a Parameter. Test: bias=False must leave _parameters without a "bias" key. Module.__setattr__ already handles this if you set self.bias = None; verify.
Activation __init__ forgetting super().__init__(). The Module base needs _parameters and _modules initialized, even for parameter-free activations. Without super().__init__(), list(ReLU().parameters()) crashes.
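The last two pitfalls can be reproduced with a toy registry. The `Parameter` and `Module` classes below are hypothetical stand-ins written for this illustration; your Lab 00 implementation may register attributes differently, so treat the behavior shown as an assumption to verify against your own code.

```python
class Parameter:
    """Hypothetical stand-in: tags a value as trainable."""

    def __init__(self, data):
        self.data = data


class Module:
    """Hypothetical stand-in for the Lab 00 Module."""

    def __init__(self):
        # Create the registries without going through our own __setattr__.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        # Anything else (including None) is stored as a plain attribute only.
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()


class NoBias(Module):
    def __init__(self):
        super().__init__()
        self.weight = Parameter([1.0])
        self.bias = None                 # not registered


class Forgetful(Module):
    def __init__(self):
        self.dim = -1                    # super().__init__() was skipped


layer = NoBias()
print("bias" in layer._parameters)       # False
print(len(list(layer.parameters())))     # 1

try:
    list(Forgetful().parameters())
except AttributeError as exc:
    print("crash:", exc)
```

The crash only appears when parameters() is consumed, not at construction time, which is why a missing super().__init__() can go unnoticed until a test iterates the parameters.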

Stop conditions¶

Done when:

Linear is ≤ 25 lines including type hints.
Each activation is ≤ 6 lines.
All tests in Blocks E–H green.
mypy --strict src/minimodel/nn/ clean.
ruff check src/minimodel/nn/ clean.
You can explain why weight is shape (out, in) and not (in, out).
You can explain why Softmax subtracts the max before exp (numerical, not mathematical).

When to consult `solutions/`¶

After all tests pass. solutions/01-linear-and-activations-ref.md (at phase open) compares your modules against the canonical implementation, with notes on alternative initializations (Xavier vs Kaiming) and the bias-vs-no-bias debate.

Next lab: lab/02-optimizers.md.
