Lab 01 — Linear and activation modules¶
Goal: implement Linear, ReLU, Tanh, Sigmoid, and Softmax as subclasses of the Module you built in Lab 00. Each is a thin wrapper around minitorch.Tensor ops; together they give you everything needed for the MLP in Lab 03. ~50 LOC by Borja.
Estimated time: 90–120 minutes.
Prereqs: Lab 00 closed (Parameter and Module green). Theory 02 read.
Once Module discovers parameters, a Linear fits in fifteen lines and each activation in four. The only interesting decisions are the initialization (Kaiming-uniform by default; Phase 10 will derive it) and the numerical stability of the softmax (subtract the max before the exp). Everything else is plumbing on top of Tensor.
What you produce¶
A new set of files in src/minimodel/nn/:
- linear.py: Linear(in_features, out_features, bias=True).
- activations.py: ReLU, Tanh, Sigmoid, Softmax(dim=-1).
- __init__.py: re-exports updated to include the new symbols.
And tests/test_linear_and_activations.py with shape, gradcheck, and state-dict tests.
TODOs¶
Block A — Linear¶
In src/minimodel/nn/linear.py:
- Import math, numpy as np, Tensor from minitorch, and Parameter, Module from .module.
- Define Linear(Module):

```python
class Linear(Module):
    """Affine layer: y = x @ W.T + b. Stores weight as (out, in) per PyTorch."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Kaiming-uniform default; Phase 10 will derive the magnitude.
        bound = 1.0 / math.sqrt(in_features)
        # TODO: initialize self.weight as Parameter of shape (out_features, in_features)
        #       uniform in [-bound, +bound].
        raise NotImplementedError
        # TODO: if bias, set self.bias = Parameter(np.zeros(out_features));
        #       else set self.bias = None (do NOT register as Parameter).

    def forward(self, x: Tensor) -> Tensor:
        # Input x has shape (..., in_features). Output has shape (..., out_features).
        # TODO: y = x @ self.weight.transpose((1, 0))
        # TODO: if self.bias is not None, y = y + self.bias
        raise NotImplementedError
```

- Note the bias-less case: assigning self.bias = None must not cause Module.__setattr__ to register it. None isn't a Parameter or a Module, so it falls through to plain attribute storage.
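As a reference for the arithmetic only, here is a minimal NumPy sketch of the initialization bound and the affine map described above. It deliberately avoids minitorch and Module, so it is not the lab solution; the helper names init_linear and linear_forward are hypothetical, and the seeded generator is an assumption made for reproducibility.

```python
import math

import numpy as np


def init_linear(in_features, out_features, rng):
    """Return (weight, bias) as plain arrays. Hypothetical helper, not the lab API."""
    bound = 1.0 / math.sqrt(in_features)
    # Weight is stored as (out, in), sampled uniformly within [-bound, +bound].
    weight = rng.uniform(-bound, bound, size=(out_features, in_features))
    bias = np.zeros(out_features)
    return weight, bias


def linear_forward(x, weight, bias=None):
    """Compute y = x @ W.T + b for x of shape (..., in_features)."""
    y = x @ weight.T
    if bias is not None:
        y = y + bias
    return y


rng = np.random.default_rng(0)
weight, bias = init_linear(4, 3, rng)
x = rng.standard_normal((2, 4))
y = linear_forward(x, weight, bias)
print(weight.shape, y.shape)  # (3, 4) (2, 3)
```

With in_features = 4 the bound is 0.5, so every weight entry lies in [-0.5, 0.5]. Your Linear should produce values in the same range, wrapped in a Parameter.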
Block B — ReLU, Tanh, Sigmoid¶
In src/minimodel/nn/activations.py:
- Each is a one-method Module. No parameters; parameters() yields nothing.

```python
class ReLU(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.relu() (or x.maximum(0) if relu() lives elsewhere).
        raise NotImplementedError


class Tanh(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.tanh()
        raise NotImplementedError


class Sigmoid(Module):
    def forward(self, x: Tensor) -> Tensor:
        # TODO: return x.sigmoid()
        raise NotImplementedError
```
If minitorch.Tensor lacks any of relu, tanh, sigmoid, note the gap in your journal and add a TODO in minitorch — do NOT implement the op inside activations.py. Activations are modules; the math lives in the tensor library.
Block C — Softmax¶
- Softmax(dim: int = -1) needs the subtract-max trick for numerical stability:

```python
class Softmax(Module):
    def __init__(self, dim: int = -1) -> None:
        super().__init__()
        self.dim = dim

    def forward(self, x: Tensor) -> Tensor:
        # Numerical-stability trick: subtract the per-row max BEFORE exp.
        # x_shifted = x - x.max(dim=self.dim, keepdim=True)
        # exp_x = x_shifted.exp()
        # return exp_x / exp_x.sum(dim=self.dim, keepdim=True)
        raise NotImplementedError
```

- The "subtract max" is not an optimization; it prevents exp(large_number) from overflowing FP32, where exp overflows for inputs above ≈ 88.7. Without it, a logit of 100 makes exp return inf, and the softmax output and its gradient become nan.
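The overflow is easy to reproduce outside minitorch. The sketch below uses plain NumPy in float32 to contrast a naive softmax with the shifted one; softmax_naive and softmax_stable are hypothetical names for illustration, not part of the lab's API.

```python
import numpy as np


def softmax_naive(x):
    """Softmax without the max shift; overflows for large logits."""
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)


def softmax_stable(x):
    """Softmax with the per-row max subtracted before exp."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)


logits = np.array([[100.0, 100.5, 101.0]], dtype=np.float32)

# Silence the overflow/invalid warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)

stable = softmax_stable(logits)
print(np.isnan(naive).all())  # True: every entry is inf / inf
print(stable)                 # finite probabilities; the row sums to 1 up to rounding
```

Shifting by the max changes nothing mathematically, because the common factor exp(-max) cancels between numerator and denominator. It only keeps every exponent at or below zero.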
Block D — __init__.py re-exports¶
In src/minimodel/nn/__init__.py:
- Add from .linear import Linear.
- Add from .activations import ReLU, Tanh, Sigmoid, Softmax.
- Keep __all__ sorted and explicit.
Tests¶
In tests/test_linear_and_activations.py:
Block E — Linear shape and parameters¶
- test_linear_shapes_batched:
- test_linear_parameters_count:
- test_linear_no_bias:
- test_linear_higher_rank_input: Input shape (B, T, in) → output (B, T, out). Verifies that the transpose + matmul contracts over the trailing dim and broadcasts over the leading ones.
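The higher-rank case can be previewed in plain NumPy, whose matmul broadcasts over leading dimensions. Whether minitorch's matmul does the same is exactly what the test checks, so treat this as a description of the target behavior, not of minitorch itself. The sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, in_features, out_features = 2, 5, 4, 3
weight = rng.standard_normal((out_features, in_features))  # stored as (out, in)
bias = np.zeros(out_features)

x2 = rng.standard_normal((B, in_features))
x3 = rng.standard_normal((B, T, in_features))

y2 = x2 @ weight.T + bias
y3 = x3 @ weight.T + bias  # matmul broadcasts over the leading (B, T) dims

# Applying the layer to one time step at a time gives the same result.
y3_loop = np.stack([x3[:, t] @ weight.T + bias for t in range(T)], axis=1)

print(y2.shape, y3.shape)  # (2, 3) (2, 5, 3)
```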
Block F — Gradcheck¶
- test_relu_gradcheck: Numeric finite-difference gradient vs autograd for ReLU. Sample 10 random inputs in [-1, 1]; finite-diff (f(x+ε) - f(x-ε)) / (2ε) with ε = 1e-4; compare to x.grad from y.sum().backward(). Tolerance 1e-4. Skip points where |x| < ε, since ReLU is non-differentiable at 0.
- test_tanh_gradcheck, test_sigmoid_gradcheck: same shape, no skip.
- test_softmax_gradcheck: Softmax + sum is differentiable everywhere. Use a 4-dim input. Tolerance 1e-4.
- test_linear_gradcheck: Gradcheck on both weight and bias for a Linear(2, 3) with a (2, 2) input batch.
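The finite-difference recipe above can be sketched in plain NumPy. Here analytic derivatives stand in for the x.grad that the real tests read from minitorch autograd, and numeric_grad is a hypothetical helper name. One caveat: each softmax row sums to 1, so y.sum() is constant and its gradient is zero everywhere. The sketch therefore also checks a weighted sum with fixed random weights w; that is an assumption of this example, not something the lab prescribes.

```python
import numpy as np


def numeric_grad(f, x, eps=1e-4):
    """Central difference of the scalar function f at x, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
eps = 1e-4
x = rng.uniform(-1.0, 1.0, size=10)

# tanh: d/dx sum(tanh(x)) = 1 - tanh(x)^2
tanh_num = numeric_grad(lambda v: np.tanh(v).sum(), x, eps)
tanh_err = np.abs(tanh_num - (1 - np.tanh(x) ** 2)).max()

# ReLU: compare only where |x| >= eps, away from the kink at 0.
keep = np.abs(x) >= eps
relu_num = numeric_grad(lambda v: np.maximum(v, 0.0).sum(), x, eps)
relu_err = np.abs(relu_num - (x > 0).astype(float))[keep].max()

# Softmax: gradient of sum(w * softmax(z)) is s * (w - sum(s * w)).
z = rng.uniform(-1.0, 1.0, size=4)
w = rng.uniform(-1.0, 1.0, size=4)
s = softmax(z)
softmax_analytic = s * (w - (s * w).sum())
softmax_num = numeric_grad(lambda v: (w * softmax(v)).sum(), z, eps)
softmax_err = np.abs(softmax_num - softmax_analytic).max()

# The unweighted sum is constant, so its numeric gradient is (almost) zero.
plain_sum_grad = numeric_grad(lambda v: softmax(v).sum(), z, eps)
```

All three errors should fall well under the 1e-4 tolerance, since the central difference has error of order ε².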
Block G — Softmax numerical stability¶
- test_softmax_large_logits_no_nan:

```python
x = Tensor(np.array([[100.0, 100.5, 101.0]]))
y = Softmax(dim=-1)(x)
assert not np.isnan(y.data).any()
assert np.allclose(y.data.sum(axis=-1), 1.0)
```

  Without the subtract-max trick this test fails with nan.
- test_softmax_sums_to_one: For a random (4, 7) input, every row of the output sums to 1 within 1e-6.
Block H — Serialization round-trip¶
- test_linear_state_dict_keys:
- test_linear_load_state_dict_roundtrip: Build two Linear(4, 3) with different random seeds; copy state from one to the other; assert the second's weight.data and bias.data are element-wise equal to the first's.
- test_sequential_of_linear_activation: Construct (via a tiny ad-hoc Sequential-like list, or with the real Sequential if Block I is done) a stack Linear(4, 8) → ReLU → Linear(8, 3). Verify it has 4 parameters in order: weight, bias, weight, bias. Forward pass on (2, 4) returns (2, 3).
Constraints¶
- No PyTorch in src/minimodel/. PyTorch reference values (if any) live in test fixtures only, and only for cross-checks. Lab 02 uses one such cross-check; this lab does not.
- No new ops in Tensor. If minitorch is missing relu/tanh/sigmoid/max/exp/sum, fix minitorch first and write the missing-op note in your journal.
- A13 scope unchanged. No verb-specific code in this lab; activations and Linear are domain-agnostic.
- No nn.Module from PyTorch sneaking in. Our Module is from minimodel.nn.module.
Pitfalls¶
- In-place ops break autograd. x.data -= ... inside a forward pass corrupts the saved tensor and the backward pass returns wrong gradients. Always produce a new Tensor from ops.
- Missing requires_grad propagation. If you wrap intermediate values in fresh Tensor(np.array(...)) calls (e.g. for the subtract-max trick), you can sever the autograd graph. Use tensor ops; do not construct new leaf tensors mid-forward.
- Softmax without the max trick. exp(100.0) overflows FP32. Subtract the per-row max before exp.
- Linear weight shape. PyTorch stores (out, in) and transposes in forward. We match that, so checkpoints transfer to PyTorch in Phase 18.
- Bias-less Linear registering None as a Parameter. Test: bias=False must leave _parameters without a "bias" key. Module.__setattr__ already handles this if you set self.bias = None; verify.
- Activation __init__ forgetting super().__init__(). The Module base needs _parameters and _modules initialized, even for parameter-free activations. Without super().__init__(), list(ReLU().parameters()) crashes.
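The last two pitfalls are easier to see with a toy registry. The sketch below is a stripped-down stand-in for the Lab 00 Parameter and Module, written only to illustrate the mechanism. Your real Module may store things differently; in particular, storing registered values as ordinary attributes too is an assumption of this sketch. BrokenReLU is a hypothetical class that omits super().__init__().

```python
class Parameter:
    """Toy stand-in for the Lab 00 Parameter; holds a payload only."""

    def __init__(self, data):
        self.data = data


class Module:
    """Toy stand-in for the Lab 00 Module; registration logic only."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the registries.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        # Anything else (ints, None, ...) is plain attribute storage only.
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()


class Linear(Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = Parameter([[0.0] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features) if bias else None


class BrokenReLU(Module):
    def __init__(self):
        pass  # forgot super().__init__(): no _parameters, no _modules


with_bias = Linear(4, 3)
no_bias = Linear(4, 3, bias=False)
print(list(with_bias._parameters))  # ['weight', 'bias']
print(list(no_bias._parameters))    # ['weight']

try:
    list(BrokenReLU().parameters())
    crashed = False
except AttributeError:
    crashed = True
print(crashed)  # True
```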
Stop conditions¶
Done when:
- Linear is ≤ 25 lines including type hints.
- Each activation is ≤ 6 lines.
- All tests in Blocks E–H green.
- mypy --strict src/minimodel/nn/ clean.
- ruff check src/minimodel/nn/ clean.
- You can explain why weight is shape (out, in) and not (in, out).
- You can explain why Softmax subtracts the max before exp (numerical, not mathematical).
When to consult solutions/¶
After all tests pass. solutions/01-linear-and-activations-ref.md (at phase open) compares your modules against the canonical implementation, with notes on alternative initializations (Xavier vs Kaiming) and the bias-vs-no-bias debate.
Next lab: lab/02-optimizers.md.