Lab 02 — Train a scalar MLP on a tiny tense-identity task using only minigrad.scalar

Goal: build a 2-layer MLP from Value neurons, train it on a microscopic grammar-tense task, and watch the loss curve descend. This is about as small as an end-to-end ML training run can get while still exercising every component, and it is anchored in the §A13 verb-grammar domain.

Estimated time: 90–120 minutes.

Prereqs: lab 00, lab 01 (all ops implemented and tested).


The task (the §A13 anchor)

Pick one verb — let's say work. Its 5 tenses are:

index  tense             English form  Spanish
0      infinitive        (to) work     trabajar
1      present (3rd sg)  works         trabaja
2      past simple       worked        trabajó
3      past participle   worked        trabajado
4      future (will)     will work     trabajará

The task is the 5-way tense-identity mapping: given a 5-dim one-hot input encoding "which tense is this", produce a 5-dim output whose argmax equals the input's argmax. It is artificially simple — a perfect autoencoder for a one-hot — but that is exactly the point: every nontrivial component of the model (autograd, parameters, loss, training loop) must be correct for the network to learn this. Any failure is a Phase-7 bug, not a hard-problem bug.

What you produce

A directory experiments/07-train-tense-logits/ containing:

  • model.py — Neuron, Layer, MLP classes built from Value. ~60 lines.
  • train.py — the training loop. ~40 lines.
  • loss.png — loss curve over training.
  • predictions.json — the trained MLP's outputs on the 5 inputs.
  • manifest.json — standard schema.
  • README.md — what you trained, what loss you reached, how long it took, what you'd improve.

Plus a separate directory experiments/07-visualize-graph/:

  • viz.py — builds a small expression, renders the DAG via graphviz, saves as SVG.
  • graph.svg — the rendered graph, nodes labeled with forward data and backward grad.
  • manifest.json.

TODOs (experiment 1: tense identity)

Block A — Neuron, Layer, MLP

In model.py:

  • Neuron: takes n_in inputs. Owns w: list[Value] of length n_in (randomly initialised to small values, e.g., from random.uniform(-1, 1)) and b: Value (initialised to 0). __call__(self, xs: list[Value]) returns (sum(wᵢ · xᵢ) + b).tanh().
  • Layer: takes n_in, n_out. Owns neurons: list[Neuron]. __call__(self, xs) returns the list of each neuron's output.
  • MLP: takes n_in, layer_sizes: list[int]. Owns layers: list[Layer]. __call__(self, xs) chains them. A multi-output last layer (the case here — 5 outputs) returns the list, not a single Value.
  • parameters(self) method on each (return list of Value for all weights and biases). Phase 9 introduces Parameter; for Phase 7 just collect Values manually.
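The three classes above can be sketched as follows. This is a minimal illustration, not the reference solution: the `Value` class here is a forward-only stand-in (an assumption, written so the snippet runs on its own), and in the lab you would import `Value` from `minigrad.scalar` instead.

```python
import math
import random


class Value:
    """Forward-only stand-in for minigrad.scalar.Value (assumption, illustration only)."""

    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def tanh(self):
        return Value(math.tanh(self.data))


class Neuron:
    def __init__(self, n_in):
        # Small random weights break symmetry; the bias starts at 0.
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]
        self.b = Value(0.0)

    def __call__(self, xs):
        act = self.b
        for wi, xi in zip(self.w, xs):
            act = act + wi * xi
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, n_in, n_out):
        self.neurons = [Neuron(n_in) for _ in range(n_out)]

    def __call__(self, xs):
        return [n(xs) for n in self.neurons]

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP:
    def __init__(self, n_in, layer_sizes):
        sizes = [n_in] + list(layer_sizes)
        self.layers = [Layer(a, b) for a, b in zip(sizes, sizes[1:])]

    def __call__(self, xs):
        for layer in self.layers:
            xs = layer(xs)
        return xs  # a list of Values, one per output neuron

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(42)
model = MLP(5, [4, 5])
out = model([Value(1), Value(0), Value(0), Value(0), Value(0)])
n_params = len(model.parameters())  # (5*4 + 4) + (4*5 + 5) = 49
```

A quick sanity check for your own model.py is the parameter count: MLP(5, [4, 5]) should own 49 Values (24 in the hidden layer, 25 in the output layer).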

Block B — Tense dataset

The 5 input/target pairs are the 5 one-hot tense vectors for work:

input                target
(1, 0, 0, 0, 0)  →  (1, -1, -1, -1, -1)    # infinitive  / to work
(0, 1, 0, 0, 0)  →  (-1, 1, -1, -1, -1)    # present 3sg / works
(0, 0, 1, 0, 0)  →  (-1, -1, 1, -1, -1)    # past simple / worked
(0, 0, 0, 1, 0)  →  (-1, -1, -1, 1, -1)    # participle  / worked
(0, 0, 0, 0, 1)  →  (-1, -1, -1, -1, 1)    # future      / will work
  • Encode as xs: list[list[Value]] (5 inputs, each a 5-vector of Values) and ys: list[list[Value]] (5 targets).
  • Use tanh activations in the model. With tanh, the "0 label" is best encoded as -1 (since tanh outputs are in (-1, 1)). Use {-1, 1} encoding for the targets.
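The dataset above can be generated rather than typed by hand. The sketch below uses plain floats so it runs without minigrad; the helper names `one_hot` and `tanh_target` are hypothetical, chosen only for this illustration.

```python
N_TENSES = 5


def one_hot(index, n=N_TENSES):
    """Input encoding: 1.0 at the tense index, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(n)]


def tanh_target(index, n=N_TENSES):
    """Target encoding for tanh outputs: +1.0 at the tense index, -1.0 elsewhere."""
    return [1.0 if i == index else -1.0 for i in range(n)]


xs_raw = [one_hot(i) for i in range(N_TENSES)]
ys_raw = [tanh_target(i) for i in range(N_TENSES)]

# In the lab, wrap every float in a Value before training, for example:
#   xs = [[Value(v) for v in row] for row in xs_raw]
#   ys = [[Value(v) for v in row] for row in ys_raw]
```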

The task is deliberately trivial: given a one-hot of "which tense", return that same one-hot. The point is not to learn grammar (Phase 9 does that with the full grid) but to confirm that your autograd, your loss, and your training loop work end to end.

Block C — training loop

In train.py:

  • Instantiate model = MLP(5, [4, 5]) — 5 inputs (one-hot tense), one hidden layer of 4 neurons, 5 outputs.
  • Hyperparams: lr = 0.05, n_epochs = 300.
  • For each epoch:
      • Compute predictions: preds = [model(x) for x in xs]. Each pred is a list of 5 Values.
      • Compute loss: loss = sum((p - y)**2 for pred, target in zip(preds, ys) for p, y in zip(pred, target)). (Sum of squared errors over the 25 logits = 5 inputs × 5 outputs.) Loss is a Value.
      • Zero gradients on all parameters: for p in model.parameters(): p.grad = 0.0.
      • loss.backward().
      • Update parameters: for p in model.parameters(): p.data -= lr * p.grad.
      • Log epoch, loss.data to a structured logger (using Phase 6's get_logger).
  • Plot loss vs epoch with matplotlib. Save as loss.png.
  • After training, run model(x) for each tense one-hot. Save the outputs as predictions.json.
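The loop above, end to end, might look like the sketch below. Everything here is an assumption made so the snippet is self-contained: `Value` is a compact stand-in for `minigrad.scalar.Value`, `random.seed` stands in for `seed_everything`, and logging, plotting and predictions.json are left out. It illustrates the order of the steps, nothing more.

```python
import math
import random


class Value:
    """Compact stand-in for minigrad.scalar.Value (an assumption, not the lab's code)."""
    def __init__(self, data, _prev=()):
        self.data, self.grad = float(data), 0.0
        self._prev, self._backward = _prev, lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    __radd__ = __add__  # lets sum() start from the int 0

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        return self + other * -1.0

    def __pow__(self, k):  # k is a plain number
        out = Value(self.data ** k, (self,))
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


class Neuron:
    def __init__(self, n_in):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]
        self.b = Value(0.0)

    def __call__(self, xs):
        return sum((w * x for w, x in zip(self.w, xs)), self.b).tanh()


class MLP:
    def __init__(self, n_in, layer_sizes):
        sizes = [n_in] + layer_sizes
        self.layers = [[Neuron(a) for _ in range(b)] for a, b in zip(sizes, sizes[1:])]

    def __call__(self, xs):
        for layer in self.layers:
            xs = [neuron(xs) for neuron in layer]
        return xs

    def parameters(self):
        return [p for layer in self.layers for n in layer for p in n.w + [n.b]]


random.seed(42)  # the lab calls seed_everything(42) from Phase 6 instead
xs = [[Value(float(i == j)) for j in range(5)] for i in range(5)]
ys = [[Value(1.0 if i == j else -1.0) for j in range(5)] for i in range(5)]
model = MLP(5, [4, 5])
lr, n_epochs, losses = 0.05, 300, []
for epoch in range(n_epochs):
    preds = [model(x) for x in xs]
    loss = sum((p - y) ** 2 for pred, target in zip(preds, ys) for p, y in zip(pred, target))
    for p in model.parameters():
        p.grad = 0.0  # zero before backward, otherwise gradients accumulate across epochs
    loss.backward()
    for p in model.parameters():
        p.data -= lr * p.grad
    losses.append(loss.data)
```

The `losses` list is what you would hand to matplotlib for loss.png. The forward pass is rebuilt every epoch on purpose: each backward() call needs a fresh graph that reflects the updated parameter values.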

Block D — assert success

In your train.py (or a separate verify.py):

  • Assert final loss < 0.5 (5 outputs × 5 examples = 25 logits; loose per-logit target ~0.02).
  • Assert argmax(model(x)) equals the input's argmax for all 5 inputs.

If either assertion fails, your training didn't converge. Diagnose:

  • Loss not decreasing? Likely a backward bug. Re-run unit tests.
  • Loss decreasing then exploding? lr too high. Try 0.01.
  • Loss decreasing slowly? lr too low or hidden layer too small.
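The two Block D assertions can be written as a small helper. This is a sketch on plain floats; `argmax` and `check_run` are hypothetical names, and the numbers at the bottom are made up for illustration, not results from a real run. In the lab you would pass `[v.data for v in model(x)]` for each input.

```python
def argmax(values):
    """Index of the largest element (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def check_run(final_loss, outputs, inputs, max_loss=0.5):
    """Raise AssertionError if the run misses either Block D criterion."""
    assert final_loss < max_loss, f"final loss {final_loss:.4f} is not below {max_loss}"
    for x, out in zip(inputs, outputs):
        assert argmax(out) == argmax(x), f"wrong tense for input {x}: got {out}"


# Hypothetical 3-way example, for illustration only.
inputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
outputs = [[0.9, -0.8, -0.9], [-0.7, 0.95, -0.9], [-0.9, -0.8, 0.85]]
check_run(0.12, outputs, inputs)
```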

TODOs (experiment 2: visualize)

Block E — graph visualization

In experiments/07-visualize-graph/viz.py:

  • Pick a small expression: e.g., the diamond example from theory/03: L = (a*b + c)*(a - c) with a=2, b=3, c=4.
  • Build it with minigrad.scalar. Call L.backward().
  • Use graphviz (Python binding) to construct a Digraph. For each Value node, add a node labeled with {op | data | grad}. Add edges from each _prev to the node.
  • Render as SVG: dot.render('graph', format='svg', cleanup=True).
  • Save graph.svg.
  • Print the path. Open in browser. Confirm:
  • Nodes show data and grad.
  • Diamond shape visible: a has two outgoing edges.

This is the visualization the spec calls for in §4 PHASE 7.
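A possible shape for viz.py is sketched below. The `Value` class is again a stand-in (an assumption covering only `+`, `-`, `*`), and `trace` and `build_digraph` are hypothetical helper names. The `graphviz` import is kept inside `build_digraph` so the traversal and the gradient check run even where the binding is not installed.

```python
class Value:
    """Stand-in for minigrad.scalar.Value with only the ops this expression needs."""

    def __init__(self, data, _prev=(), _op=""):
        self.data, self.grad = float(data), 0.0
        self._prev, self._op = _prev, _op
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other), "-")
        def _backward():
            self.grad += out.grad
            other.grad -= out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        nodes, _ = trace(self)
        self.grad = 1.0
        for v in reversed(nodes):
            v._backward()


def trace(root):
    """Return (nodes in topological order, list of (child, parent) edges)."""
    nodes, edges, seen = [], [], set()
    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for child in v._prev:
            edges.append((child, v))
            visit(child)
        nodes.append(v)
    visit(root)
    return nodes, edges


def build_digraph(root):
    """Build a graphviz.Digraph for the expression; needs the graphviz Python package."""
    import graphviz  # imported lazily so the rest of the sketch runs without it
    dot = graphviz.Digraph(graph_attr={"rankdir": "LR"})
    nodes, edges = trace(root)
    for v in nodes:
        label = "{ %s | data %.2f | grad %.2f }" % (v._op or "leaf", v.data, v.grad)
        dot.node(str(id(v)), label=label, shape="record")
    for child, parent in edges:
        dot.edge(str(id(child)), str(id(parent)))
    return dot


a, b, c = Value(2), Value(3), Value(4)
L = (a * b + c) * (a - c)
L.backward()
nodes, edges = trace(L)
```

To produce the file, call `build_digraph(L).render('graph', format='svg', cleanup=True)`, which also needs the system `dot` binary. The values to look for in the rendered labels are L = -20, with gradients 4 for a, -4 for b and -12 for c. With this stand-in the graph has 7 nodes and 8 edges; your own `Value` may produce more if it implements subtraction as addition of a negation.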

Block F — manifest for both

Standard schema per Phase 6 lab 00. Include in config:

  • For tense identity: hyperparams (lr, epochs, layer_sizes, seed, init range, verb chosen).
  • For viz: which expression was rendered, graphviz version.

Constraints

  • Only minigrad.scalar. No NumPy in the model or training loop. Lists of Value only.
  • Use Phase 6 utilities. seed_everything(42), get_logger(__name__) — no print.
  • graphviz must be installed. On Fedora: dnf install graphviz for the system package + pip install graphviz for the Python binding.
  • Reproducible. Same seed should produce the same final loss (within ~1% for floating-point noise).
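The reproducibility constraint comes down to seeding the random number generator before any weight is drawn. The sketch below shows the effect with Python's `random.seed` as a stand-in; what `seed_everything` from Phase 6 does beyond this is not shown here, and `init_weights` is a hypothetical helper.

```python
import random


def init_weights(seed, n=5, low=-1.0, high=1.0):
    """Draw n initial weights after seeding Python's RNG."""
    random.seed(seed)
    return [random.uniform(low, high) for _ in range(n)]


same_a = init_weights(42)
same_b = init_weights(42)  # same seed, identical weights
other = init_weights(43)   # different seed, different weights
```

Seeding has to happen before the MLP is constructed; seeding after `MLP(5, [4, 5])` has already drawn its weights does not make the run reproducible.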

Expected results

  • Final loss should reach ~0.05–0.5 within 300 epochs (summed over 25 logits, i.e. roughly 0.002–0.02 per logit).
  • All 5 predictions should have argmax equal to the input's argmax.
  • graph.svg should clearly show the diamond pattern (a having two outgoing arrows).

Stop conditions

Done when:

  1. loss.png shows monotone-decreasing loss to below 0.5.
  2. All 5 tense-identity predictions correct (argmax matches).
  3. graph.svg rendered, opens in a browser, visually correct.
  4. Both manifest.json files exist with expected schema.

Pitfalls

  • Forgot to zero gradients. Loss explodes after the first epoch. Add p.grad = 0.0 before each backward().
  • Initialised weights to zero. All neurons compute the same output; no symmetry breaking; model doesn't learn. Initialise to random.uniform(-1, 1) or similar.
  • Used 0/1 labels with tanh outputs. tanh outputs lie in (-1, 1); with {0, 1} targets the model uses only half of that range, the gap between "on" and "off" outputs is halved, and training is sluggish. Use {-1, 1} labels.
  • No seeding. Run-to-run variance is huge for tiny models. Call seed_everything(42) at top of train.py.
  • tanh saturation. tanh(very_large) is fine (saturates at ±1), but (1 - tanh²) gradient at saturation is ≈0 — vanishing gradients. With a 4-neuron hidden layer and lr=0.05 this rarely bites; if it does, lower the init range to random.uniform(-0.5, 0.5).
  • Graphviz not installed. Two layers: system package (dot command) and Python binding (pip install graphviz). Both must be present. Test with dot -V from shell.
  • Treating the 5 outputs as one scalar. Each model(x) returns a list of 5 Values, not a Value. Sum over both example and output axis when computing the loss; one forgotten loop here is the most common bug in this lab.
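The tanh saturation pitfall above is easy to see numerically. The sketch below evaluates the local derivative 1 - tanh²(x) at a few pre-activations using only the standard library; `tanh_grad` is a hypothetical helper name.

```python
import math


def tanh_grad(x):
    """Local derivative of tanh at pre-activation x: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


# (pre-activation, output, local gradient) for increasingly large inputs
table = [(x, math.tanh(x), tanh_grad(x)) for x in (0.0, 1.0, 2.0, 5.0)]
```

The gradient is exactly 1 at x = 0 and shrinks towards 0 as |x| grows, which is why large initial weights, and therefore large pre-activations, slow learning down.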

When to consult solutions/

After your tense experiment converges and graph.svg exists. Then solutions/02-train-tense-logits-ref.md (at phase open) provides the reference loss curve and visualization comparison.


End of Phase 7 labs. Next: write PHASE_07_REPORT.md and learners/borja/phase-07/reflections.md.