
Lab 02 — Train a scalar MLP on a tiny tense-identity task using only `minigrad.scalar`¶

Goal: build a 2-layer MLP from Value neurons, train it on a microscopic grammar-tense task, and watch the loss curve descend. This is about as small as an end-to-end ML training run gets, and it is anchored in the §A13 verb-grammar domain.

Estimated time: 90–120 minutes.

Prereqs: lab 00, lab 01 (all ops implemented and tested).

The task (the §A13 anchor)¶

Pick one verb — let's say work. Its 5 tenses are:

index	tense	English form	Spanish
0	infinitive	`(to) work`	trabajar
1	present (3rd sg)	`works`	trabaja
2	past simple	`worked`	trabajó
3	past participle	`worked`	trabajado
4	future (will)	`will work`	trabajará

The task is the 5-way tense-identity mapping: given a 5-dim one-hot input encoding "which tense is this", produce a 5-dim output whose argmax equals the input's argmax. It is artificially simple — a perfect autoencoder for a one-hot — but that is exactly the point: every nontrivial component of the model (autograd, parameters, loss, training loop) must be correct for the network to learn this. Any failure is a Phase-7 bug, not a hard-problem bug.

What you produce¶

A directory experiments/07-train-tense-logits/ containing:

model.py — Neuron, Layer, MLP classes built from Value. ~60 lines.
train.py — the training loop. ~40 lines.
loss.png — loss curve over training.
predictions.json — the trained MLP's outputs on the 5 inputs.
manifest.json — standard schema.
README.md — what you trained, what loss you reached, how long it took, what you'd improve.

Plus a separate directory experiments/07-visualize-graph/:

viz.py — builds a small expression, renders the DAG via graphviz, saves as SVG.
graph.svg — the rendered graph, nodes labeled with forward data and backward grad.
manifest.json.

TODOs (experiment 1: tense identity)¶

Block A — `Neuron`, `Layer`, `MLP`¶

In model.py:

Neuron: takes n_in inputs. Owns w: list[Value] of length n_in (randomly initialised to small values, e.g., from random.uniform(-1, 1)) and b: Value (initialised to 0). __call__(self, xs: list[Value]) returns (sum(wᵢ · xᵢ) + b).tanh().
Layer: takes n_in, n_out. Owns neurons: list[Neuron]. __call__(self, xs) returns the list of each neuron's output.
MLP: takes n_in, layer_sizes: list[int]. Owns layers: list[Layer]. __call__(self, xs) chains them. A multi-output last layer (the case here — 5 outputs) returns the list, not a single Value.
Each of the three classes also exposes a parameters(self) method that returns a flat list of Value covering all of its weights and biases. Phase 9 introduces Parameter; for Phase 7 just collect Values manually.
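The class layout above can be sketched as follows. This is one possible shape, not the reference solution. The `Value` class here is a forward-only stand-in (an assumption, included only so the snippet runs alone); in the lab you would import your own from `minigrad.scalar`, which also tracks the graph for `backward()`.

```python
import math
import random


class Value:
    """Forward-only stand-in so this sketch runs on its own.
    In the lab, use `from minigrad.scalar import Value` instead."""

    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def tanh(self):
        return Value(math.tanh(self.data))


class Neuron:
    def __init__(self, n_in):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]
        self.b = Value(0.0)

    def __call__(self, xs):
        # sum(w_i * x_i) + b, then tanh; the sum starts at b instead of 0
        return sum((w * x for w, x in zip(self.w, xs)), self.b).tanh()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, n_in, n_out):
        self.neurons = [Neuron(n_in) for _ in range(n_out)]

    def __call__(self, xs):
        return [n(xs) for n in self.neurons]

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP:
    def __init__(self, n_in, layer_sizes):
        sizes = [n_in] + layer_sizes
        self.layers = [Layer(a, b) for a, b in zip(sizes, sizes[1:])]

    def __call__(self, xs):
        for layer in self.layers:
            xs = layer(xs)
        return xs  # a list of Values, one per output

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(42)
model = MLP(5, [4, 5])
out = model([Value(1.0), Value(0.0), Value(0.0), Value(0.0), Value(0.0)])
n_params = len(model.parameters())  # (5*4 + 4) + (4*5 + 5) = 49
```

Counting parameters is a cheap sanity check: if `len(model.parameters())` is not 49 for `MLP(5, [4, 5])`, a weight or bias is missing from the update step.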

Block B — Tense dataset¶

The 5 input/target pairs are the 5 one-hot tense vectors for work:

input                target
(1, 0, 0, 0, 0)  →  (1, -1, -1, -1, -1)    # infinitive  / to work
(0, 1, 0, 0, 0)  →  (-1, 1, -1, -1, -1)    # present 3sg / works
(0, 0, 1, 0, 0)  →  (-1, -1, 1, -1, -1)    # past simple / worked
(0, 0, 0, 1, 0)  →  (-1, -1, -1, 1, -1)    # participle  / worked
(0, 0, 0, 0, 1)  →  (-1, -1, -1, -1, 1)    # future      / will work

Encode as xs: list[list[Value]] (5 inputs, each a 5-vector of Values) and ys: list[list[Value]] (5 targets).
Use tanh activations in the model. With tanh, the "0 label" is best encoded as -1 (since tanh outputs are in (-1, 1)). Use {-1, 1} encoding for the targets.
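A minimal sketch of the encoding, using plain floats so it runs without minigrad. The names `TENSES`, `one_hot`, and `to_pm1` are illustrative, not part of the lab's API.

```python
# Tense order follows the table in "The task" (index 0..4).
TENSES = ["infinitive", "present_3sg", "past_simple", "past_participle", "future"]


def one_hot(index, size):
    """Return a list of floats with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def to_pm1(vec):
    """Map a {0, 1} vector to {-1, 1}: 0 -> -1, 1 -> +1."""
    return [2.0 * v - 1.0 for v in vec]


xs_raw = [one_hot(k, len(TENSES)) for k in range(len(TENSES))]
ys_raw = [to_pm1(x) for x in xs_raw]

# In the lab, wrap every float in a Value, for example:
#   xs = [[Value(v) for v in row] for row in xs_raw]
#   ys = [[Value(v) for v in row] for row in ys_raw]
```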

The task is deliberately trivial: given a one-hot for "which verb tense", return that same one-hot. The point is not to learn grammar (Phase 9 does that with the full grid) but to confirm that your autograd, your loss, and your training loop work end to end.

Block C — training loop¶

In train.py:

Instantiate model = MLP(5, [4, 5]) — 5 inputs (one-hot tense), one hidden layer of 4 neurons, 5 outputs.
Hyperparams: lr = 0.05, n_epochs = 300.
For each epoch:
Compute predictions: preds = [model(x) for x in xs]. Each pred is a list of 5 Values.
Compute loss: loss = sum((p - y)**2 for pred, target in zip(preds, ys) for p, y in zip(pred, target)). (Sum of squared errors over the 25 logits = 5 inputs × 5 outputs.) Loss is a Value.
Zero gradients on all parameters: for p in model.parameters(): p.grad = 0.0.
loss.backward().
Update parameters: for p in model.parameters(): p.data -= lr * p.grad.
Log epoch, loss.data to a structured logger (using Phase 6's get_logger).
Plot loss vs epoch with matplotlib. Save as loss.png.
After training, run model(x) for each tense one-hot. Save the outputs as predictions.json.
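The order of the steps inside an epoch (forward, zero grads, backward, update) can be sketched on a reduced problem. To stay short, this example trains a single 5-to-5 tanh layer held in plain lists rather than the full `MLP(5, [4, 5])`, and it ships its own small `Value` with backward support. Both are assumptions made so the snippet runs alone; your `minigrad.scalar.Value` and your `MLP` replace them in the lab. Losses are collected in a list here, whereas the lab logs them through `get_logger`.

```python
import math
import random


class Value:
    """Small autograd stand-in; in the lab, import Value from minigrad.scalar."""

    def __init__(self, data, _prev=()):
        self.data = float(data)
        self.grad = 0.0
        self._prev = _prev
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._prev:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


random.seed(42)
xs = [[1.0 if i == k else 0.0 for i in range(5)] for k in range(5)]
ys = [[1.0 if i == k else -1.0 for i in range(5)] for k in range(5)]

# One tanh layer: W[j][i] is the weight from input i to output j.
W = [[Value(random.uniform(-1, 1)) for _ in range(5)] for _ in range(5)]
b = [Value(0.0) for _ in range(5)]
params = [w for row in W for w in row] + b


def forward(x):
    return [sum((w * xi for w, xi in zip(row, x)), bj).tanh() for row, bj in zip(W, b)]


lr, n_epochs = 0.05, 300
losses = []
for epoch in range(n_epochs):
    preds = [forward(x) for x in xs]           # 5 lists of 5 Values
    loss = Value(0.0)
    for pred, target in zip(preds, ys):        # loop over examples
        for p, y in zip(pred, target):         # loop over outputs
            d = p + (-y)
            loss = loss + d * d
    for p in params:                           # zero BEFORE backward
        p.grad = 0.0
    loss.backward()
    for p in params:                           # plain gradient descent
        p.data -= lr * p.grad
    losses.append(loss.data)
```

Swapping `forward` for `model` and `params` for `model.parameters()` gives the loop described in the steps above; the two nested `for` loops over examples and outputs are the part that is easiest to get wrong.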

Block D — assert success¶

In your train.py (or a separate verify.py):

Assert final loss < 0.5 (5 outputs × 5 examples = 25 logits; loose per-logit target ~0.02).
Assert argmax(model(x)) equals the input's argmax for all 5 inputs.
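A possible shape for the two checks, written over plain floats so it can also be run against the contents of predictions.json. The helper names and the sample numbers are made up for illustration; with live Values you would pass `[v.data for v in model(x)]`.

```python
def argmax(values):
    """Index of the largest element in a list of floats."""
    return max(range(len(values)), key=lambda i: values[i])


def verify(final_loss, inputs, outputs, max_loss=0.5):
    """Raise AssertionError with a readable message if either check fails."""
    assert final_loss < max_loss, f"final loss {final_loss:.4f} is not below {max_loss}"
    for x, out in zip(inputs, outputs):
        assert argmax(out) == argmax(x), (
            f"input index {argmax(x)} predicted as index {argmax(out)}"
        )
    return True


# Illustrative values only, not real training output.
sample_inputs = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
sample_outputs = [
    [0.93, -0.91, -0.95, -0.90, -0.92],
    [-0.94, 0.90, -0.90, -0.93, -0.91],
]
ok = verify(0.21, sample_inputs, sample_outputs)
```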

If either assertion fails, your training didn't converge. Diagnose:

Loss not decreasing? Likely a backward bug. Re-run unit tests.
Loss decreasing then exploding? lr too high. Try 0.01.
Loss decreasing slowly? lr too low or hidden layer too small.

TODOs (experiment 2: visualize)¶

Block E — graph visualization¶

In experiments/07-visualize-graph/viz.py:

Pick a small expression: e.g., the diamond example from theory/03: L = (a*b + c)*(a - c) with a=2, b=3, c=4.
Build it with minigrad.scalar. Call L.backward().
Use graphviz (Python binding) to construct a Digraph. For each Value node, add a node labeled with {op | data | grad}. Add edges from each _prev to the node.
Render as SVG: dot.render('graph', format='svg', cleanup=True).
Save graph.svg.
Print the path. Open in browser. Confirm:
Nodes show data and grad.
Diamond shape visible: a has two outgoing edges.
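To know what numbers the rendered nodes should show, the expected values for the diamond expression can be worked out by hand and cross-checked with a central finite difference. This sketch uses plain floats and no autograd; the helper name `central_diff` is illustrative. Note that c also feeds two nodes (a*b + c and a - c), so it has two outgoing edges as well.

```python
def f(a, b, c):
    return (a * b + c) * (a - c)


a, b, c = 2.0, 3.0, 4.0
L = f(a, b, c)                       # (2*3 + 4) * (2 - 4) = 10 * -2 = -20

# Analytic partials via the product rule; a and c each reach L by two paths.
dL_da = b * (a - c) + (a * b + c)    # 3*(-2) + 10 = 4
dL_db = a * (a - c)                  # 2*(-2)      = -4
dL_dc = (a - c) - (a * b + c)        # -2 - 10     = -12


def central_diff(index, h=1e-6):
    """Numerical partial derivative of f with respect to argument `index`."""
    hi = [a, b, c]
    lo = [a, b, c]
    hi[index] += h
    lo[index] -= h
    return (f(*hi) - f(*lo)) / (2 * h)


numeric = [central_diff(i) for i in range(3)]
```

After `L.backward()`, the grad fields on the a, b, and c nodes in graph.svg should match 4, -4, and -12.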

This is the visualization the spec calls for in §4 PHASE 7.

Block F — manifest for both¶

Standard schema per Phase 6 lab 00. Include in config:

For tense identity: hyperparams (lr, epochs, layer_sizes, seed, init range, verb chosen).
For viz: which expression was rendered, graphviz version.

Constraints¶

Only minigrad.scalar. No NumPy in the model or training loop. Lists of Value only.
Use Phase 6 utilities. seed_everything(42), get_logger(__name__) — no print.
graphviz must be installed. On Fedora: dnf install graphviz for the system package + pip install graphviz for the Python binding.
Reproducible. Same seed should produce the same final loss (within ~1% for floating-point noise).
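A quick way to see what the reproducibility constraint buys you, sketched with the standard library only. It assumes that `seed_everything` seeds Python's `random` module among others; check your Phase 6 implementation to confirm.

```python
import random


def init_weights(seed, n=5):
    # Assumption: seed_everything(seed) does at least this for the stdlib RNG.
    random.seed(seed)
    return [random.uniform(-1, 1) for _ in range(n)]


first = init_weights(42)
second = init_weights(42)
other = init_weights(43)
same_seed_matches = first == second
different_seed_differs = first != other
```

If two runs with the same seed start from different weights, look for a random draw that happens before the seeding call.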

Expected results¶

Final loss should reach ~0.05–0.5 within 300 epochs (summed over 25 logits, that is roughly 0.002–0.02 per logit).
All 5 predictions should have argmax equal to the input's argmax.
graph.svg should clearly show the diamond pattern (a having two outgoing arrows).

Stop conditions¶

Done when:

loss.png shows monotone-decreasing loss to below 0.5.
All 5 tense-identity predictions correct (argmax matches).
graph.svg rendered, opens in a browser, visually correct.
Both manifest.json files exist with expected schema.

Pitfalls¶

Forgot to zero gradients. Loss explodes after the first epoch. Add p.grad = 0.0 before each backward().
Initialised weights to zero. All neurons compute the same output; no symmetry breaking; model doesn't learn. Initialise to random.uniform(-1, 1) or similar.
Used 0/1 labels with tanh outputs. tanh outputs lie in (-1, 1); with {0, 1} targets the model uses only half of that range, so the gap between the "on" and "off" outputs is half as wide and training tends to be sluggish. Use {-1, 1} labels.
No seeding. Run-to-run variance is huge for tiny models. Call seed_everything(42) at top of train.py.
tanh saturation. tanh(very_large) is fine (saturates at ±1), but (1 - tanh²) gradient at saturation is ≈0 — vanishing gradients. With a 4-neuron hidden layer and lr=0.05 this rarely bites; if it does, lower the init range to random.uniform(-0.5, 0.5).
Graphviz not installed. Two layers: system package (dot command) and Python binding (pip install graphviz). Both must be present. Test with dot -V from shell.
Treating the 5 outputs as one scalar. Each model(x) returns a list of 5 Values, not a Value. Sum over both example and output axis when computing the loss; one forgotten loop here is the most common bug in this lab.
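For the tanh saturation pitfall, it helps to see how quickly the local gradient 1 - tanh² shrinks as the pre-activation grows. A small numeric sketch (the function name `tanh_grad` is illustrative):

```python
import math


def tanh_grad(z):
    """Local derivative of tanh at pre-activation z: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t


grads = {z: tanh_grad(z) for z in (0.0, 1.0, 2.0, 5.0)}
# Roughly: z=0 -> 1.0, z=1 -> 0.42, z=2 -> 0.071, z=5 -> 0.00018
```

A neuron whose pre-activation sits near 5 passes back less than a thousandth of the gradient it receives, which is why a smaller init range helps if training stalls.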

When to consult `solutions/`¶

After your tense experiment converges and graph.svg exists. Then solutions/02-train-tense-logits-ref.md (at phase open) provides the reference loss curve and visualization comparison.

End of Phase 7 labs. Next: write PHASE_07_REPORT.md and learners/borja/phase-07/reflections.md.
