Lab 03: Train a tense-classifier MLP
Goal: close Phase 09 by training an end-to-end MLP on a 5-way tense-classification task built from the §A13 verb-grammar grid. Input: a 64-dim embedding lookup over the 100 (verb, tense) forms of the 20 verbs (mocked; Phase 13 builds the real one). Output: 5-class logits over (infinitive, present-3sg, past-simple, past-participle, future-will). Hand-rolled training loop, no Trainer class (that arrives in Phase 18). Target: > 90% train accuracy within 100 epochs.
Estimated time: 120–180 minutes.
Prereqs: Labs 00, 01, 02 closed. Theory 02 + 03 read.
This is the first time in the curriculum that the model learns something about the language. The problem has the simplest possible shape (identity over 100 examples, no generalization) because the pedagogical goal is for you to see the training loop work end-to-end with pieces you built yourself: Tensor, Module, Linear, CrossEntropyLoss, Adam. Generalization comes in Phase 18; real grammar, in Phase 12+.
What you produce
- experiments/09-tense-mlp-lab03/train.py: the training script.
- experiments/09-tense-mlp-lab03/manifest.json: versions + seed + config (per CLAUDE.md §0.5).
- experiments/09-tense-mlp-lab03/loss_curve.png: train loss over epochs.
(experiments/09-tense-mlp/ from PHASE_09_PLAN.md §3 is the richer one — one-hot verb ⊕ one-hot person, 23-dim input. This lab is the smaller variant — embedding-input only, no person — kept simpler so you focus on the loop, not the data pipeline.)
Data spec
20 verbs × 5 tense forms = 100 examples. Persons are NOT in this lab's input — Lab 03 classifies the tense form of a single verb conjugation, not the agreement with a subject.
The 20 verbs (§A13):
- Regular (12): work, play, walk, talk, listen, watch, study, finish, start, look, want, like.
- Irregular (8): be, have, do, go, come, see, eat, write.
The 5 tense classes (output labels 0–4):
| label | name | example for "work" |
|---|---|---|
| 0 | infinitive | work |
| 1 | present-3sg | works |
| 2 | past-simple | worked |
| 3 | past-participle | worked |
| 4 | future-will | will work |
Note: for regular verbs, labels 2 and 3 share the surface form (worked). The model can still separate them here because each (verb, tense) pair gets its own mocked embedding (see below): the input is unique per example, so the task is memorization, not surface-form disambiguation. Document this in your journal: real disambiguation needs a sentence context (Phase 17).
The embedding lookup (MOCKED)
Phase 13 builds the real embedding layer. For Lab 03, mock it with a deterministic random table:
# In data/grammar/embedding_mock.py (or inline in train.py for the lab):
import numpy as np

EMBEDDING_DIM = 64
NUM_FORMS = 100  # 20 verbs × 5 tenses

def build_mock_embedding_table(seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Each of the 100 (verb, tense) pairs gets a fixed 64-dim vector.
    return rng.standard_normal((NUM_FORMS, EMBEDDING_DIM)).astype(np.float32)
Each training example is a row of this table; the label is the tense index (0–4) of that row. This is identity classification — the input uniquely determines the output. That's the point: confirm the machinery learns identity, before Phase 18 introduces the harder, generalizing tasks.
(input, label) enumeration
def enumerate_examples(table: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # X: (100, 64) float32
    # y: (100,) int64, values in {0, 1, 2, 3, 4}
    X = table
    y = np.tile(np.arange(5, dtype=np.int64), 20)  # 0,1,2,3,4,0,1,2,3,4,...
    return X, y
Indexing convention: row 5*v + t is verb v, tense t. Document in manifest.json.
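The indexing convention can be made concrete with a pair of helpers. This is a minimal sketch: the function names are hypothetical, and the verb order simply follows the order in which the §A13 list is written above, which is an assumption rather than something this lab specifies.

```python
# Assumed ordering: verbs in the order listed in the data spec (regular, then irregular).
VERBS = [
    "work", "play", "walk", "talk", "listen", "watch",
    "study", "finish", "start", "look", "want", "like",        # regular (12)
    "be", "have", "do", "go", "come", "see", "eat", "write",   # irregular (8)
]
TENSES = ["infinitive", "present-3sg", "past-simple", "past-participle", "future-will"]


def row_index(v: int, t: int) -> int:
    """Row of the mock table for verb index v and tense index t (row = 5*v + t)."""
    return len(TENSES) * v + t


def decode_row(row: int) -> tuple[str, str]:
    """Inverse of row_index: recover (verb, tense name) from a table row."""
    v, t = divmod(row, len(TENSES))
    return VERBS[v], TENSES[t]


if __name__ == "__main__":
    print(row_index(1, 2), decode_row(7))
```

Because the tense index is the remainder modulo 5, the label of any row is just row % 5, which matches the np.tile pattern used in enumerate_examples.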
Model
A 2-layer MLP: Linear(64 → 32) → ReLU → Linear(32 → 5).
Parameter count: 64·32 + 32 + 32·5 + 5 = 2048 + 32 + 160 + 5 = 2245. That is over-parameterized for 100 examples (about 22 parameters per example), which is exactly what we want so the identity task memorizes cleanly.
class TenseMLP(Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc1 = Linear(64, 32)
        self.act = ReLU()
        self.fc2 = Linear(32, 5)

    def forward(self, x: Tensor) -> Tensor:
        # TODO: x → fc1 → act → fc2 → logits (shape (B, 5))
        raise NotImplementedError
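Before wiring the model into the loop, it can help to sanity-check the parameter count and the expected shapes in plain NumPy. The sketch below is independent of minitorch; the helper names are hypothetical, and the (in, out) weight layout is an assumption that may differ from how your Linear stores its weights.

```python
import numpy as np

IN_DIM, HIDDEN, NUM_CLASSES = 64, 32, 5


def count_params(in_dim: int, hidden: int, num_classes: int) -> int:
    """Weights plus biases of two dense layers."""
    fc1 = in_dim * hidden + hidden
    fc2 = hidden * num_classes + num_classes
    return fc1 + fc2


def forward_shapes(batch: int) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Run a NumPy stand-in for fc1 -> ReLU -> fc2 and report the shapes."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal((batch, IN_DIM)).astype(np.float32)
    w1 = rng.standard_normal((IN_DIM, HIDDEN)).astype(np.float32)   # assumed (in, out) layout
    b1 = np.zeros(HIDDEN, dtype=np.float32)
    w2 = rng.standard_normal((HIDDEN, NUM_CLASSES)).astype(np.float32)
    b2 = np.zeros(NUM_CLASSES, dtype=np.float32)
    hidden = np.maximum(x @ w1 + b1, 0.0)   # ReLU
    logits = hidden @ w2 + b2
    return hidden.shape, logits.shape


if __name__ == "__main__":
    print(count_params(IN_DIM, HIDDEN, NUM_CLASSES))
    print(forward_shapes(100))
```

If sum(p.data.size for p in model.parameters()) on your own model disagrees with count_params, a missing bias is the usual suspect (that attribute access is itself an assumption about your Tensor class).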
TODOs
Block A: Loss (CrossEntropyLoss)
Either implement it as a Module in src/minimodel/nn/losses.py (preferred — PHASE_09_PLAN.md §3 lists it), or inline the math in train.py for this lab and refactor later.
Mathematically: CE(logits, y) = -log_softmax(logits)[y], averaged over the batch.
Use the log-sum-exp trick for numerical stability — same reason as Softmax in Lab 01:
class CrossEntropyLoss(Module):
    def forward(self, logits: Tensor, targets: Tensor) -> Tensor:
        # TODO:
        # 1. Shift logits by row-max for numeric stability.
        # 2. log_softmax = shifted - log(sum(exp(shifted), dim=-1, keepdim=True)).
        # 3. Gather the per-row log-softmax at the target index (advanced indexing).
        # 4. Return -mean of the gathered values.
        raise NotImplementedError
If minitorch.Tensor lacks gather/advanced-indexing, drop down to a manual one-hot multiplication:
# one_hot: (B, 5) with 1 at the target column, 0 elsewhere.
# loss = -(one_hot * log_softmax).sum(dim=-1).mean()
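As a cross-check for your own implementation, here is a NumPy-only reference of the same math. It is a sketch, not the minitorch module: the function names are hypothetical, and it only illustrates that the gather form and the one-hot form agree and that the row-max shift keeps extreme logits finite.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax using the max-shift (log-sum-exp) trick."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # out-of-place shift
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def ce_gather(logits: np.ndarray, targets: np.ndarray) -> float:
    """Cross-entropy via advanced indexing: pick log p[target] per row."""
    ls = log_softmax(logits)
    picked = ls[np.arange(len(targets)), targets]
    return float(-picked.mean())


def ce_one_hot(logits: np.ndarray, targets: np.ndarray) -> float:
    """Same loss via one-hot multiplication, for frameworks without gather."""
    ls = log_softmax(logits)
    one_hot = np.eye(logits.shape[-1], dtype=ls.dtype)[targets]
    return float(-(one_hot * ls).sum(axis=-1).mean())


if __name__ == "__main__":
    logits = np.array([[1000.0, 0.0, -1000.0], [0.0, 0.0, 0.0]])
    targets = np.array([0, 2])
    print(ce_gather(logits, targets), ce_one_hot(logits, targets))
```

Without the shift, exp(1000.0) overflows to inf and the loss becomes nan; with it, the largest exponent in every row is exactly 0.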
Block B: Training loop
def train(seed: int = 0) -> dict[str, list[float]]:
    seed_everything(seed)
    table = build_mock_embedding_table(seed=seed)
    X_np, y_np = enumerate_examples(table)
    # Full-batch training: 100 examples fit in memory trivially.
    X = Tensor(X_np)
    y = Tensor(y_np)  # integer labels; CE handles the gather
    model = TenseMLP()
    loss_fn = CrossEntropyLoss()
    opt = Adam(model.parameters(), lr=1e-2)
    history: dict[str, list[float]] = {"loss": [], "acc": []}
    for epoch in range(100):
        opt.zero_grad()
        logits = model(X)  # (100, 5)
        loss = loss_fn(logits, y)
        loss.backward()
        opt.step()
        # Accuracy (computed from this epoch's pre-update logits):
        preds = np.argmax(logits.data, axis=-1)  # (100,)
        acc = float((preds == y_np).mean())
        history["loss"].append(float(loss.data))
        history["acc"].append(acc)
        if epoch % 10 == 0:
            print(f"epoch {epoch:3d} loss={loss.data:.4f} acc={acc:.3f}")
    return history
- Wrap train() in a __main__ block that also writes manifest.json (seed, versions of minimodel, minitorch, numpy, Python; config dict).
- Plot history["loss"] and history["acc"] to loss_curve.png.
Block C: Manifest
Per CLAUDE.md §0.5:
import json
import platform
import sys

import numpy

manifest = {
    "seed": seed,
    "versions": {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": numpy.__version__,
        # "minimodel": ..., "minitorch": ...
    },
    "config": {
        "embedding_dim": 64,
        "num_classes": 5,
        "num_examples": 100,
        "epochs": 100,
        "optimizer": "Adam",
        "lr": 1e-2,
        "hidden": 32,
        "loss": "CrossEntropyLoss",
    },
    "final_loss": history["loss"][-1],
    "final_acc": history["acc"][-1],
}
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
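Two other parts of this lab ask for manifest entries: the indexing convention (row 5*v + t) and a dependencies key for the mocked embeddings. A minimal sketch of adding them and writing into the experiment directory follows; the helper names and the "indexing" key name are hypothetical, while the dependencies payload is the one given under Constraints.

```python
import json
from pathlib import Path


def add_lab_notes(manifest: dict) -> dict:
    """Return a copy of the manifest with the lab-specific notes attached."""
    out = dict(manifest)
    out["indexing"] = "row 5*v + t is verb v, tense t"   # hypothetical key name
    out["dependencies"] = {"phase-13-embeddings": "mocked-by-random-table"}
    return out


def write_manifest(manifest: dict, out_dir: Path) -> Path:
    """Write manifest.json under out_dir, creating the directory if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


if __name__ == "__main__":
    demo = add_lab_notes({"seed": 0})
    print(write_manifest(demo, Path("experiments/09-tense-mlp-lab03")))
```

Writing through an explicit out_dir avoids depending on the current working directory, which the bare open("manifest.json", "w") does.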
Block D: Acceptance check
- Run python train.py. The final epoch's accuracy must be > 0.90.
- The loss curve must be broadly decreasing: small ripples are OK with lr=1e-2, but large oscillations mean the LR is too high.
- Commit loss_curve.png and manifest.json to experiments/09-tense-mlp-lab03/.
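"Broadly decreasing" is a judgment call when reading the plot. One way to make it checkable is to compare window averages instead of individual epochs. The heuristic below is an assumption of this sketch, not part of the acceptance criteria; the function name, window size, and tolerance are all hypothetical.

```python
def is_broadly_decreasing(losses: list[float], window: int = 10, tolerance: float = 0.0) -> bool:
    """True if each window-mean is no higher than the previous one (plus tolerance).

    Small epoch-to-epoch ripples average out inside a window; large
    oscillations show up as a window whose mean rises.
    """
    if len(losses) < 2 * window:
        raise ValueError("need at least two full windows of losses")
    means = [
        sum(losses[i:i + window]) / window
        for i in range(0, len(losses) - window + 1, window)
    ]
    return all(later <= earlier + tolerance for earlier, later in zip(means, means[1:]))


if __name__ == "__main__":
    rippled = [1.6 * 0.97 ** i + (0.01 if i % 2 else -0.01) for i in range(100)]
    print(is_broadly_decreasing(rippled))
```

Called on history["loss"] after training, this gives a yes/no answer that can sit next to the accuracy threshold.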
Block E: Sanity tests (in tests/test_train_tense_mlp.py)
- test_table_shape: build_mock_embedding_table() returns (100, 64) float32.
- test_label_distribution: each of the 5 classes has exactly 20 examples.
- test_cross_entropy_against_manual: hand-compute CE on a (2, 3) logits batch with labels [0, 2]; assert the module's output matches within 1e-6.
- test_one_epoch_reduces_loss: a single training step on a fresh model reduces the loss (strict inequality). Catches dead-on-arrival bugs.
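The first two tests need nothing beyond NumPy, so they can be sketched in full. To keep the sketch self-contained, the two data helpers are copied from the data spec above; in the real test file you would presumably import them from train.py instead.

```python
import numpy as np

EMBEDDING_DIM = 64
NUM_FORMS = 100  # 20 verbs × 5 tenses


def build_mock_embedding_table(seed: int = 0) -> np.ndarray:
    # Copied from the lab's data spec so this file runs on its own.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((NUM_FORMS, EMBEDDING_DIM)).astype(np.float32)


def enumerate_examples(table: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Copied from the lab's data spec.
    y = np.tile(np.arange(5, dtype=np.int64), 20)
    return table, y


def test_table_shape() -> None:
    table = build_mock_embedding_table()
    assert table.shape == (100, 64)
    assert table.dtype == np.float32


def test_label_distribution() -> None:
    _, y = enumerate_examples(build_mock_embedding_table())
    counts = np.bincount(y, minlength=5)
    assert counts.tolist() == [20, 20, 20, 20, 20]
```

The remaining two tests depend on your CrossEntropyLoss and TenseMLP, so their bodies are left to the lab.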
What 'done' looks like
- Final epoch reports acc > 0.90 with the default seed=0.
- loss_curve.png shows loss falling from ~1.6 (≈log(5), random-init baseline) to below 0.3.
- manifest.json is committed, with seed, versions, config, final_loss, final_acc.
- All four sanity tests green.
Common pitfalls
- Tokens vs embeddings confusion. The model in this lab consumes embeddings (real-valued vectors), not token IDs. The mocked embedding table sits in front of the model; the model never sees the integer index. Phase 11 (BPE) and Phase 13 (real embedding layer) close this gap. If you find yourself wanting to feed the integer 5*v + t to fc1, stop: you have a layer-order confusion.
- Over-parameterization fragility. With 2245 parameters and 100 examples, you are in the memorization regime. If accuracy plateaus around 0.20 (random), the bug is in the loss or the loop, not the capacity. If accuracy reaches 1.00 in 5 epochs, you have not made an error; that's expected. The exercise is the loop, not the generalization.
- Learning rate too high collapses softmax. lr=1e-1 with Adam produces giant updates that send logits to ±100; Softmax saturates and gradients vanish. If accuracy oscillates near 0.20 for many epochs, halve the LR.
- Forgetting opt.zero_grad(). Gradients accumulate (Phase 7/8 invariant). Without zero_grad, the second epoch's gradient is the sum of two backward passes, and the update is wrong by a factor of 2 (then 3, then 4...). Symptom: loss diverges immediately. Lab 02 already drilled this; verify in Block E.
- Identity task != generalization. Do NOT report this lab's accuracy as a model-quality result. It's an integration test of the framework. Phase 18 introduces train/val splits and the difference becomes meaningful.
- In-place ops in the loss. logits -= logits.max(...) mutates the autograd graph. Always use out-of-place ops in the CE forward.
- Random seed not honored. seed_everything(seed) must precede both the embedding-table construction and the model init, or the run isn't reproducible. Manifest claims reproducibility: verify by running twice with the same seed and diffing the final weights.
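The zero_grad pitfall is easy to see in a toy setting. The sketch below uses a hypothetical one-parameter class, not minitorch, and holds the parameter fixed (no optimizer step) so that only the accumulation effect is visible.

```python
class ToyParam:
    """Stand-in for a parameter with an accumulating .grad buffer."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.grad = 0.0


def backward(p: ToyParam) -> None:
    # loss = value**2, so d(loss)/d(value) = 2 * value.
    # The += mimics how autograd adds into .grad instead of overwriting it.
    p.grad += 2.0 * p.value


def grads_over_steps(zero_each_step: bool, steps: int = 4) -> list[float]:
    """Record the gradient buffer seen by the optimizer at each step."""
    p = ToyParam(1.0)
    seen = []
    for _ in range(steps):
        if zero_each_step:
            p.grad = 0.0          # what opt.zero_grad() does
        backward(p)
        seen.append(p.grad)
    return seen


if __name__ == "__main__":
    print(grads_over_steps(zero_each_step=True))
    print(grads_over_steps(zero_each_step=False))
```

With the reset, every step sees the true gradient 2.0; without it, the buffer grows to 2, 4, 6, 8, which is the factor-of-2-then-3-then-4 drift described in the pitfall.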
Constraints
- No Trainer class. Hand-rolled loop only. Phase 18 introduces the abstraction once you've felt the boilerplate.
- No PyTorch. All ops via minitorch and minimodel.
- A13 scope strict. 20 verbs, 5 tense forms, no plurals, no Spanish in this lab (the Spanish pair is part of the corpus from Phase 12 onward; here we only learn tense identity).
- Mocked embeddings only. Do not build a real embedding lookup; that belongs to Phase 13. Note the dependency in manifest.json under a dependencies key ({"phase-13-embeddings": "mocked-by-random-table"}).
- Full-batch, not minibatched. 100 examples → one batch. Minibatching is Phase 18.
Stop conditions
Done when:
- train.py reaches acc > 0.90 deterministically with seed=0.
- manifest.json and loss_curve.png are committed.
- All four sanity tests pass.
- You can explain in one paragraph why this lab's accuracy number is not a generalization result.
- You can predict the final loss order-of-magnitude before running (≈ 0.1–0.3 for acc ≈ 0.95).
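One way to build the intuition for that prediction is to relate loss to the probability the model puts on the correct class. The sketch below assumes, as a simplification, that the model is equally confident on every example; real runs mix confident and unsure examples, so treat the numbers as order-of-magnitude only. The function names are hypothetical.

```python
import math


def uniform_baseline(num_classes: int) -> float:
    """Cross-entropy when every class gets probability 1/num_classes."""
    return math.log(num_classes)


def loss_at_confidence(p_correct: float) -> float:
    """Cross-entropy when each example's correct class gets probability p_correct."""
    return -math.log(p_correct)


if __name__ == "__main__":
    print(f"random-init baseline: {uniform_baseline(5):.3f}")
    for p in (0.75, 0.90, 0.99):
        print(f"p_correct={p:.2f} -> loss={loss_at_confidence(p):.3f}")
```

Under this assumption, a loss between 0.1 and 0.3 corresponds to putting roughly 0.75 to 0.90 probability on the right tense, and the starting value near 1.6 is just log(5).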
When to consult solutions/
After the run hits acc > 0.90. solutions/03-train-tense-mlp-ref.md (at phase open) compares the loop to the canonical version, with notes on what Phase 18's Trainer will refactor out, and how Phase 13's real embedding layer will replace the mocked table.
Next phase: Phase 10 — initialization, normalization, regularization. PHASE_10_PLAN.md.