Lab 01: Init ablation, the headline three-curve experiment
Goal: show that init alone — same data, same architecture, same optimizer — turns a network from "trains" to "doesn't train".
Estimated time: 90–120 minutes.
Prereq: lab 00 committed.
What you produce
A directory experiments/10-init-matters/ containing:
- train.py: your training script.
- losses.json: three loss trajectories (uniform, Xavier, Kaiming).
- loss_curves.png: three curves on the same axes.
- manifest.json: standard.
- README.md (3–4 paragraphs) explaining what you trained and what happened.
The setup
A 12-layer MLP with ReLU activations, hidden dim 256, trained for 1000 steps on the toy classification problem from Phase 9 (or generated synthetically here — see below).
Three identical runs differing only in init:
- Uniform: U[-0.5, 0.5] (variance 1/12, independent of layer width).
- Xavier: N(0, 1/n_in), i.e. variance 1/n_in.
- Kaiming: N(0, 2/n_in), i.e. variance 2/n_in.
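The three schemes above can be sketched in plain NumPy. This is only an illustration: the function name init_weight and the (n_in, n_out) weight layout are assumptions, not minitorch's API, so adapt it to however your Phase 9 Linear stores its weights.

```python
import numpy as np

def init_weight(kind, n_in, n_out, rng):
    """Return an (n_in, n_out) weight matrix for the named init scheme."""
    if kind == "uniform":
        # Fixed bounds, independent of layer width: variance is 1/12.
        return rng.uniform(-0.5, 0.5, size=(n_in, n_out))
    if kind == "xavier":
        # N(0, 1/n_in): the second argument is a variance,
        # so the standard deviation passed to NumPy is sqrt(1/n_in).
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
    if kind == "kaiming":
        # N(0, 2/n_in): the factor 2 compensates for ReLU zeroing
        # roughly half of the pre-activations.
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    raise ValueError(f"unknown init: {kind}")

rng = np.random.default_rng(0)
for kind in ("uniform", "xavier", "kaiming"):
    w = init_weight(kind, 256, 256, rng)
    print(kind, float(w.var()))
```

The easy mistake here is passing the variance where NumPy expects a standard deviation; printing the empirical variance right after init catches that.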
Expected outcome:
- Uniform: forward-pass blow-up → NaN within ~20 steps, or saturated (all-zero) ReLUs → loss flatlines at chance level (ln 2 ≈ 0.693 for two balanced classes).
- Xavier: trains, but slowly. Loss curve descends sluggishly.
- Kaiming: trains as expected. Smooth descent.
The headline figure is the three curves on the same axes, clearly differentiated.
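Why the three inits behave so differently can be previewed without training. Under the usual independence assumptions, each ReLU layer multiplies the mean squared activation by roughly n_in · Var(w) / 2. For width 256 that is about 10.7 for uniform, 0.5 for Xavier, and 1 for Kaiming. The sketch below pushes one Gaussian batch through 12 bias-free square layers and records the RMS activation. It is a simplified stand-in for the lab network (no output layer, no biases), and the names SAMPLERS and activation_rms are illustrative.

```python
import numpy as np

# One weight sampler per init scheme; n is both fan-in and fan-out here.
SAMPLERS = {
    "uniform": lambda rng, n: rng.uniform(-0.5, 0.5, size=(n, n)),
    "xavier": lambda rng, n: rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n)),
    "kaiming": lambda rng, n: rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n)),
}

def activation_rms(kind, depth=12, width=256, batch=512, seed=0):
    """RMS of the post-ReLU activations after each of `depth` layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    rms = []
    for _ in range(depth):
        h = np.maximum(h @ SAMPLERS[kind](rng, width), 0.0)  # Linear + ReLU
        rms.append(float(np.sqrt(np.mean(h ** 2))))
    return rms

for kind in SAMPLERS:
    rms = activation_rms(kind)
    print(f"{kind:8s} layer 1: {rms[0]:.3g}   layer 12: {rms[-1]:.3g}")
```

If your own forward pass shows the same pattern (uniform growing by orders of magnitude, Xavier shrinking, Kaiming roughly flat), the training curves should separate as described above.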
Toy data
If you didn't already build the Phase 9 toy "safe/unsafe token" dataset, generate a synthetic one:
- 1000 training examples, 200 val.
- Input: 256-dim Gaussian vectors.
- Label: 0 if the sum of the first 10 elements is negative, 1 otherwise (deterministic, linearly separable in the first 10 dims).
This is trivially learnable by a 1-layer net. The point of the experiment is not task difficulty — it's that bad init turns a trivially learnable task into an unsolvable one.
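A minimal sketch of the synthetic dataset described above, assuming NumPy arrays are an acceptable input format for your training loop. The function name make_toy_data and the default seed are illustrative.

```python
import numpy as np

def make_toy_data(n_train=1000, n_val=200, dim=256, seed=0):
    """Gaussian inputs; label is 1 when the first 10 elements sum to >= 0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_train + n_val, dim))
    # 0 if the sum of the first 10 elements is negative, 1 otherwise.
    y = (x[:, :10].sum(axis=1) >= 0).astype(np.int64)
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])

(train_x, train_y), (val_x, val_y) = make_toy_data()
print(train_x.shape, val_x.shape, float(train_y.mean()))
```

The printed label mean should sit near 0.5, which is why a network that learns nothing plateaus at a loss of about ln 2.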
TODOs
Block A: write the training loop
- 12-layer MLP, hidden 256, ReLU activations between layers, no norm, no residual.
- Use minitorch.nn.Linear + minitorch.nn.ReLU from Phase 9.
- Loss: cross-entropy.
- Optimizer: SGD with momentum 0.9, LR 1e-3.
- Batch size 64.
- 1000 steps total.
- Log loss every step.
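For reference, here is what the loss and the optimizer step compute, written in plain NumPy rather than minitorch (whose API is not reproduced here). The momentum form shown, v ← μ·v + g followed by p ← p − lr·v, is one common convention; check which one your Phase 9 optimizer uses. Both function names are illustrative.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch; logits has shape (batch, classes)."""
    # Subtract the row max before exponentiating to avoid overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def sgd_momentum_step(params, grads, velocities, lr=1e-3, momentum=0.9):
    """In-place update: v <- momentum * v + g, then p <- p - lr * v."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v += g
        p -= lr * v

# With all-zero logits every class is equally likely, so the loss is ln 2.
print(cross_entropy(np.zeros((4, 2)), np.array([0, 1, 0, 1])))
```

Note that the max-subtraction only protects against large finite logits. If the forward pass already produced inf or NaN, the loss is NaN too, which is exactly the signal Block B looks for in the uniform run.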
Block B: three runs
- Run with init = uniform[-0.5, 0.5]. Save loss trajectory.
- Run with init = Xavier. Save.
- Run with init = Kaiming. Save.
- Same seed for data shuffling across runs (so the three runs see identical batch sequences).
- Different seed for each init's weight sampling (so the inits themselves are independent draws).
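One way to keep the two kinds of randomness apart is to give each its own generator. The sketch below assumes NumPy generators drive both shuffling and weight sampling; the seed values and the names DATA_SEED, WEIGHT_SEEDS, and batch_indices are illustrative.

```python
import numpy as np

DATA_SEED = 1234                                          # shared by all runs
WEIGHT_SEEDS = {"uniform": 1, "xavier": 2, "kaiming": 3}  # one per init

def batch_indices(n_examples, batch_size, steps, seed):
    """Yield one index array per step, reshuffling at each epoch boundary."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    pos = 0
    for _ in range(steps):
        if pos + batch_size > n_examples:   # drop the short last batch
            order = rng.permutation(n_examples)
            pos = 0
        yield order[pos:pos + batch_size]
        pos += batch_size

for kind, weight_seed in WEIGHT_SEEDS.items():
    weight_rng = np.random.default_rng(weight_seed)   # used only for init
    batches = batch_indices(1000, 64, 1000, DATA_SEED)
    first = next(batches)
    print(kind, first[:5])   # identical across the three runs
```

Because the data generator is rebuilt from the same seed for every run and never touched by weight sampling, the batch sequence cannot drift between runs.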
Block C: plot
- matplotlib. x-axis: step (0 to 1000). y-axis: loss (linear or log — try both, pick the more informative).
- Three curves with distinct colors and a legend.
- Annotate: NaN markers if any run diverges; flatline markers if any run plateaus at chance-level loss.
- Save as loss_curves.png.
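A possible shape for the plotting code, assuming losses is a dict mapping a run name to its per-step loss list (that layout, and the function name plot_losses, are assumptions). It marks the first non-finite step of any diverging run and draws a reference line at ln 2, the chance-level loss for a balanced two-class task.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend: write files without a display
import matplotlib.pyplot as plt

def plot_losses(losses, out_path="loss_curves.png", log_y=False):
    """Plot every run on shared axes and annotate the first NaN/inf step."""
    fig, ax = plt.subplots(figsize=(8, 4.5))
    for name, series in losses.items():
        y = np.asarray(series, dtype=float)        # None becomes NaN
        finite = np.isfinite(y)
        (line,) = ax.plot(np.arange(len(y)), np.where(finite, y, np.nan),
                          label=name)
        if not finite.all():
            first_bad = int(np.argmin(finite))     # first non-finite step
            ax.axvline(first_bad, color=line.get_color(), linestyle=":")
            # x in data coordinates, y as a fraction of the axes height.
            ax.text(first_bad, 0.95, f" {name}: NaN at step {first_bad}",
                    transform=ax.get_xaxis_transform(), va="top",
                    color=line.get_color())
    ax.axhline(np.log(2), color="gray", linestyle="--", label="chance (ln 2)")
    ax.set_xlabel("step")
    ax.set_ylabel("loss")
    if log_y:
        ax.set_yscale("log")
    ax.legend()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

Calling it once with log_y=False and once with log_y=True (under a different file name) makes the linear-versus-log comparison quick.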
Block D: interpret
In README.md, answer:
- At what step did uniform init produce its first NaN (or saturation)?
- What is Xavier's final loss vs Kaiming's?
- Is Xavier training or almost not training? Compute the slope of its loss curve over the last 200 steps.
- Predict, without running: if you increased the depth from 12 to 24 layers, how would each of the three curves change? Briefly justify.
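For the slope question, a least-squares line through the last 200 logged losses is usually enough. A sketch, assuming the losses are a flat list with one entry per step (tail_slope is an illustrative name):

```python
import numpy as np

def tail_slope(losses, window=200):
    """Least-squares slope (loss units per step) over the last `window` steps."""
    y = np.asarray(losses[-window:], dtype=float)
    x = np.arange(len(y))
    keep = np.isfinite(y)              # ignore NaN/inf entries, if any
    slope, _intercept = np.polyfit(x[keep], y[keep], deg=1)
    return float(slope)

# Synthetic check: a loss that drops by 0.001 per step.
steps = np.arange(1000)
print(tail_slope(2.0 - 0.001 * steps))
```

A slope that is negative but tiny compared with Kaiming's over the same window supports "almost not training"; report both numbers in the README.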
Block E: manifest
Standard. results_summary should include {steps_to_nan_uniform, final_loss_xavier, final_loss_kaiming}.
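The three summary fields can be derived directly from the loss trajectories. The sketch below covers only results_summary (the rest of the standard manifest is not reproduced here) and assumes losses is a dict keyed by "uniform", "xavier", and "kaiming"; those keys and the helper names are illustrative. It also replaces NaN and inf with null before writing, since Python's json module otherwise emits the bare token NaN, which strict JSON parsers reject.

```python
import json
import math

def first_nonfinite_step(losses):
    """Index of the first NaN/inf/None loss, or None if the run stayed finite."""
    for step, value in enumerate(losses):
        if value is None or not math.isfinite(value):
            return step
    return None

def to_jsonable(losses):
    """Replace non-finite losses with None so the output is strict JSON."""
    return [v if v is not None and math.isfinite(v) else None for v in losses]

def results_summary(losses):
    return {
        "steps_to_nan_uniform": first_nonfinite_step(losses["uniform"]),
        "final_loss_xavier": losses["xavier"][-1],
        "final_loss_kaiming": losses["kaiming"][-1],
    }

demo = {"uniform": [3.0, 40.0, float("nan")],
        "xavier": [0.69, 0.66], "kaiming": [0.69, 0.12]}
# allow_nan=False makes json.dumps raise instead of writing invalid JSON.
print(json.dumps(results_summary(demo), allow_nan=False))
print(json.dumps({k: to_jsonable(v) for k, v in demo.items()}, allow_nan=False))
```

If the uniform run flatlines instead of diverging, steps_to_nan_uniform comes out as null; say so in the README rather than inventing a step number.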
Constraints
- No normalization. Lab 02 is for that.
- No residuals. Lab 03 is for that.
- Same seed for data, independent seeds for weights. Don't conflate.
- Run on the performance governor. Variance in step time would add noise to the comparison.
Stop conditions
Done when:
- The directory has all five files.
- The three-curve plot is committed and visually crisp.
- The uniform run diverges or flatlines; the Kaiming run trains; the Xavier run is somewhere in between.
- The README answers all four Block D questions.
Pitfalls
- All three curves look identical. You probably aren't actually re-initializing between runs. Check that you call model = build_mlp(init=...) each time, not just optimizer.zero_grad().
- All three curves diverge. Your LR is too high. Drop to 1e-4 and re-try.
- All three curves train fine. Your depth is too shallow. Crank to 24 layers.
- Uniform doesn't NaN, just plateaus. The ReLUs may be saturating (all outputs zero): the gradient is zero everywhere, so there is no NaN but also no learning. Note this in the README; it's a different failure mode than blow-up but has the same underlying cause.
Hint of last resort
If you are 90 minutes in and uniform init still trains, print np.var(model.layers[0].weight.data) after init. It should be 1/12 ≈ 0.083. If you accidentally drew uniform [-0.05, 0.05], the variance is ≈ 0.0008, which is within a factor of five of Xavier's 1/256 ≈ 0.0039 instead of roughly 20× above it. Recheck your init bounds.
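The numbers in this hint follow from Var(U[-a, a]) = a²/3. A quick NumPy check that stands apart from your model (the array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
intended = rng.uniform(-0.5, 0.5, size=(256, 256))    # bounds the lab asks for
typo = rng.uniform(-0.05, 0.05, size=(256, 256))      # bounds off by 10x

print("U[-0.5, 0.5]   var:", float(intended.var()), "theory:", 0.5 ** 2 / 3)
print("U[-0.05, 0.05] var:", float(typo.var()), "theory:", 0.05 ** 2 / 3)
print("Xavier target  var:", 1 / 256)
```

If the variance printed for your first layer is closer to the second line than to the first, the bounds are the bug.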
When to consult solutions/
After all five files are committed. Solution: solutions/01-init-ablation-ref.md (phase open).
Next lab: lab/02-norm-ablation.md.