Lab 01 — Init ablation: the headline three-curve experiment

Goal: show that init alone — same data, same architecture, same optimizer — turns a network from "trains" to "doesn't train".

Estimated time: 90–120 minutes.

Prereq: lab 00 committed.


What you produce

A directory experiments/10-init-matters/ containing:

  • train.py — your training script.
  • losses.json — three loss trajectories (uniform, Xavier, Kaiming).
  • loss_curves.png — three curves on the same axes.
  • manifest.json — standard.
  • README.md (3–4 paragraphs) explaining what you trained and what happened.

The setup

A 12-layer MLP with ReLU activations, hidden dim 256, trained for 1000 steps on the toy classification problem from Phase 9 (or generated synthetically here — see below).

Three identical runs differing only in init:

  1. Uniform [-0.5, 0.5].
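  2. Xavier N(0, 1/n_in). The second argument is the variance; for the square 256×256 hidden layers this equals the Glorot value 2/(n_in + n_out).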
  3. Kaiming N(0, 2/n_in).

Expected outcome:

  • Uniform: forward-pass blow-up → NaN within ~20 steps, or dead ReLUs (every unit outputs zero) → loss flatlines at chance level (ln 2 ≈ 0.693 for two balanced classes).
  • Xavier: trains, but slowly. Loss curve descends sluggishly.
  • Kaiming: trains as expected. Smooth descent.

The headline figure is the three curves on the same axes, clearly differentiated.
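Why the three inits behave so differently can be checked before any training. For a ReLU layer with zero-mean weights, the mean squared activation is multiplied by roughly n_in · Var(w) / 2 per layer: about 10.7 for uniform [-0.5, 0.5] at width 256, 0.5 for Xavier, and 1 for Kaiming. The sketch below measures this with plain NumPy rather than minitorch. The helper names make_weights and activation_scale are hypothetical, and biases and the final classifier layer are left out for brevity.

```python
import numpy as np


def make_weights(init, n_in, n_out, rng):
    """Sample one (n_in, n_out) weight matrix for the named init scheme."""
    if init == "uniform":
        return rng.uniform(-0.5, 0.5, size=(n_in, n_out))
    if init == "xavier":
        return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
    if init == "kaiming":
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    raise ValueError(f"unknown init: {init}")


def activation_scale(init, depth=12, width=256, batch=512, seed=0):
    """Return the mean squared activation after each of `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # unit-variance Gaussian inputs
    scales = []
    for _ in range(depth):
        h = np.maximum(h @ make_weights(init, width, width, rng), 0.0)
        scales.append(float(np.mean(h ** 2)))
    return scales


if __name__ == "__main__":
    for name in ("uniform", "xavier", "kaiming"):
        s = activation_scale(name)
        print(f"{name:8s} layer 1: {s[0]:.3e}   layer 12: {s[-1]:.3e}")
```

If the uniform row does not grow by many orders of magnitude between layer 1 and layer 12, the init bounds are probably not what you intended.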

Toy data

If you didn't already build the Phase 9 toy "safe/unsafe token" dataset, generate a synthetic one:

  • 1000 training examples, 200 val.
  • Input: 256-dim Gaussian vectors.
  • Label: 0 if the sum of the first 10 elements is negative, 1 otherwise (deterministic, linearly separable in the first 10 dims).

This is trivially learnable by a 1-layer net. The point of the experiment is not task difficulty — it's that bad init turns a trivially learnable task into an unsolvable one.
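A minimal sketch of the synthetic dataset described above, assuming NumPy arrays are an acceptable input format for your training script. The function name make_toy_data and the seed value are illustrative, not part of the lab's codebase.

```python
import numpy as np


def make_toy_data(n_train=1000, n_val=200, dim=256, seed=0):
    """Gaussian inputs; label is 1 when the first 10 elements sum to >= 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_train + n_val, dim)).astype(np.float32)
    # Label 0 if the sum of the first 10 elements is negative, 1 otherwise.
    y = (x[:, :10].sum(axis=1) >= 0).astype(np.int64)
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])


(train_x, train_y), (val_x, val_y) = make_toy_data()
```

Because the sum of ten symmetric Gaussians is itself symmetric around zero, the two classes come out roughly balanced.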

TODOs

Block A — write the training loop

  • 12-layer MLP, hidden 256, ReLU activations between layers, no norm, no residual.
  • Use minitorch.nn.Linear + minitorch.nn.ReLU from Phase 9.
  • Loss: cross-entropy.
  • Optimizer: SGD with momentum 0.9, LR 1e-3.
  • Batch size 64.
  • 1000 steps total.
  • Log loss every step.
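The loop itself should use the minitorch modules from Phase 9, whose API is not reproduced here. As a cross-check for your implementation, here is a plain NumPy reference for the two numerical pieces: cross-entropy computed from raw logits through a stable log-softmax, and one SGD-with-momentum update. The momentum convention used (v ← momentum · v + g, then p ← p − lr · v) is an assumption; your optimizer may define it differently.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits, using a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoid exp overflow
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def sgd_momentum_step(params, grads, velocities, lr=1e-3, momentum=0.9):
    """In-place update of each array: v <- momentum * v + g; p <- p - lr * v."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v += g
        p -= lr * v
```

With all-zero logits over two classes, cross_entropy returns ln 2 ≈ 0.693, which is the chance-level value a stuck run hovers around.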

Block B — three runs

  • Run with init = uniform[-0.5, 0.5]. Save loss trajectory.
  • Run with init = Xavier. Save.
  • Run with init = Kaiming. Save.
  • Same seed for data shuffling across runs (so the three runs see identical batch sequences).
  • Different seed for each init's weight sampling (so the inits themselves are independent draws).
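One way to keep the two sources of randomness apart is to give each its own NumPy generator. The sketch below is illustrative: the seed values and the names batch_indices and run_rngs are hypothetical, and the batching drops the incomplete batch at the end of each epoch, which is a simplification you may not want.

```python
import numpy as np

DATA_SEED = 1234  # shared by all three runs
WEIGHT_SEEDS = {"uniform": 101, "xavier": 102, "kaiming": 103}  # one per init


def batch_indices(n_examples, batch_size, steps, seed):
    """Yield one index array per step, reshuffling at each epoch boundary."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    pos = 0
    for _ in range(steps):
        if pos + batch_size > n_examples:  # not enough left for a full batch
            order = rng.permutation(n_examples)
            pos = 0
        yield order[pos:pos + batch_size]
        pos += batch_size


def run_rngs(init, n_examples=1000, batch_size=64, steps=1000):
    """Return (weight_rng, batch iterator) for one run."""
    weight_rng = np.random.default_rng(WEIGHT_SEEDS[init])
    return weight_rng, batch_indices(n_examples, batch_size, steps, DATA_SEED)
```

Drawing weights only from weight_rng and batches only from the iterator means a change to one init cannot shift the batch sequence of another run.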

Block C — plot

  • matplotlib. x-axis: step (0 to 1000). y-axis: loss (linear or log — try both, pick the more informative).
  • Three curves with distinct colors and a legend.
  • Annotate: NaN markers if any run diverges; flatline markers if any run plateaus at chance-level loss (ln 2 ≈ 0.693).
  • Save as loss_curves.png.
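A possible shape for the plotting code, assuming the losses are held as a dict of run name to per-step list. The function name plot_losses and its return value are illustrative. It marks the first non-finite step of each run with a dotted vertical line and draws a dashed reference line at ln 2, the chance-level loss for two balanced classes.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt


def plot_losses(losses, path="loss_curves.png", log_y=True):
    """Plot per-step losses; return {run name: first non-finite step or None}."""
    fig, ax = plt.subplots(figsize=(8, 5))
    first_bad = {}
    for name, values in losses.items():
        # Map inf to NaN; matplotlib draws NaN as a gap in the line.
        clean = [v if math.isfinite(v) else float("nan") for v in values]
        (line,) = ax.plot(range(len(clean)), clean, label=name)
        bad = next((i for i, v in enumerate(values) if not math.isfinite(v)), None)
        first_bad[name] = bad
        if bad is not None:
            ax.axvline(bad, color=line.get_color(), linestyle=":",
                       label=f"{name}: first NaN/inf at step {bad}")
    ax.axhline(math.log(2), color="gray", linestyle="--", label="chance level (ln 2)")
    if log_y:
        ax.set_yscale("log")
    ax.set_xlabel("step")
    ax.set_ylabel("loss")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return first_bad
```

Call it once with log_y=True and once with log_y=False (writing to a different path) to compare the two y-axis options before committing one.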

Block D — interpret

In README.md, answer:

  1. At what step did uniform init produce its first NaN (or saturation)?
  2. What is Xavier's final loss vs Kaiming's?
  3. Is Xavier training or almost not training? Compute the slope of its loss curve over the last 200 steps.
  4. Predict, without running: if you increased the depth from 12 to 24 layers, how would each of the three curves change? Briefly justify.
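Questions 1 and 3 reduce to two small computations on the saved trajectories. The helpers below are a sketch with hypothetical names: first_nonfinite_step finds the first NaN or inf, and tail_slope fits a straight line to the last 200 losses with a least-squares fit and returns its slope in loss units per step.

```python
import numpy as np


def first_nonfinite_step(losses):
    """Index of the first NaN/inf loss, or None if every loss is finite."""
    bad = np.flatnonzero(~np.isfinite(np.asarray(losses, dtype=float)))
    return int(bad[0]) if bad.size else None


def tail_slope(losses, window=200):
    """Least-squares slope (loss per step) over the last `window` steps."""
    tail = np.asarray(losses[-window:], dtype=float)
    if not np.all(np.isfinite(tail)):
        raise ValueError("tail contains NaN/inf; slope is undefined")
    slope, _intercept = np.polyfit(np.arange(len(tail)), tail, deg=1)
    return float(slope)
```

A slope close to zero alongside a final loss near ln 2 suggests the run is barely training; a clearly negative slope means it is still descending at step 1000.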

Block E — manifest

Standard. results_summary should include {steps_to_nan_uniform, final_loss_xavier, final_loss_kaiming}.
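The three summary fields can be derived directly from the saved trajectories. This sketch covers only results_summary; the other manifest fields follow the standard format and are not shown. The function names are hypothetical, and steps_to_nan_uniform is set to None (null in JSON) when the uniform run never produces a non-finite loss, which is an assumption about how you want to record a flatline.

```python
import json
import math


def results_summary(losses):
    """losses: dict with keys 'uniform', 'xavier', 'kaiming' -> per-step lists."""
    steps_to_nan = next(
        (i for i, v in enumerate(losses["uniform"]) if not math.isfinite(v)), None
    )
    return {
        "steps_to_nan_uniform": steps_to_nan,  # None if the run never diverged
        "final_loss_xavier": losses["xavier"][-1],
        "final_loss_kaiming": losses["kaiming"][-1],
    }


def summary_json(losses):
    """Serialize the summary under a 'results_summary' key."""
    return json.dumps({"results_summary": results_summary(losses)}, indent=2)
```

Note that Python's json module writes NaN and inf as bare NaN and Infinity tokens by default, which strict JSON parsers reject; keep that in mind when saving losses.json for a diverged run.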

Constraints

  • No normalization. Lab 02 is for that.
  • No residuals. Lab 03 is for that.
  • Same seed for data, independent seeds for weights. Don't conflate.
  • Run with the CPU frequency governor set to performance. Variance in step time would add noise to the comparison.

Stop conditions

Done when:

  1. The directory has all five files.
  2. The three-curve plot is committed and visually crisp.
  3. The uniform run diverges or flatlines; the Kaiming run trains; the Xavier run is somewhere in between.
  4. The README answers all four Block D questions.

Pitfalls

  • All three curves look identical. You probably aren't actually re-initializing between runs. Check that you call model = build_mlp(init=...) for each run, not just optimizer.zero_grad().
  • All three curves diverge. Your LR is too high. Drop to 1e-4 and re-try.
  • All three curves train fine. Your depth is too shallow. Crank to 24 layers.
  • Uniform doesn't NaN, just plateaus. The ReLUs may be dead (outputting zero for every input), in which case the gradient is zero everywhere: no NaN, but no learning. Note this in the README; it's a different failure mode than blow-up but has the same underlying cause.

Hint of last resort

If you are 90 minutes in and uniform init still trains, print np.var(model.layers[0].weight.data) after init. It should be 1/12 ≈ 0.083. If you accidentally drew uniform [-0.05, 0.05], the variance is ≈ 0.0008, which is far closer to Xavier's 1/256 ≈ 0.0039 than to 0.083, so the forward pass no longer blows up. Recheck your init bounds.
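The expected numbers come from the variance of a uniform distribution on [a, b], which is (b − a)² / 12. A quick standalone check in NumPy, independent of minitorch (the helper name uniform_variance is hypothetical):

```python
import numpy as np


def uniform_variance(half_width, shape=(256, 256), seed=0):
    """Return (empirical, theoretical) variance of uniform[-half_width, half_width]."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-half_width, half_width, size=shape)
    return float(np.var(w)), (2 * half_width) ** 2 / 12


if __name__ == "__main__":
    for hw in (0.5, 0.05):
        emp, theo = uniform_variance(hw)
        print(f"uniform[-{hw}, {hw}]: empirical {emp:.5f}, theoretical {theo:.5f}")
```

If the variance printed from your model matches the second row rather than the first, the bounds are off by a factor of ten.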

When to consult solutions/

After all five files are committed. Solution: solutions/01-init-ablation-ref.md (phase open).


Next lab: lab/02-norm-ablation.md.