Lab 00 — Variance walk: see activations explode/collapse

Goal: empirically observe what theory 01 predicts — forward-pass variance evolves layer-by-layer as a function of σ_W².

Estimated time: 45–60 minutes.

Prereq: theory/00-motivation.md and theory/01-initialization.md read.


What you produce

A directory experiments/10-variance-walk/ containing:

  • walk.py — your measurement script.
  • results.json — per-init, per-layer activation variance.
  • variance.png — semilog-y plot of activation variance vs layer index, three curves (uniform, Xavier, Kaiming).
  • manifest.json with {seed, versions, config, hardware}, per LYNX_CORTEX.md §5.
  • README.md (2–3 paragraphs) explaining what you saw and how it matches (or doesn't) the formula Var(y) = n_in · σ_W² · Var(x).

The setup

A 20-layer MLP, hidden dim 256, no activation (linear-only). You're isolating the linear-layer variance behavior from any activation effect.

Take a fixed input vector (or batch) with unit variance. Forward through 20 linear layers. At each layer, record the empirical variance of the activations.

Three runs, identical except for the init scheme:

  1. Uniform [-0.5, 0.5] (bad — variance 1/12 ≈ 0.083, doesn't scale with n_in).
  2. Xavier N(0, 1/n_in).
  3. Kaiming N(0, 2/n_in).

Phase 10's variance walk is linear-only — no activation. This isolates the init effect from the activation effect (we'll add the activation in lab 01).
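The formula Var(y) = n_in · σ_W² · Var(x) means each scheme multiplies the variance by a fixed factor n_in · σ_W² per layer. As a quick arithmetic check, the sketch below works out that factor for the three inits and the predicted order of magnitude after 20 layers. The names N_IN, DEPTH, WEIGHT_VAR, and per_layer_gain are illustrative, not part of the lab's required interface.

```python
import math

N_IN = 256   # hidden width from the setup
DEPTH = 20   # number of linear layers

# Weight variance sigma_W^2 for each init scheme.
WEIGHT_VAR = {
    "uniform": (0.5 - (-0.5)) ** 2 / 12,  # Var of U[a, b] is (b - a)^2 / 12
    "xavier": 1.0 / N_IN,
    "kaiming": 2.0 / N_IN,
}


def per_layer_gain(var_w, n_in=N_IN):
    """Factor by which one linear layer multiplies the activation variance."""
    return n_in * var_w


for name, var_w in WEIGHT_VAR.items():
    gain = per_layer_gain(var_w)
    # log10 of the predicted variance after DEPTH layers, starting from Var(x) = 1
    log10_final = DEPTH * math.log10(gain)
    print(f"{name:8s} gain/layer = {gain:7.3f}   log10(Var at layer {DEPTH}) = {log10_final:6.2f}")
```

Under these numbers the uniform init multiplies variance by about 21.3 per layer, Xavier by exactly 1, and Kaiming by exactly 2, so in a linear-only network the Kaiming curve is expected to grow to about 2^20 ≈ 10^6 by layer 20.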

TODOs

Block A — write the variance walker

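  • 20-layer MLP, hidden 256. Use plain NumPy arrays with np.dot directly (if you preallocate with np.empty, overwrite every element before use, since np.empty returns uninitialized memory), or build it on minitorch Linear modules. Either works; the latter exercises the Phase 9 stack.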
  • Initialize per the three schemes. Seed each init separately so the three runs are reproducible.
  • Forward a batch of 64 samples, each a 256-dim Gaussian with variance 1.
  • At each layer's output, record: layer_idx, var_empirical, var_theoretical_prediction. var_theoretical = previous-layer variance × n_in × σ_W².
  • Save as results.json.
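One possible shape for the walker, as a minimal NumPy sketch rather than the reference solution. It assumes the theoretical curve is compounded from the measured input variance (one reading of "previous-layer variance × n_in × σ_W²"), uses np.random.default_rng instead of np.random.seed, and picks its own names (INITS, variance_walk) and seed offsets. Writing results.json and handling inf/nan are left out.

```python
import numpy as np

DIM, DEPTH, BATCH = 256, 20, 64

# name -> (sampler returning W with shape (n_out, n_in), weight variance sigma_W^2)
INITS = {
    "uniform": (lambda rng, n_out, n_in: rng.uniform(-0.5, 0.5, size=(n_out, n_in)), 1.0 / 12.0),
    "xavier": (lambda rng, n_out, n_in: rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in), 1.0 / DIM),
    "kaiming": (lambda rng, n_out, n_in: rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in), 2.0 / DIM),
}


def variance_walk(init_name, seed):
    """Forward a Gaussian batch through DEPTH linear layers and record variances."""
    sampler, var_w = INITS[init_name]
    rng = np.random.default_rng(seed)          # one generator per run, for reproducibility
    x = rng.standard_normal((BATCH, DIM))      # unit-variance input batch
    var_theory = float(np.var(x))              # start the prediction from the measured input variance
    records = []
    for layer_idx in range(1, DEPTH + 1):
        W = sampler(rng, DIM, DIM)             # fresh weights for this layer
        x = x @ W.T                            # linear layer, no bias, no activation
        var_theory *= DIM * var_w              # Var(y) = n_in * sigma_W^2 * Var(x)
        records.append({
            "layer_idx": layer_idx,
            "var_empirical": float(np.var(x)),  # scalar variance over all B x d elements
            "var_theoretical_prediction": var_theory,
        })
    return records


# Seed each init separately (the offsets are an arbitrary choice).
results = {name: variance_walk(name, seed=42 + i) for i, name in enumerate(INITS)}
```

Passing the seed into each run keeps the three curves independent of the order in which you execute them.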

Block B — plot

  • matplotlib. x-axis: layer index (0 to 20). y-axis: variance, log scale.
  • Three curves: uniform, Xavier, Kaiming.
  • On the same axes: the theoretical curves as dashed lines. Empirical should track theoretical to within ~10%.
  • Save variance.png.
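A hedged sketch of the plot described above, assuming matplotlib is installed and that results maps each init name to a list of per-layer records with layer_idx, var_empirical, and var_theoretical_prediction keys. That record layout and the function name plot_variance are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt


def plot_variance(results, path="variance.png"):
    """Plot empirical (solid) and theoretical (dashed) variance per layer on a log y-axis."""
    fig, ax = plt.subplots(figsize=(7, 4.5))
    for name, records in results.items():
        layers = [r["layer_idx"] for r in records]
        empirical = [r["var_empirical"] for r in records]
        theory = [r["var_theoretical_prediction"] for r in records]
        (line,) = ax.plot(layers, empirical, marker="o", label=f"{name} (empirical)")
        # Reuse the empirical curve's color so each dashed line pairs visually with it.
        ax.plot(layers, theory, linestyle="--", color=line.get_color(), label=f"{name} (theory)")
    ax.set_yscale("log")
    ax.set_xlabel("layer index")
    ax.set_ylabel("activation variance")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

If a run produced inf or nan entries, filter them out before plotting, since a log axis cannot place them.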

Block C — interpret

In README.md, answer:

  1. What does the uniform-init curve do? Decay or explode? By what factor per layer? Match the formula.
  2. What does the Xavier curve do? Should hover near 1.0 across layers. Yours should too (within batch-variance noise).
  3. Why is the Xavier curve not exactly 1.0? It dips/drifts a bit. Why? (Hint: the formula holds in expectation; you're estimating it from 64 samples and one finite 256×256 weight draw per layer.)
  4. What would change if you used ReLU activation between layers? Predict the Xavier and Kaiming curves under ReLU. Bonus: re-run with ReLU as a side experiment and confirm.

Block D — manifest

{
  "experiment": "10-variance-walk",
  "date": "YYYY-MM-DD",
  "seed": 42,
  "versions": { "python": "3.11.x", "numpy": "...", "minigrad": "..." },
  "hardware": { ... },
  "config": {
    "depth": 20,
    "hidden_dim": 256,
    "batch_size": 64,
    "input_variance": 1.0,
    "inits": ["uniform[-0.5, 0.5]", "xavier N(0, 1/n)", "kaiming N(0, 2/n)"]
  },
  "results_summary": {
    "uniform_final_layer_variance": null,
    "xavier_final_layer_variance": null,
    "kaiming_final_layer_variance": null
  }
}
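Rather than typing version strings by hand, the manifest can be filled in from the running interpreter. This is a sketch under assumptions: the hardware keys are placeholders (LYNX_CORTEX.md §5 defines the real schema), the function name build_manifest is made up here, and the minigrad version is omitted because how that package exposes its version is not stated.

```python
import json
import platform
from datetime import date

import numpy as np


def build_manifest(seed, config, results_summary):
    """Assemble the manifest dict; hardware keys are illustrative placeholders."""
    return {
        "experiment": "10-variance-walk",
        "date": date.today().isoformat(),          # YYYY-MM-DD
        "seed": seed,
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
        "hardware": {
            "system": platform.system(),
            "machine": platform.machine(),
        },
        "config": config,
        "results_summary": results_summary,
    }


def write_manifest(manifest, path="manifest.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
```

Call build_manifest after the three runs finish so results_summary can carry the final-layer variances instead of null.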

Constraints

  • No activation between linear layers. This isolates init from activation. Lab 01 will add the activation.
  • No norm. Same reason.
  • No backward. Forward only. Backward variance is theory 01's other half — we'll cover it numerically in a future lab if needed.
  • Single thread. No need for parallelism here; the computation is cheap.

Stop conditions

You're done when:

  1. The directory has all five files.
  2. variance.png shows: uniform exploding by ~21× (more than an order of magnitude) per layer, Xavier holding near 1.0, Kaiming doubling every layer to roughly 2^20 ≈ 10^6 by layer 20 (its per-layer gain is n_in · 2/n_in = 2 because it's tuned for ReLU, not linear).
  3. README answers all four Block C questions.
  4. You can re-derive the variance formula on paper without looking at theory 01.

Pitfalls (read before debugging)

  • Variance computed wrong. np.var(activations) with no axis argument returns one scalar variance over all B × d = 64 × 256 elements. That is fine for our purposes, since features are i.i.d. by init. If you pass axis=0 or axis=1 you get a vector of per-feature or per-sample variances instead, so check the axis before logging a single number.
  • Uniform curve goes flat instead of exploding. Check the bounds. uniform[-0.5, 0.5] has variance 1/12 ≈ 0.083. Multiplied by n_in = 256 → ratio per layer ≈ 21×. By layer 20 → 21^20 ≈ 10^26. If your curve is flat, you're probably normalizing somewhere.
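  • Xavier curve drifts upward. Could be the wrong fan (n_out instead of n_in), although with square 256×256 layers the two coincide. Also check that you haven't applied the Kaiming factor of 2. Recheck.
  • Floating point overflow in the uniform run. In float64 or float32 the uniform run should not overflow within 20 layers: the variance peaks near 10^26, well inside both ranges. In float16 (max ≈ 65504) the activations do overflow at layer 7-ish, and that is expected. Either way, your script must catch inf/nan and not crash. Log a "saturated" marker and continue.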
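The inf/nan guard from the last pitfall might look like the sketch below. The helper name safe_variance and the choice to store None alongside a saturated flag are assumptions, not a required format. One reason to store None: Python's json module writes float("inf") as Infinity by default, which is not valid standard JSON.

```python
import json
import math

import numpy as np


def safe_variance(activations):
    """Return (variance, saturated). Variance is None if it is inf or nan."""
    # Suppress NumPy's overflow/invalid warnings; we check the result explicitly.
    with np.errstate(over="ignore", invalid="ignore"):
        var = float(np.var(activations))
    if not math.isfinite(var):
        return None, True
    return var, False


def make_record(layer_idx, activations):
    var, saturated = safe_variance(activations)
    record = {"layer_idx": layer_idx, "var_empirical": var}
    if saturated:
        record["status"] = "saturated"   # marker for layers past the overflow point
    return record


# Example: a healthy layer and an overflowed one both serialize cleanly.
print(json.dumps(make_record(1, np.array([1.0, 2.0, 3.0]))))
print(json.dumps(make_record(8, np.array([1.0, np.inf]))))
```

With this in place the loop can keep running after saturation and results.json still has one entry per layer.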

Hint of last resort

If you've spent 60 minutes and can't get Xavier to stabilize: check np.random.seed(seed); W = np.random.randn(n_out, n_in) * np.sqrt(1 / n_in). That's Xavier. If (n_out, n_in) is flipped, the variance still works forward by accident, but the matmul shape doesn't. (In this lab n_out = n_in = 256, so a flip cannot show up as a shape error.)

When to consult solutions/

After all five files are committed. Solution: solutions/00-variance-walk-ref.md (phase open).


Next lab: lab/01-init-ablation.md.