Phase 18 — Stability check decision tree

When a run behaves strangely, don't improvise. Walk this tree from top to bottom. Each node is a question with an explicit numeric threshold; each leaf leads to a concrete fix or to Phase 19.


A runnable checklist a learner walks down when a Phase-18 training run misbehaves. Use this before opening Phase 19 — many "training is broken" symptoms are configuration issues, not dynamics issues.

How to use

  1. Have dashboard.html (or your live metrics) open.
  2. Have experiments/<run>/manifest.json open (seed, versions, config).
  3. Walk from §1 → §6 in order. Don't skip. The order is by likelihood × cheapness of check.
  4. The first leaf that matches is your fix. Apply, re-run, re-walk from §1.

§1 — Did the run start?

Q1.1 Is the loss at step 0 within \(\ln V \pm 0.5\), where \(V\) is the vocab size?

- For the §A13 vocab (\(V \approx 512\)): expect \(\ln 512 \approx 6.24\).
- If loss(step 0) > 8.0: init is broken or the LM head bias is non-zero. Go to §5.
- If loss(step 0) < 4.0: the model is not untrained. Either the seed is wrong or you loaded a checkpoint by mistake. Check manifest.json.
- Else: pass, go to Q1.2.
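The Q1.1 thresholds can be sketched as a small check. This is a minimal sketch, not course code: the helper name `check_initial_loss` and the "borderline" label for values that fall between the \(\pm 0.5\) band and the hard limits are assumptions made for illustration. The 8.0 and 4.0 limits are the ones quoted above for \(V \approx 512\).

```python
import math

def check_initial_loss(loss0: float, vocab_size: int) -> str:
    """Classify the step-0 loss against the Q1.1 thresholds (hypothetical helper)."""
    expected = math.log(vocab_size)  # loss of a uniform guess over V tokens
    if loss0 > 8.0:
        return "too_high"    # suspect init or LM head bias, go to §5
    if loss0 < 4.0:
        return "too_low"     # model is not untrained, check manifest.json
    if abs(loss0 - expected) <= 0.5:
        return "pass"
    return "borderline"      # assumption: between the band and the hard limits

print(round(math.log(512), 2))        # 6.24
print(check_initial_loss(6.3, 512))   # pass
```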

Q1.2 Is the first-batch gradient norm in \([0.1, 10]\)?

- If grad-norm(step 1) > 100: init is wrong (see §5) or the loss uses the wrong reduction (mean and sum mixed up).
- If grad-norm(step 1) < 0.001: the data is degenerate (every target token is the same, or the loss mask is all zeros).
- Else: pass, go to §2.
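For reference, the gradient norm in Q1.2 is the L2 norm taken over all parameter gradients together. The sketch below assumes gradients are available as flat lists of floats, one list per parameter; `global_grad_norm` and `check_first_grad_norm` are hypothetical names, and the "borderline" label for values outside \([0.1, 10]\) but inside the hard limits is an assumption.

```python
import math

def global_grad_norm(grads):
    """L2 norm over every gradient value, given one flat list per parameter."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def check_first_grad_norm(norm: float) -> str:
    """Classify the step-1 gradient norm against the Q1.2 thresholds."""
    if norm > 100.0:
        return "too_large"   # suspect init or a sum-instead-of-mean reduction
    if norm < 0.001:
        return "too_small"   # suspect degenerate targets or an all-zero mask
    if 0.1 <= norm <= 10.0:
        return "pass"
    return "borderline"      # assumption: not named explicitly in the tree

norm = global_grad_norm([[3.0], [4.0]])
print(norm, check_first_grad_norm(norm))
```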


§2 — Is the LR schedule applying?

Q2.1 Plot LR vs step. Does the curve match the configured warmup + cosine?

- If LR is constant and non-zero: the scheduler never stepped. Common bug: the optimizer is rebuilt each step, losing the schedule state. Fix: instantiate the optimizer and scheduler once, outside the training loop.
- If LR is constant at 0: warmup is configured, but the multiplier is being applied to a param_group["lr"] of 0. Fix: initialize param_group["lr"] = lr_max and let the scheduler scale it.
- If LR jumps discontinuously: you're using step decay, not cosine. Match the config.
- Else: pass, go to Q2.2.

Q2.2 At step warmup_steps (\(W\)), is LR within 5% of lr_max?

- If LR at step \(W\) is < 0.95 × lr_max: likely an off-by-one in the warmup formula. t / W vs (t+1) / W matters at small W.
- Else: pass, go to §3.
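The off-by-one in Q2.2 is easiest to see numerically. The sketch below assumes a common schedule shape (linear warmup followed by cosine decay to zero) and zero-indexed steps; the course's actual scheduler may differ, and `lr_at` and `within_5_percent` are hypothetical names.

```python
import math

def lr_at(step, lr_max, warmup_steps, total_steps, plus_one=True):
    """LR for a zero-indexed step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        t = step + 1 if plus_one else step   # the (t+1)/W vs t/W choice
        return lr_max * t / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

def within_5_percent(lr, lr_max):
    return abs(lr - lr_max) <= 0.05 * lr_max

W, lr_max = 10, 3e-4
last_warmup = W - 1  # the W-th optimizer step when counting from zero
ok = within_5_percent(lr_at(last_warmup, lr_max, W, 1000, plus_one=True), lr_max)
bad = within_5_percent(lr_at(last_warmup, lr_max, W, 1000, plus_one=False), lr_max)
print(ok, bad)
```

With W = 10 the t / W variant ends warmup at 0.9 × lr_max, outside the 5% band; with W = 100 it would end at 0.99 × lr_max and the check would not catch it, which is why the error matters mostly at small W.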


§3 — Are gradients sane?

Q3.1 Plot the pre-clip gradient norm as a rolling mean over a 20-step window.

- If rolling-mean(grad-norm) > 5.0: training is in the "always clipping" regime. Either LR is too high, the batch size is too small, or one batch has a pathological example. Reduce lr_max by 2× and re-walk.
- If rolling-mean(grad-norm) < 0.01: dead training. Either the loss mask zeros out most tokens, all parameters are frozen, or the LR is so small that no updates land.
- Else: pass, go to Q3.2.
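A rolling mean for Q3.1 takes only a few lines. This is a minimal sketch, assuming a trailing window that is shorter than 20 during the first steps; `rolling_mean` and `classify_grad_regime` are hypothetical names.

```python
from collections import deque

def rolling_mean(values, window=20):
    """Trailing mean over the last `window` values (fewer at the start)."""
    buf, total, out = deque(), 0.0, []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out

def classify_grad_regime(rolled: float) -> str:
    """Apply the Q3.1 thresholds to one rolling-mean value."""
    if rolled > 5.0:
        return "always_clipping"
    if rolled < 0.01:
        return "dead"
    return "pass"

print(rolling_mean([2.0, 4.0]))
print(classify_grad_regime(6.0))
```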

Q3.2 Any single step with grad-norm > 100?

- If yes once in 200 steps: acceptable, clipping handled it. Investigate that batch and note its index.
- If yes more than 3 times in 200 steps: instability. Go to Phase 19 stability-check.md §2 (loss spikes).
- Else: pass, go to §4.
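Counting spikes and recording their step indices can be sketched as follows. `classify_spikes` is a hypothetical helper; treating two or three spikes the same as a single one ("acceptable") is an assumption, since the tree only names the single-spike and more-than-three cases.

```python
def classify_spikes(grad_norms, threshold=100.0, window=200):
    """Return (verdict, spike_indices) for the last `window` steps."""
    start = max(0, len(grad_norms) - window)
    spikes = [i for i in range(start, len(grad_norms))
              if grad_norms[i] > threshold]
    if len(spikes) > 3:
        verdict = "unstable"      # go to Phase 19, loss spikes
    elif spikes:
        verdict = "acceptable"    # clipping handled it; inspect these batches
    else:
        verdict = "pass"
    return verdict, spikes

norms = [1.0] * 200
norms[17] = 250.0
print(classify_spikes(norms))
```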


§4 — Are losses moving?

Q4.1 Plot loss with a 50-step EMA. Does it decrease monotonically over the first 200 steps?

- If train-loss(step 200) > train-loss(step 50): training is diverging or static. Check LR (§2) and optimizer state (§6).
- If train-loss decreases but the slope is < 0.001 per 100 steps: LR may be too low. Try lr_max × 3 and re-walk.
- Else: pass, go to Q4.2.
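The Q4.1 comparisons can be run on the smoothed curve directly. This sketch assumes the usual span-to-alpha convention, \(\alpha = 2 / (\text{span} + 1)\), for a "50-step EMA", and measures the slope between steps 50 and 200; both choices, and the names `ema` and `check_loss_trend`, are assumptions.

```python
def ema(values, span=50):
    """Exponential moving average, seeded with the first value."""
    alpha = 2.0 / (span + 1)  # assumption: common span-to-alpha convention
    out, state = [], None
    for v in values:
        state = v if state is None else alpha * v + (1.0 - alpha) * state
        out.append(state)
    return out

def check_loss_trend(losses):
    """Apply the Q4.1 checks; needs losses for steps 0..200 inclusive."""
    s = ema(losses)
    if s[200] > s[50]:
        return "diverging_or_static"
    drop_per_100 = (s[50] - s[200]) / 1.5   # 150 steps between the two points
    if drop_per_100 < 0.001:
        return "lr_maybe_too_low"
    return "pass"

healthy = [6.0 - 0.01 * i for i in range(201)]
print(check_loss_trend(healthy))
```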

Q4.2 Is val loss tracking train loss within 0.5?

- If val − train > 1.0 by step 500: overfitting fast. Increase weight_decay (try 0.2), reduce capacity, or add data (within §A13 scope, that means more conjugation variants).
- If val is below train by more than 0.3: data leak. The val set contains train items.
- Else: pass, go to §5.
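Both Q4.2 branches can be checked mechanically. The sketch below is illustrative: `check_generalization_gap` and `leaked_items` are hypothetical helpers, and the example items are made up. It assumes dataset items are hashable (for example, raw strings) so that an exact-match overlap can be computed.

```python
def check_generalization_gap(train_loss: float, val_loss: float) -> str:
    """Apply the Q4.2 thresholds to losses measured at the same step."""
    gap = val_loss - train_loss
    if gap > 1.0:
        return "overfitting"
    if gap < -0.3:
        return "possible_leak"
    return "pass"

def leaked_items(train_items, val_items):
    """Items that appear verbatim in both splits."""
    return sorted(set(train_items) & set(val_items))

print(check_generalization_gap(2.0, 3.5))
print(leaked_items(["hablo", "comes"], ["comes", "vivimos"]))
```

An exact-match overlap only catches verbatim duplicates; near-duplicates would need a looser comparison.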


§5 — Is the model architecturally sound?

Q5.1 Print initial weight statistics (mean, std, min, max) for each named parameter group.

- Expected (with Kaiming for ReLU/GELU, or scaled Xavier):
  - Linear weight std: \(\sqrt{2 / \text{fan\_in}}\) (Kaiming) or \(\sqrt{2 / (\text{fan\_in} + \text{fan\_out})}\) (Xavier).
  - For our mini-GPT: linear weight std should be in \([0.02, 0.1]\).
  - Embedding std: 0.02 (the GPT convention).
  - Bias and LayerNorm scale: 0 and 1 respectively.
- If std is 100× expected: this is the bad-init failure from Phase 19. Go to Phase 19.
- If the mean of the Linear weights differs from 0 by more than 0.01: re-init.
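The expected standard deviations in Q5.1 follow directly from the two formulas. A minimal sketch, assuming weights are available as a flat list of floats; the fan sizes (512, 256, 768) are example values, not the course model's dimensions, and the helper names are hypothetical.

```python
import math
import statistics

def kaiming_std(fan_in: int) -> float:
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in: int, fan_out: int) -> float:
    return math.sqrt(2.0 / (fan_in + fan_out))

def weight_stats(values):
    """Mean, population std, min, and max of a flat list of weights."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

print(kaiming_std(512))                 # 0.0625, inside [0.02, 0.1]
print(round(xavier_std(256, 768), 4))   # 0.0442
```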

Q5.2 Forward one batch and log activation magnitudes at each layer.

- Expected: the activation Frobenius norm grows by at most ~1.5× per layer.
- If a layer's output is > 10× its input: missing or broken normalization. Check LN/RMSNorm placement.
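Given one Frobenius norm per layer, the growth check in Q5.2 is a ratio between consecutive entries. This sketch assumes `layer_norms[0]` is the norm of the input to the first layer and `layer_norms[k]` is the norm of layer k's output; the function names are hypothetical.

```python
import math

def frobenius_norm(matrix):
    """Frobenius norm of a 2-D activation given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def growth_ratios(layer_norms):
    """Output-to-input norm ratio for each layer."""
    return [b / a for a, b in zip(layer_norms, layer_norms[1:])]

def flag_layers(layer_norms, hard_limit=10.0):
    """1-based indices of layers whose output exceeds hard_limit x their input."""
    return [i + 1 for i, r in enumerate(growth_ratios(layer_norms))
            if r > hard_limit]

print(frobenius_norm([[3.0, 4.0]]))
print(flag_layers([1.0, 1.2, 15.0, 16.0]))
```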


§6 — Is the optimizer state coherent?

Q6.1 Are param_groups correctly split for weight decay?

- 2-D+ tensors (weights, embeddings): should be in the decay group with weight_decay = 0.1.
- 1-D tensors (biases, LN scale/bias): should be in the no_decay group with weight_decay = 0.0.
- If everything is in one group: weight decay is also applied to biases and LN parameters, so they are pulled toward zero over training. Fix the split.
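The split in Q6.1 depends only on tensor rank. The sketch below works on a name-to-shape mapping so it runs without a framework; in a real optimizer the groups would hold the parameter tensors themselves rather than names. The parameter names and shapes are made up for illustration, and `split_param_groups` is a hypothetical helper.

```python
def split_param_groups(param_shapes, weight_decay=0.1):
    """Group parameter names by rank: 2-D+ gets decay, 1-D gets none."""
    decay = [n for n, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [n for n, shape in param_shapes.items() if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names and shapes.
shapes = {
    "tok_emb.weight": (512, 64),
    "block0.attn.weight": (64, 192),
    "block0.attn.bias": (192,),
    "block0.ln1.weight": (64,),
    "block0.ln1.bias": (64,),
}
groups = split_param_groups(shapes)
print(groups[0]["params"])
print(groups[1]["params"])
```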

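Q6.2 Are the m and v moments accumulating?

- After step 100, log \(\|m\|_F\) and \(\|v\|_F\) for one parameter tensor.
- If \(\|v\|_F\) is near zero at step 100: the optimizer's state isn't persisting across steps, which usually means the optimizer is being rebuilt. Fix: build it once, outside the training loop.
- If \(\|m\|_F = \|g\|_F\) exactly: the first moment is just the current gradient, so no averaging is happening (\(\beta_1\) is effectively 0, or \(t\) restarts each step so that the bias-corrected \(\hat m_1 = g_1\)). Check that the state persists and that the bias correction \(\hat m_t = m_t / (1 - \beta_1^t)\) uses the true step count.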
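The standard Adam moment updates can be traced on a single scalar to see what "accumulating" looks like. This is a minimal sketch with the usual defaults \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), which are assumptions about the course config; `adam_moments` is a hypothetical helper and expects a non-empty gradient list.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Run Adam's moment updates over a scalar gradient stream.

    Returns (m, v, m_hat, v_hat) after the last gradient.
    """
    m, v, t = 0.0, 0.0, 0
    for g in grads:
        t += 1
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    return m, v, m_hat, v_hat

grads = [1.0] * 100
_, v_persistent, _, _ = adam_moments(grads)       # state kept for 100 steps
_, v_rebuilt, _, _ = adam_moments(grads[-1:])     # state reset before the step
print(v_persistent > 50 * v_rebuilt)
```

With a constant unit gradient, the persistent second moment is roughly 0.095 after 100 steps, while a freshly rebuilt optimizer holds only \((1 - \beta_2) g^2 = 0.001\), which is the near-zero signature described above.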

Q6.3 Is gradient accumulation summing correctly?

- If you accumulate \(k\) micro-batches with gradients \(g_1, \dots, g_k\), the effective gradient is \(\frac{1}{k} \sum_{i=1}^{k} g_i\), not \(\sum_{i=1}^{k} g_i\).
- If grad-norm is exactly \(k\)× expected: you forgot the /k. Fix.
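The factor-of-\(k\) symptom in Q6.3 can be reproduced with plain lists. A minimal sketch, assuming each micro-batch gradient is a flat list of floats and that each micro-batch loss is already a mean over its own tokens; `accumulate` is a hypothetical helper.

```python
import math

def accumulate(micro_grads, divide=True):
    """Accumulate k micro-batch gradients, optionally scaling each by 1/k."""
    k = len(micro_grads)
    scale = 1.0 / k if divide else 1.0
    total = [0.0] * len(micro_grads[0])
    for g in micro_grads:
        total = [a + scale * b for a, b in zip(total, g)]
    return total

def l2(vec):
    return math.sqrt(sum(x * x for x in vec))

micro = [[1.0, 2.0]] * 4                  # k = 4 identical micro-batches
correct = accumulate(micro, divide=True)
buggy = accumulate(micro, divide=False)
print(correct, buggy, l2(buggy) / l2(correct))
```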


When you reach the bottom

If all checks pass and the run is still misbehaving, the issue is dynamics, not configuration. Move to docs/phase-19-training-dynamics/stability-check.md.

If a check pointed you to a fix, apply it, re-run, and re-walk from §1. Most stable runs pass the entire tree in one read.

Numerical thresholds — quick reference card

| Symptom | Threshold | First action |
|---|---|---|
| loss(step 0) far from \(\ln V\) | \(\lvert \Delta \rvert > 0.5\) | check init |
| grad-norm(step 1) huge | \(> 100\) | check init / loss reduction |
| grad-norm rolling | \(> 5\) | reduce LR 2× |
| grad-norm rolling | \(< 0.01\) | check loss mask |
| val − train gap | \(> 1.0\) by step 500 | increase weight_decay |
| layer activation growth | \(> 10\times\) input | check normalization |
| \(\lVert v \rVert_F\) at step 100 | \(\approx 0\) | optimizer state lost |

Sibling doc: docs/phase-19-training-dynamics/stability-check.md — for spikes, NaN, fp16 overflow, divergence.