Phase 18 — Stability check decision tree
When a run behaves strangely, don't improvise. Walk this tree from top to bottom. Each node is a question with an explicit numeric threshold; each leaf leads to a concrete fix or to Phase 19.
A runnable checklist a learner walks down when a Phase-18 training run misbehaves. Use this before opening Phase 19 — many "training is broken" symptoms are configuration issues, not dynamics issues.
How to use
- Have dashboard.html (or your live metrics) open.
- Have experiments/<run>/manifest.json open (seed, versions, config).
- Walk from §1 → §6 in order. Don't skip. The order is by likelihood × cheapness of check.
- The first leaf that matches is your fix. Apply, re-run, re-walk from §1.
§1 — Did the run start?
Q1.1 Is the loss at step 0 within \(\ln V \pm 0.5\) where \(V\) is vocab size?
- For §A13 vocab \(\approx 512\): expect \(\ln 512 \approx 6.24\).
- If loss(step 0) > 8.0: init is broken or the LM head bias is non-zero. Go §5.
- If loss(step 0) < 4.0: the model is already partly trained. Either the seed is wrong, or you loaded a checkpoint by mistake. Check manifest.json.
- Else: pass, go Q1.2.
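The \(\ln V\) target comes from the cross-entropy of a uniform prediction over \(V\) tokens: \(-\ln(1/V) = \ln V\). A minimal sketch of the Q1.1 check follows; the function name and return strings are hypothetical, and the hard limits 8.0 and 4.0 are the ones quoted above for a vocab of about 512.

```python
import math

def check_initial_loss(loss_step0: float, vocab_size: int) -> str:
    """Classify the step-0 loss against ln(V), following Q1.1."""
    expected = math.log(vocab_size)  # cross-entropy of a uniform guess over V tokens
    if loss_step0 > 8.0:
        return "fail: init or LM head bias (go to §5)"
    if loss_step0 < 4.0:
        return "fail: model looks already trained (check manifest.json)"
    if abs(loss_step0 - expected) <= 0.5:
        return "pass"
    return "borderline: outside ln(V) ± 0.5 but inside the hard limits"

print(round(math.log(512), 2))        # 6.24
print(check_initial_loss(6.3, 512))   # pass
```

For a different vocab size, rescale the hard limits around \(\ln V\) rather than reusing 8.0 and 4.0.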
Q1.2 Is the first-batch gradient norm in \([0.1, 10]\)?
- If grad-norm(step 1) > 100: init is wrong (see §5) or your loss has the wrong reduction (mean vs sum mixed up).
- If grad-norm(step 1) < 0.001: data is degenerate. Either every target token is the same, or the loss mask is all-zero.
- Else: pass, go §2.
§2 — Is the LR schedule applying?
Q2.1 Plot LR vs step. Does the curve match the configured warmup + cosine?
- If LR is constant ≠ 0: scheduler never stepped. Common bug: optimizer is rebuilt each step, losing the schedule state. Fix: instantiate the optimizer + scheduler once outside the training loop.
- If LR is constant = 0: warmup is configured but the multiplier is being applied to a param_group["lr"] of 0. Fix: initialize param_group["lr"] = lr_max and let the scheduler scale it.
- If LR jumps discontinuously: you're using step decay, not cosine. Match the config.
- Else: pass, go Q2.2.
Q2.2 At step warmup_steps, is LR within 5% of lr_max?
- If LR at step W is < 0.95 × lr_max: likely an off-by-one in the warmup formula. t / W vs (t+1) / W matters at small W.
- Else: pass, go §3.
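To make the expected curve concrete, here is a minimal sketch of linear warmup followed by cosine decay. It is not necessarily the formula your config uses: the function name, the lr_min parameter, and the choice of (step + 1) / W are assumptions for illustration.

```python
import math

def lr_at(step: int, lr_max: float, warmup_steps: int,
          total_steps: int, lr_min: float = 0.0) -> float:
    """Linear warmup to lr_max, then cosine decay to lr_min (hypothetical helper)."""
    if step < warmup_steps:
        # (step + 1) / W gives a non-zero LR at step 0 and reaches lr_max
        # on the last warmup step; step / W would start at exactly 0.
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

W, T, LR_MAX = 10, 100, 3e-4
curve = [lr_at(s, LR_MAX, W, T) for s in range(T + 1)]
# Q2.2: LR at step W should be within 5% of lr_max.
assert abs(curve[W] - LR_MAX) <= 0.05 * LR_MAX
```

Plot curve against the LR logged by the run. A flat line or a staircase where this curve is smooth points at one of the Q2.1 leaves.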
§3 — Are gradients sane?
Q3.1 Plot the pre-clip gradient norm as a rolling mean over a 20-step window.
- If rolling-mean(grad-norm) > 5.0: training is in the "always clipping" regime. Either LR is too high, or batch size is too small, or one batch has a pathological example. Reduce lr_max by 2× and re-walk.
- If rolling-mean(grad-norm) < 0.01: dead training. Either the loss mask zeros out most tokens, or all parameters are frozen, or the LR is so small no updates land.
- Else: pass.
Q3.2 Any single step with grad-norm > 100?
- If yes once in 200 steps: acceptable, clipping handled it. Investigate that batch — note its index.
- If yes more than 3 times in 200 steps: instability. Go to Phase 19 stability-check.md §2 (loss spikes).
- Else: pass, go §4.
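Both §3 checks can be run on a logged list of pre-clip gradient norms. The sketch below is one way to do it; the function name, the returned keys, and the verdict strings are hypothetical, while the thresholds are the ones from Q3.1 and Q3.2.

```python
def grad_norm_report(norms, window=20, spike=100.0, lookback=200):
    """Summarize pre-clip gradient norms against the Q3.1 and Q3.2 thresholds."""
    norms = list(norms)
    if not norms:
        raise ValueError("need at least one logged gradient norm")
    tail = norms[-window:]
    rolling = sum(tail) / len(tail)  # rolling mean over the latest window
    start = max(0, len(norms) - lookback)
    spike_steps = [i for i in range(start, len(norms)) if norms[i] > spike]

    if rolling > 5.0:
        q31 = "always clipping"
    elif rolling < 0.01:
        q31 = "dead"
    else:
        q31 = "pass"

    if len(spike_steps) > 3:
        q32 = "instability"
    elif spike_steps:
        q32 = "investigate"  # note the step index and inspect that batch
    else:
        q32 = "pass"
    return {"rolling_mean": rolling, "q3_1": q31,
            "spike_steps": spike_steps, "q3_2": q32}

# One isolated spike at step 150 in an otherwise calm run.
report = grad_norm_report([1.0] * 150 + [150.0] + [1.0] * 49)
print(report["q3_1"], report["q3_2"], report["spike_steps"])
```

The spike indices are steps, so they can be mapped back to batch indices when you investigate.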
§4 — Are losses moving?
Q4.1 Plot loss with a 50-step EMA. Does it monotonically decrease over the first 200 steps?
- If train-loss(step 200) > train-loss(step 50): training is diverging or static. Check LR (§2), check optimizer state (§6).
- If train-loss decreases but slope is <0.001 per 100 steps: LR may be too low. Try lr_max × 3 and re-walk.
- Else: pass.
Q4.2 Is val loss tracking train loss within 0.5?
- If val − train > 1.0 by step 500: overfitting fast. Increase weight_decay (try 0.2), reduce capacity, or add data (within §A13 scope, that means more conjugation variants).
- If train − val > 0.3: likely a data leak. Check whether the val set contains train items.
- Else: pass, go §5.
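A minimal sketch of the §4 checks follows. It assumes "50-step EMA" means the common span convention \(\alpha = 2 / (N + 1)\); your dashboard may smooth differently. The function names and verdict strings are hypothetical.

```python
def ema(values, span=50):
    """Exponential moving average with alpha = 2 / (span + 1) (assumed convention)."""
    alpha = 2.0 / (span + 1)
    out, avg = [], None
    for v in values:
        avg = v if avg is None else alpha * v + (1.0 - alpha) * avg
        out.append(avg)
    return out

def gap_verdict(train_loss: float, val_loss: float) -> str:
    """Apply the Q4.2 thresholds to one (train, val) pair."""
    gap = val_loss - train_loss
    if gap > 1.0:
        return "overfitting"
    if gap < -0.3:
        return "possible leak"
    return "pass"

# Synthetic, steadily decreasing train loss for steps 0..200.
train = [6.0 - 0.01 * step for step in range(201)]
smoothed = ema(train)
# Q4.1: the smoothed loss at step 200 should be below the value at step 50.
print(smoothed[200] < smoothed[50])
print(gap_verdict(train_loss=2.0, val_loss=3.2))
```

Comparing smoothed values instead of raw ones keeps a single noisy batch from triggering the "diverging" leaf.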
§5 — Is the model architecturally sound?
Q5.1 Print initial weight statistics: mean, std, min, max for each named parameter group.
- Expected (with Kaiming for ReLU/GELU or Xavier scaled):
  - Linear weight std: \(\sqrt{2 / \text{fan\_in}}\) (Kaiming) or \(\sqrt{2 / (\text{fan\_in} + \text{fan\_out})}\) (Xavier).
  - For our mini-GPT: linear weight std should be in \([0.02, 0.1]\).
  - Embedding std: 0.02 (the GPT convention).
  - Bias and LayerNorm scale: 0 and 1 respectively.
- If std is 100× expected: the bad-init failure from Phase 19. Go to Phase 19.
- If |mean| > 0.01 on Linear weights: re-init.
Q5.2 Forward one batch, log activation magnitudes at each layer.
- Expected: activation Frobenius norm grows by at most ~1.5× per layer.
- If a layer's output is > 10× its input: missing or broken normalization. Check LN/RMSNorm placement.
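The Q5.1 expectations can be computed before looking at the real weights. The sketch below uses only the standard library and a synthetic weight matrix; the layer width of 256 and the helper names are assumptions for illustration, not values from the mini-GPT config.

```python
import math
import random

def kaiming_std(fan_in: int) -> float:
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in: int, fan_out: int) -> float:
    return math.sqrt(2.0 / (fan_in + fan_out))

def weight_stats(values):
    """Mean, std, min, max of a flat list of weights."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return {"mean": mean, "std": std, "min": min(values), "max": max(values)}

fan_in = fan_out = 256           # hypothetical layer width
expected = kaiming_std(fan_in)   # about 0.088, inside [0.02, 0.1]
rng = random.Random(0)
weights = [rng.gauss(0.0, expected) for _ in range(fan_in * fan_out)]

stats = weight_stats(weights)
ratio = stats["std"] / expected  # about 1 when init is sane, about 100 for the bad-init failure
print(round(expected, 4), round(xavier_std(fan_in, fan_out), 4))
print(abs(stats["mean"]) < 0.01, 0.5 < ratio < 2.0)
```

On a real model, flatten each named parameter into a list (or use your framework's mean and std) and compare the measured std with the expected one for that layer's fan-in.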
§6 — Is the optimizer state coherent?
Q6.1 Are param_groups correctly split for weight decay?
- 2-D+ tensors (weights, embeddings): should be in the decay group with weight_decay = 0.1.
- 1-D tensors (biases, LN scale/bias): should be in the no_decay group with weight_decay = 0.0.
- If everything is in one group: biases and LN scales are decayed toward zero over training. Fix the split.
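The split rule depends only on tensor rank, so it can be sketched without a framework. The parameter names and shapes below are hypothetical; with a real model you would read the rank from each parameter and put the tensors themselves, not their names, in the "params" lists.

```python
def split_for_weight_decay(named_shapes, weight_decay=0.1):
    """Group parameter names by rank: 2-D+ tensors decay, 1-D tensors do not."""
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        if len(shape) >= 2:
            decay.append(name)      # weights and embeddings
        else:
            no_decay.append(name)   # biases, LN scale/bias
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "tok_emb.weight": (512, 128),
    "blocks.0.attn.qkv.weight": (384, 128),
    "blocks.0.attn.qkv.bias": (384,),
    "ln_f.weight": (128,),
    "ln_f.bias": (128,),
}
groups = split_for_weight_decay(shapes)
for group in groups:
    print(group["weight_decay"], group["params"])
```

A list of dicts with "params" and "weight_decay" keys mirrors the param-group layout that PyTorch optimizers accept, which makes the result easy to compare with optimizer.param_groups.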
Q6.2 Are m and v moments accumulating?
- After step 100, log \(\|m\|_F\) and \(\|v\|_F\) for one parameter tensor.
- If \(\|v\|_F\) is near zero at step 100: the optimizer's state isn't persisting across steps. The optimizer is being rebuilt. Fix.
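- If \(\|m\|_F = \|g\|_F\) exactly: the first moment is not averaging across steps (\(\beta_1 = 0\), or the state is reset every step). Also confirm that bias correction is applied: \(\hat m_t = m_t / (1 - \beta_1^t)\).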
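To get a feel for the magnitudes to expect, here is a minimal sketch of the Adam moment updates for a single scalar parameter with a constant gradient. The betas 0.9 and 0.999 are assumed defaults, not values read from your config, and the function name is hypothetical.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return (m, v, m_hat, v_hat) after each step for a scalar parameter."""
    m = v = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first moment (running mean of g)
        v = beta2 * v + (1.0 - beta2) * g * g    # second moment (running mean of g^2)
        m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
        history.append((m, v, m_hat, v_hat))
    return history

persisted = adam_moments([1.0] * 100)   # state kept across 100 steps
rebuilt = adam_moments([1.0])           # what every step looks like if state is reset

print(round(persisted[-1][1], 4))  # v after 100 steps: 1 - 0.999**100, about 0.095
print(round(rebuilt[-1][1], 4))    # v after a reset: 1 - beta2 = 0.001
```

With these betas, a persisted \(v\) reaches roughly 10% of \(g^2\) by step 100, while a state that is rebuilt every step never leaves 0.1%.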
Q6.3 Is gradient accumulation summing correctly?
- If you accumulate \(k\) micro-batches, the effective gradient is the mean \(\frac{1}{k}\sum_{i=1}^{k} g_i\), not the sum \(\sum_{i=1}^{k} g_i\).
- If grad-norm is exactly \(k\)× expected: you forgot the /k. Fix.
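The factor of \(k\) is easy to reproduce with plain lists standing in for gradient tensors. This is a minimal sketch with hypothetical helper names; it assumes each micro-batch loss is already mean-reduced.

```python
import math

def accumulate(micro_grads, average=True):
    """Combine k micro-batch gradients; average=False reproduces the missing /k bug."""
    k = len(micro_grads)
    scale = 1.0 / k if average else 1.0
    total = [0.0] * len(micro_grads[0])
    for grad in micro_grads:
        for i, g in enumerate(grad):
            total[i] += g * scale
    return total

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

k = 4
micro = [[3.0, 4.0]] * k  # k identical micro-batch gradients, each with norm 5
print(l2_norm(accumulate(micro, average=True)))   # 5.0
print(l2_norm(accumulate(micro, average=False)))  # 20.0, exactly k times too large
```

In a training loop the same effect is usually achieved by dividing each micro-batch loss by \(k\) before the backward pass.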
When you reach the bottom
If all checks pass and the run is still misbehaving, the issue is dynamics, not configuration. Move to docs/phase-19-training-dynamics/stability-check.md.
If a check pointed you to a fix, apply it, re-run, and re-walk from §1. Most stable runs pass the entire tree in one read.
Numerical thresholds — quick reference card
| Symptom | Threshold | First action |
|---|---|---|
| loss(step 0) far from \(\ln V\) | \(\|\Delta\| > 0.5\) | check init |
| grad-norm(step 1) huge | \(> 100\) | check init / loss reduction |
| grad-norm rolling | \(> 5\) | reduce LR 2× |
| grad-norm rolling | \(< 0.01\) | check loss mask |
| val − train gap | \(> 1.0\) by step 500 | increase weight_decay |
| layer activation growth | \(> 10\times\) input | check normalization |
| \(\|v\|_F\) at step 100 | \(\approx 0\) | optimizer state lost |
Sibling doc: docs/phase-19-training-dynamics/stability-check.md — for spikes, NaN, fp16 overflow, divergence.