Lab 01 — Scaling-law experiment¶
Three small runs (5M, 25M, 100M) on the same fixed FLOP budget. We fit the Hoffmann curve and predict the compute-optimal token count for a hypothetical 1B model. The goal is not to reproduce Chinchilla; it is to measure the principle in miniature with our own data.
Goal¶
Train three small models (5M, 25M, 100M params) on a fixed FLOP budget. Fit the Hoffmann parametric form \(L(N, D) = E + A/N^\alpha + B/D^\beta\) to the resulting \((N, D, L)\) triples. Use the fit to predict the Chinchilla-optimal token count for a hypothetical 1B-param model.
Why three sizes?¶
Hoffmann 2022 used >50 IsoFLOP points to fit five parameters. We have three points and want to fit five parameters — the system is underdetermined.
That's the lesson: scaling-law fits at small scale produce error bars wide enough to drive a truck through. The exercise is to measure that: quote the prediction with an honest confidence interval, which will span a factor of at least ~2-3 on either side of the point estimate.
Configuration¶
| Run | N (non-embed) | d_model | n_layer | n_head | Wall-clock target | Tokens (D) |
|---|---|---|---|---|---|---|
| A | 5M | 256 | 6 | 4 | ~30 min | 4.5B |
| B | 25M | 512 | 6 | 8 | ~30 min | 900M |
| C | 100M | 1024 | 8 | 16 | ~30 min | 225M |
Total wall-clock: ~2 hours on the same A100 80GB. Total cost: ~$2.20. (Both figures are derived below; an earlier ~12-hour, ~$13 plan assumed a larger compute budget per point.)
Fixed FLOP budget per run. All three runs target the same compute budget \(C = 6ND\). Anchoring on Run A's token count gives \(C = 6 \cdot 5 \times 10^6 \cdot 4.5 \times 10^9 = 1.35 \times 10^{17}\) FLOPs. For the runs to be IsoFLOP at fixed \(C\):
- Run A: \(6 \cdot 5 \times 10^6 \cdot D = C \Rightarrow D = C / (3 \times 10^7)\)
- Run B: \(D = C / (1.5 \times 10^8)\)
- Run C: \(D = C / (6 \times 10^8)\)
Setting \(C = 1.35 \times 10^{17}\) FLOPs (about 15 min of compute at 150 TFLOP/s):
- Run A: \(D \approx 4.5 \times 10^9 = 4.5\)B tokens.
- Run B: \(D \approx 0.9 \times 10^9 = 900\)M tokens.
- Run C: \(D \approx 0.225 \times 10^9 = 225\)M tokens.
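The token budgets above follow directly from \(D = C / (6N)\). A minimal sketch of that arithmetic, assuming the \(C \approx 6ND\) approximation used throughout this lab; `isoflop_tokens` and `RUNS` are illustrative names, not part of the lab's codebase:

```python
# IsoFLOP token budget: D = C / (6 * N), using the C ~ 6*N*D approximation.
# `isoflop_tokens` and `RUNS` are hypothetical names for illustration.

C_BUDGET = 1.35e17  # FLOPs per run, from the text

RUNS = {"A": 5e6, "B": 25e6, "C": 100e6}  # non-embedding parameter counts


def isoflop_tokens(n_params: float, flops: float = C_BUDGET) -> float:
    """Tokens that spend `flops` on a model with `n_params` parameters."""
    return flops / (6.0 * n_params)


budgets = {run: isoflop_tokens(n) for run, n in RUNS.items()}

for run, n in RUNS.items():
    d = budgets[run]
    print(f"Run {run}: N={n:.3g}  D={d:.3g} tokens  D/N={d / n:.3g}")
```

Changing `C_BUDGET` rescales all three token counts together, which is what keeps the runs on one IsoFLOP curve.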
That gives us three points on an IsoFLOP curve. At the IsoFLOP-minimum we expect to see the optimal \(N\) for this \(C\). Run A is under-parameterized (D/N = 900, way over-trained); Run C is over-parameterized (D/N = 2.25, severely under-trained); Run B (D/N = 36) should be near optimal.
The IsoFLOP profile is approximately a parabola in log-\(N\); the bottom of the parabola is \(N_\text{opt}\).
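With exactly three points, the parabola in log-\(N\) is an interpolation rather than a fit, and its vertex has a closed form. The sketch below is one way to locate it; `isoflop_minimum` is a hypothetical helper and the losses are placeholders to be replaced with measured val losses:

```python
import math

# Vertex of the parabola through three (log10 N, loss) points.
# `isoflop_minimum` is a hypothetical helper; the losses below are placeholders.


def isoflop_minimum(points):
    """Return (N_opt, L_min) for exactly three (N, loss) points."""
    (x1, y1), (x2, y2), (x3, y3) = [(math.log10(n), loss) for n, loss in points]
    # Closed-form vertex of the interpolating parabola (three-point formula).
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    if den == 0:
        raise ValueError("points are collinear; no parabola vertex")
    x_opt = x2 - 0.5 * num / den
    # Evaluate the parabola at the vertex with Lagrange interpolation.
    l_min = (
        y1 * (x_opt - x2) * (x_opt - x3) / ((x1 - x2) * (x1 - x3))
        + y2 * (x_opt - x1) * (x_opt - x3) / ((x2 - x1) * (x2 - x3))
        + y3 * (x_opt - x1) * (x_opt - x2) / ((x3 - x1) * (x3 - x2))
    )
    return 10 ** x_opt, l_min


# Placeholder losses only; substitute the measured val losses from your runs.
example = [(5e6, 4.65), (25e6, 4.21), (100e6, 4.35)]
n_opt, l_min = isoflop_minimum(example)
print(f"N_opt ~ {n_opt:.3g} params, interpolated loss ~ {l_min:.3f}")
```

The vertex is a minimum only when the middle run has the lowest loss; if Run A or Run C comes out lowest, the optimum lies outside the sampled range and the vertex should not be trusted.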
Revised wall-clock estimates¶
At 150 TFLOP/s sustained:
- \(C = 1.35 \times 10^{17}\) FLOPs → 900 s ≈ 15 min compute per run.
- Add ~5 min warmup + ~5 min val + ~5 min checkpoint: ~30 min wall-clock per run.
- Three runs: ~90 min wall-clock total (~45 min of it compute).
- Plus initial setup / data load / final eval: ~2 hours total wall-clock.
Revised cost: 2 h × $1.10/hr ≈ $2.20 for the IsoFLOP triple. (The original $13 estimate was for a larger compute budget per point; we drop it to keep lab 01 cheap.)
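The estimate can be reproduced with a few lines of arithmetic, which makes it easy to re-run if your sustained throughput or hourly price differs. This is a sketch under the numbers quoted above; the 30-minute setup allowance is inferred from the ~2 hour total and is an assumption:

```python
# Back-of-envelope wall-clock and cost for the three IsoFLOP runs.
# All constants come from the text except SETUP_MIN, which is inferred
# from "~2 hours total" minus the ~90 min of runs (an assumption).

C_BUDGET = 1.35e17          # FLOPs per run
THROUGHPUT = 150e12         # sustained FLOP/s
OVERHEAD_MIN = 5 + 5 + 5    # warmup + val + checkpoint, minutes per run
N_RUNS = 3
SETUP_MIN = 30              # setup / data load / final eval (inferred)
PRICE_PER_HOUR = 1.10       # USD, A100 80GB spot

compute_min = C_BUDGET / THROUGHPUT / 60
run_min = compute_min + OVERHEAD_MIN
total_hours = (N_RUNS * run_min + SETUP_MIN) / 60
cost_usd = total_hours * PRICE_PER_HOUR

print(f"compute per run: {compute_min:.0f} min")
print(f"wall-clock per run: {run_min:.0f} min")
print(f"total: {total_hours:.1f} h, ~${cost_usd:.2f}")
```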
Updated budget¶
| Line | $-cost | Notes |
|---|---|---|
| 1× A100 80GB spot, Lambda, 2 h × $1.10/hr | $2.20 | Three IsoFLOP runs |
| Persistent disk (reused from lab 00) | $0 | Already attached |
| Buffer | $0.80 | Restart slack |
| Lab 01 ceiling | $3 | Cheap |
If you ran lab 00 first and re-use the same instance + data + checkpoints dir, lab 01 fits in $3.
The data¶
Reuse the FineWeb-Edu tokenized binary from lab 00 (/workspace/data/tokenized/train.bin).
Run A's 4.5B tokens fit inside the 10B-token sample without any repetition: 4.5B / 10B = 0.45 epochs of the full sample, well under one epoch.
The training command (per run)¶
Same script as lab 00, different config:
# Run A (5M params, 4.5B tokens)
python -m x1_pretrain.train \
--config configs/x1-isoflop-A.yaml \
--total-steps 8800 \
--eval-every 880 \
--mlflow-uri file:///workspace/mlruns \
--mlflow-experiment isoflop-A
# Run B (25M params, 900M tokens)
python -m x1_pretrain.train \
--config configs/x1-isoflop-B.yaml \
--total-steps 1760 \
--eval-every 176 \
--mlflow-uri file:///workspace/mlruns \
--mlflow-experiment isoflop-B
# Run C (100M params, 225M tokens)
python -m x1_pretrain.train \
--config configs/x1-isoflop-C.yaml \
--total-steps 440 \
--eval-every 44 \
--mlflow-uri file:///workspace/mlruns \
--mlflow-experiment isoflop-C
Step counts assume effective batch tokens = 512K from lab 00:
- Run A: 4.5B / 512K ≈ 8800 steps.
- Run B: 900M / 512K ≈ 1760 steps.
- Run C: 225M / 512K ≈ 440 steps.
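It is worth checking how far the rounded step counts drift from the token targets, since any drift moves a run slightly off the IsoFLOP curve. A small sketch, assuming "512K" means 512,000 tokens per step (if lab 00 uses 524,288, change the constant); the variable names are illustrative:

```python
# Optimizer steps needed to consume a token budget at a fixed batch size.
# Assumes "512K" means 512,000 tokens per step (if lab 00 uses 2**19 = 524,288,
# change TOKENS_PER_STEP accordingly). Names here are illustrative.

TOKENS_PER_STEP = 512_000

TARGET_TOKENS = {"A": 4.5e9, "B": 900e6, "C": 225e6}
LAB_STEPS = {"A": 8800, "B": 1760, "C": 440}  # --total-steps from the commands

report = {}
for run, target in TARGET_TOKENS.items():
    exact_steps = target / TOKENS_PER_STEP
    consumed = LAB_STEPS[run] * TOKENS_PER_STEP
    overshoot = consumed / target - 1.0
    report[run] = (exact_steps, consumed, overshoot)
    print(f"Run {run}: exact={exact_steps:.1f} steps, "
          f"lab={LAB_STEPS[run]} steps -> {consumed / 1e9:.3f}B tokens "
          f"({overshoot:+.2%} vs target)")
```

When logging \(D\) to runs.csv, prefer the tokens actually consumed over the nominal target so the fit sees the real \((N, D)\) pair.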
Expected val losses¶
Predictions from the Hoffmann fit (theory/01-scaling-laws.md, with \(E=1.69\), \(A=406.4\), \(\alpha=0.34\), \(B=410.7\), \(\beta=0.28\)):
| Run | N | D | Predicted L | Realistic range |
|---|---|---|---|---|
| A | 5M | 4.5B | 4.65 | [4.4, 5.0] |
| B | 25M | 900M | 4.21 | [4.2, 4.6] |
| C | 100M | 225M | 4.35 | [4.3, 4.7] |
Note that Run B's predicted loss is the lowest of the three, which matches the IsoFLOP-minimum intuition. The realistic ranges are wide because (a) the fit was made at much larger scale and extrapolates worse at small \(N\), and (b) FineWeb-Edu is not the data Hoffmann et al. fit on (they used MassiveText).
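The predicted-loss column can be recomputed directly from the parametric form, which is a useful sanity check before spending GPU time. A minimal sketch using the published constants quoted above; `hoffmann_loss` is a hypothetical helper name:

```python
# Predicted loss from the Hoffmann parametric form L = E + A/N^alpha + B/D^beta,
# using the published constants quoted in the text. `hoffmann_loss` is a
# hypothetical helper name.

E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

RUNS = {"A": (5e6, 4.5e9), "B": (25e6, 900e6), "C": (100e6, 225e6)}


def hoffmann_loss(n_params: float, n_tokens: float) -> float:
    """Evaluate the parametric loss at one (N, D) point."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


predicted = {run: hoffmann_loss(n, d) for run, (n, d) in RUNS.items()}
for run, loss in predicted.items():
    print(f"Run {run}: predicted L = {loss:.2f}")
```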
The fit¶
After all three runs complete, fit the Hoffmann form to the (N, D, L) triples.
python -m x1_pretrain.fit_scaling_law \
--input experiments/X1-pretraining/scaling-law/runs.csv \
--output experiments/X1-pretraining/scaling-law/fit.json \
--bootstrap 1000
The script:
1. Reads (N, D, L) from runs.csv.
2. Fits \(L = E + A/N^\alpha + B/D^\beta\) via scipy.optimize.curve_fit.
3. Bootstraps 1000 resamples (resamples residuals; with only 3 points the CI is dominated by the residual noise).
4. Outputs fit.json with point estimates and 95% CIs for each of (E, A, α, B, β).
5. Renders iso-flop.png showing the three points and the fitted curve.
Expected output (illustrative):
{
"E": {"point": 2.1, "ci95": [1.4, 2.9]},
"A": {"point": 380, "ci95": [120, 950]},
"alpha": {"point": 0.32, "ci95": [0.20, 0.46]},
"B": {"point": 290, "ci95": [80, 700]},
"beta": {"point": 0.26, "ci95": [0.15, 0.42]}
}
The CIs are wide because three points cannot pin down five parameters. That's expected and is the lesson.
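One way to see the underdetermination concretely: if \(\alpha\) and \(\beta\) are held fixed, the model is linear in \((E, A, B)\), and three runs determine those three numbers exactly with zero residual, leaving no degrees of freedom to estimate noise. The sketch below illustrates that reduced problem only; it is not what fit_scaling_law does, the exponents are pinned to the published values as an assumption, and the losses are synthetic:

```python
# Why three points are not enough: with alpha and beta FIXED (here at the
# published Hoffmann values, an assumption), L = E + A*N^-alpha + B*D^-beta is
# linear in (E, A, B), and three runs determine those three numbers exactly.
# This is an illustration, not what fit_scaling_law does.

ALPHA, BETA = 0.34, 0.28


def solve_3x3(m, rhs):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0] * 3
    for i in (2, 1, 0):
        s = sum(a[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (a[i][3] - s) / a[i][i]
    return x


def fit_linear_part(runs):
    """runs: three (N, D, L) triples -> [E, A, B] with zero residual."""
    design = [[1.0, n**-ALPHA, d**-BETA] for n, d, _ in runs]
    return solve_3x3(design, [loss for _, _, loss in runs])


# Synthetic losses generated from the published constants, so the solve
# should recover E=1.69, A=406.4, B=410.7.
truth = (1.69, 406.4, 410.7)
points = [(5e6, 4.5e9), (25e6, 900e6), (100e6, 225e6)]
runs = [(n, d, truth[0] + truth[1] * n**-ALPHA + truth[2] * d**-BETA)
        for n, d in points]
e_hat, a_hat, b_hat = fit_linear_part(runs)
print(f"E={e_hat:.3f}  A={a_hat:.1f}  B={b_hat:.1f}")
```

Any other choice of the two exponents also yields an exact solve with different \((E, A, B)\), which is why the five-parameter CIs come out so wide.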
The prediction¶
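Given the fit, predict \(D_\text{opt}\) for \(N = 1\)B:
\[D_\text{opt}(N) = \left(\frac{\beta B}{\alpha A}\, N^{\alpha}\right)^{1/\beta}\]
(Derivation: minimize \(L(N, D)\) subject to fixed \(C = 6ND\); the stationarity condition \(\alpha A / N^\alpha = \beta B / D^\beta\) gives the IsoFLOP-optimal scaling.)
Substituting the point estimates:
\[D_\text{opt}(10^9) = \left(\frac{0.26 \cdot 290}{0.32 \cdot 380}\,(10^9)^{0.32}\right)^{1/0.26}\]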
Compute step by step:
- \((10^9)^{0.32} = 10^{2.88} \approx 760\).
- Numerator: \(290 \cdot 0.26 \cdot 760 \approx 57{,}300\).
- Denominator: \(380 \cdot 0.32 = 121.6\).
- Ratio: \(\approx 471\).
- \(471^{1/0.26} \approx 471^{3.85}\): \(\log_{10}(471) \approx 2.673\) and \(2.673 \cdot 3.85 \approx 10.29\), so \(471^{3.85} \approx 10^{10.29} \approx 1.95 \times 10^{10}\).
So \(D_\text{opt}(10^9) \approx 20 \times 10^9 = 20\)B tokens — gives \(D/N \approx 20\), matching Chinchilla.
(The number you actually get from your fit may be 8B or 80B — anything in [3B, 100B] is within the 3-point CI. Quote the CI honestly.)
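The hand arithmetic can be checked in a few lines, and the same function can be reused on your own fit.json values. A sketch, assuming the closed form implied by the steps above, \(D_\text{opt} = (\beta B N^\alpha / (\alpha A))^{1/\beta}\); `d_opt` is a hypothetical helper name and the parameters are the illustrative point estimates, not measured results:

```python
# Compute-optimal tokens from fitted Hoffmann parameters:
#   D_opt(N) = (beta * B * N^alpha / (alpha * A)) ** (1 / beta)
# The parameter values are the illustrative point estimates from the text;
# `d_opt` is a hypothetical helper name.


def d_opt(n_params, a, alpha, b, beta):
    """Tokens that minimize L(N, D) at fixed C = 6*N*D for a given N."""
    return (beta * b * n_params**alpha / (alpha * a)) ** (1.0 / beta)


fit = {"a": 380.0, "alpha": 0.32, "b": 290.0, "beta": 0.26}
tokens = d_opt(1e9, **fit)
print(f"D_opt(1B) ~ {tokens:.3g} tokens, D/N ~ {tokens / 1e9:.1f}")
```

Carried at full precision the result is about \(1.9 \times 10^{10}\), slightly below the hand-rounded figure. Because of the \(1/\beta\) exponent, small shifts in the fitted parameters move \(D_\text{opt}\) by large factors, so evaluating `d_opt` across bootstrap samples is a natural way to get the CI.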
What this teaches¶
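- The headline number (20 tokens/param) is a plausible target, not something three points can confirm. The illustrative fit lands near the Chinchilla number even at small scale and on slightly different data, but any \(D_\text{opt}\) in [3B, 100B] is within the CI, so treat agreement with "20" as consistency rather than proof that the ratio is a structural property of language modeling.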
- The error bars are huge at small N. Predicting from 5M-100M to 1B is 2 orders of magnitude of extrapolation. The frontier-lab versions use 10-100B-param points and still quote uncertainty.
- IsoFLOP curves are fairly flat near the optimum. Run A and Run C sit 4-5× away from Run B in \(N\), yet their predicted loss is only about 0.1 nat (Run C) to a few tenths of a nat (Run A) worse. The penalty for being several times off-optimal is modest, which is why the field tolerated Kaplan's recipe for two years.
- Data quality interacts with the law. FineWeb-Edu's filtered tokens are individually more informative than MassiveText's. Your fitted \(E\) should be lower than Hoffmann's 1.69 by ~0.1-0.3, because the irreducible entropy of "high-quality educational text" is below that of "all web text."
Deliverables (DoD check 2)¶
- experiments/X1-pretraining/scaling-law/runs.csv with rows (run_id, N, D, val_loss, $-spent).
- experiments/X1-pretraining/scaling-law/fit.json with parameter estimates + CIs.
- experiments/X1-pretraining/scaling-law/iso-flop.png showing the three points and the fitted curve.
- experiments/X1-pretraining/scaling-law/prediction.md with:
  - Predicted \(D_\text{opt}(1\text{B})\) with 95% CI.
  - The implied \(D/N\) ratio with CI.
  - One paragraph comparing to the published Chinchilla number.
  - One paragraph honest about the extrapolation risk.
What can go wrong and how to respond¶
- Run C diverges. 100M on 225M tokens is severely under-trained — the model is too big for the data and the loss can be noisier than the fit assumes. Run anyway; quote the noisy loss; the wide CI absorbs it.
- Run A overfits the val set. With 4.5B tokens and only 1k val docs from the same distribution, val loss can dip below train loss. Use a held-out shard (last 10% of token range) — never overlap.
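- Throughput differs across runs. Smaller models often hit lower MFU because their per-step matmuls are too small to keep the matmul cores busy. For Run A, raise the micro-batch size batch_size and lower grad_accum so the effective batch stays at 512K tokens, or accept the lower throughput (the IsoFLOP target is fixed by \(C\), not by wall-clock).
- The fit fails to converge. scipy.optimize.curve_fit can fail with bad initial guesses. Use Hoffmann's published values as the initialization. Its default method (lm) also does not work with fewer data points than parameters, so with three points pass bounds (which selects trf) or hold some parameters fixed.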
End of Extension Module X1. After lab 01 completes, write learners/borja/extension-X1/reflections.md per the README DoD.