
Lab 01: Scaling-law experiment

Three small runs (5M, 25M, 100M) on the same fixed FLOP budget. We fit the Hoffmann curve and predict the compute-optimal token count for a hypothetical 1B model. The goal is not to reproduce Chinchilla; it is to measure the principle in miniature with our own data.

Goal

Train three small models (5M, 25M, 100M params) on a fixed FLOP budget. Fit the Hoffmann parametric form $L(N, D) = E + A/N^\alpha + B/D^\beta$ to the resulting $(N, D, L)$ triples. Use the fit to predict the Chinchilla-optimal token count for a hypothetical 1B-param model.

Why three sizes?

Hoffmann 2022 used >50 IsoFLOP points to fit five parameters. We have three points and want to fit five parameters — the system is underdetermined.

That's the lesson: scaling-law fits at small scale produce error bars wide enough to drive a truck through. The exercise is to measure that: quote the prediction with an honest confidence interval, which at this scale can span more than an order of magnitude.

Configuration

Run	N (non-embed)	d_model	n_layer	n_head	Wall-clock target	Tokens (D)
A	5M	256	6	4	~30 min	4.5B
B	25M	512	6	8	~30 min	900M
C	100M	1024	8	16	~30 min	225M
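As a sanity check on the N column, here is a minimal sketch that estimates non-embedding parameters from d_model and n_layer. It assumes a standard GPT-style block (attention plus a 4× MLP, no biases, layer norms ignored), which gives N ≈ 12 · n_layer · d_model²; the lab's actual model code may differ, and the helper name `non_embedding_params` is hypothetical.

```python
def non_embedding_params(d_model: int, n_layer: int) -> int:
    """Approximate non-embedding parameter count of a GPT-style stack.

    Assumption: each layer holds 4*d^2 attention weights and 8*d^2 MLP
    weights (4x hidden expansion); biases and layer norms are ignored.
    """
    return 12 * n_layer * d_model ** 2


# (d_model, n_layer) per run, taken from the configuration table.
configs = {"A": (256, 6), "B": (512, 6), "C": (1024, 8)}
param_counts = {run: non_embedding_params(d, l) for run, (d, l) in configs.items()}

for run, n in param_counts.items():
    print(f"Run {run}: ~{n / 1e6:.1f}M non-embedding params")
```

Under this approximation Runs A and C land close to their nominal sizes, while Run B comes out nearer 19M than 25M. It is worth confirming the count the training script reports before entering N in runs.csv, since N feeds directly into the fit.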

Total wall-clock: ~2 hours on a single A100 80GB. Total cost: ~$2.20. (An earlier draft of this lab budgeted ~12 hours and ~$13 for a larger per-run compute budget.)

Fixed FLOP budget per run. All three runs target $C \approx 1.35 \times 10^{17}$ FLOPs, using the approximation $C = 6ND$. That is about 15 min of compute at 150 TFLOP/s; the wall-clock targets also include warmup / val / checkpoint overhead. For the runs to be IsoFLOP at fixed $C$, each run's token count must satisfy:

Run A: $6 \cdot 5 \times 10^6 \cdot D = C \Rightarrow D = C / (3 \times 10^7)$
Run B: $D = C / (1.5 \times 10^8)$
Run C: $D = C / (6 \times 10^8)$

Setting $C = 1.35 \times 10^{17}$ FLOPs:

Run A: $D \approx 4.5 \times 10^9 = 4.5$B tokens.
Run B: $D \approx 0.9 \times 10^9 = 900$M tokens.
Run C: $D \approx 0.225 \times 10^9 = 225$M tokens.

That gives us three points on an IsoFLOP curve. At the IsoFLOP-minimum we expect to see the optimal $N$ for this $C$. Run A is under-parameterized (D/N = 900, way over-trained); Run C is over-parameterized (D/N = 2.25, severely under-trained); Run B (D/N = 36) should be near optimal.
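The token counts and D/N ratios above follow directly from $C = 6ND$. A short sketch that reproduces them (the helper name `tokens_for_budget` is hypothetical):

```python
C = 1.35e17  # FLOP budget per run
PARAMS = {"A": 5e6, "B": 25e6, "C": 100e6}  # non-embedding params per run


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Tokens D that spend `flops` on a model of size N, using C = 6*N*D."""
    return flops / (6.0 * n_params)


isoflop = {}
for run, n in PARAMS.items():
    d = tokens_for_budget(C, n)
    isoflop[run] = (d, d / n)  # (tokens, tokens per parameter)
    print(f"Run {run}: D = {d / 1e9:.3f}B tokens, D/N = {d / n:.2f}")
```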

The IsoFLOP profile is approximately a parabola in log-$N$; the bottom of the parabola is $N_\text{opt}$.
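With exactly three points, the parabola in log-$N$ is determined uniquely, so its vertex can be read off in closed form. The sketch below does this with Newton divided differences. The loss values are hypothetical placeholders; substitute your measured val losses. With only three points this is an interpolation, not a fit, so treat the resulting $N_\text{opt}$ as a rough locator.

```python
import math


def parabola_vertex(xs, ys):
    """Return the x-coordinate of the vertex of the quadratic through three points."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    # Newton divided differences: slopes, then curvature.
    f01 = (y1 - y0) / (x1 - x0)
    f12 = (y2 - y1) / (x2 - x1)
    a = (f12 - f01) / (x2 - x0)
    if a <= 0:
        raise ValueError("points are not convex; no interior minimum")
    # d/dx [y0 + f01*(x - x0) + a*(x - x0)*(x - x1)] = 0
    return 0.5 * (x0 + x1) - f01 / (2.0 * a)


params = [5e6, 25e6, 100e6]
losses = [4.65, 4.21, 4.35]  # hypothetical placeholders, not measured results

log_n = [math.log10(n) for n in params]
n_opt = 10 ** parabola_vertex(log_n, losses)
print(f"N_opt ~ {n_opt / 1e6:.0f}M params")
```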

Revised wall-clock estimates

At 150 TFLOP/s sustained:

- $C = 1.35 \times 10^{17}$ FLOPs → 900 s ≈ 15 min compute per run.
- Add ~5 min warmup + ~5 min val + ~5 min checkpoint ≈ 30 min wall-clock per run.
- Three runs: ~90 min wall-clock total (about 45 min of it pure compute).
- Plus initial setup / data load / final eval: ~2 hours total wall-clock.

Revised cost: 2 h × $1.10/hr ≈ $2.20 for the IsoFLOP triple. (Original $13 estimate is for a larger compute budget per point; we drop it to keep lab 01 cheap.)
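The same arithmetic as a small script, so you can re-run it with your own measured throughput or hourly price. The 30-minute setup figure is an assumption, chosen so the total matches the ~2 hours quoted above.

```python
C = 1.35e17             # FLOPs per run
THROUGHPUT = 150e12     # sustained FLOP/s
OVERHEAD_MIN = 5 + 5 + 5  # warmup + val + checkpoint, per run
N_RUNS = 3
SETUP_MIN = 30          # assumption: setup / data load / final eval
PRICE_PER_HOUR = 1.10   # USD

compute_min = C / THROUGHPUT / 60.0
per_run_min = compute_min + OVERHEAD_MIN
total_hours = (N_RUNS * per_run_min + SETUP_MIN) / 60.0
cost = total_hours * PRICE_PER_HOUR

print(f"compute per run: {compute_min:.0f} min")
print(f"wall-clock per run: {per_run_min:.0f} min")
print(f"total: {total_hours:.1f} h, ${cost:.2f}")
```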

Updated budget

Line	$-cost	Notes
1× A100 80GB spot, Lambda, 2 h × $1.10/hr	$2.20	Three IsoFLOP runs
Persistent disk (reused from lab 00)	$0	Already attached
Buffer	$0.80	Restart slack
Lab 01 ceiling	$3	Cheap

If you ran lab 00 first and re-use the same instance + data + checkpoints dir, lab 01 fits in $3.

The data

Reuse the FineWeb-Edu tokenized binary from lab 00 (/workspace/data/tokenized/train.bin).

Run A is the largest consumer at 4.5B tokens, which is 4.5B / 10B = 0.45 epochs of the 10B-token sample. Every run therefore stays well under one epoch and no data is repeated.

The training command (per run)

Same script as lab 00, different config:

# Run A (5M params, 4.5B tokens)
python -m x1_pretrain.train \
  --config configs/x1-isoflop-A.yaml \
  --total-steps 8800 \
  --eval-every 880 \
  --mlflow-uri file:///workspace/mlruns \
  --mlflow-experiment isoflop-A

# Run B (25M params, 900M tokens)
python -m x1_pretrain.train \
  --config configs/x1-isoflop-B.yaml \
  --total-steps 1760 \
  --eval-every 176 \
  --mlflow-uri file:///workspace/mlruns \
  --mlflow-experiment isoflop-B

# Run C (100M params, 225M tokens)
python -m x1_pretrain.train \
  --config configs/x1-isoflop-C.yaml \
  --total-steps 440 \
  --eval-every 44 \
  --mlflow-uri file:///workspace/mlruns \
  --mlflow-experiment isoflop-C

Step counts assume effective batch tokens = 512K from lab 00:

- Run A: 4.5B / 512K ≈ 8800 steps.
- Run B: 900M / 512K ≈ 1760 steps.
- Run C: 225M / 512K ≈ 440 steps.
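A minimal sketch of the step arithmetic, in case you change the batch size. It assumes "512K" means 512,000 tokens per optimizer step and that evaluation runs ten times per run, which is the pattern in the commands above; the `schedule` helper is hypothetical.

```python
BATCH_TOKENS = 512_000  # assumption: "512K" read as 512,000 tokens per step
TOKENS = {"A": 4.5e9, "B": 900e6, "C": 225e6}


def schedule(total_tokens: float, batch_tokens: int = BATCH_TOKENS, n_evals: int = 10):
    """Return (total_steps, eval_every) for a given token budget."""
    steps = round(total_tokens / batch_tokens)
    return steps, max(1, steps // n_evals)


schedules = {run: schedule(d) for run, d in TOKENS.items()}
for run, (steps, eval_every) in schedules.items():
    print(f"Run {run}: --total-steps {steps} --eval-every {eval_every}")
```

The commands round these to 8800, 1760 and 440. If lab 00's effective batch is actually 2^19 = 524,288 tokens, the exact counts drop by roughly 2%, so recompute rather than reusing the rounded values.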

Expected val losses

Predictions from the Hoffmann fit (theory/01-scaling-laws.md, with $E=1.69$, $A=406.4$, $\alpha=0.34$, $B=410.7$, $\beta=0.28$):

Run	N	D	Predicted L	Realistic range
A	5M	4.5B	4.65	[4.4, 5.0]
B	25M	900M	4.21	[4.0, 4.4]
C	100M	225M	4.35	[4.2, 4.6]

Note that Run B's predicted loss is the lowest, which matches the IsoFLOP-minimum intuition. The realistic ranges are wide because (a) the fit was made at much larger scale and extrapolates worse at small $N$, and (b) FineWeb-Edu is not the data Hoffmann et al. fit on (they used MassiveText).
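To check predictions like these yourself, the sketch below evaluates the Hoffmann form at each run's $(N, D)$ using the published parameters quoted above. The function name `hoffmann_loss` is hypothetical.

```python
# Published Hoffmann parameters as quoted in this lab.
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28


def hoffmann_loss(n, d, e=E, a=A, alpha=ALPHA, b=B, beta=BETA):
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return e + a / n ** alpha + b / d ** beta


# (N, D) per run on the IsoFLOP curve.
RUNS = {"A": (5e6, 4.5e9), "B": (25e6, 900e6), "C": (100e6, 225e6)}
predicted = {run: hoffmann_loss(n, d) for run, (n, d) in RUNS.items()}

for run, loss in predicted.items():
    print(f"Run {run}: predicted L = {loss:.2f}")
```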

The fit

After all three runs complete, fit the Hoffmann form to the (N, D, L) triples.

python -m x1_pretrain.fit_scaling_law \
  --input experiments/X1-pretraining/scaling-law/runs.csv \
  --output experiments/X1-pretraining/scaling-law/fit.json \
  --bootstrap 1000

The script:

1. Reads (N, D, L) from runs.csv.
2. Fits $L = E + A/N^\alpha + B/D^\beta$ via scipy.optimize.curve_fit.
3. Bootstraps 1000 resamples (resamples residuals; with only 3 points the CI is dominated by the residual noise).
4. Outputs fit.json with point estimates and 95% CIs for each of (E, A, α, B, β).
5. Renders iso-flop.png showing the three points and the fitted curve.

Expected output (illustrative):

{
  "E":     {"point": 2.1, "ci95": [1.4, 2.9]},
  "A":     {"point": 380, "ci95": [120, 950]},
  "alpha": {"point": 0.32, "ci95": [0.20, 0.46]},
  "B":     {"point": 290, "ci95": [80, 700]},
  "beta":  {"point": 0.26, "ci95": [0.15, 0.42]}
}

The CIs are wide because three points cannot pin down five parameters. That's expected and is the lesson.
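One way to see the underdetermination concretely: once $\alpha$ and $\beta$ are fixed, the model is linear in $(E, A, B)$, so three points give a 3×3 system with an exact solution for any choice of exponents. The sketch below is pure Python and is not the lab's fit script. It generates synthetic losses from the published Hoffmann parameters, then shows that two different exponent pairs both reproduce those losses exactly. The second pair (0.30, 0.30) is arbitrary, and the helper names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_linear_terms(runs, alpha, beta):
    """Solve exactly for (E, A, B) at fixed exponents using Cramer's rule."""
    rows = [[1.0, n ** -alpha, d ** -beta] for n, d, _ in runs]
    rhs = [loss for _, _, loss in runs]
    base = det3(rows)
    solution = []
    for col in range(3):
        replaced = [row[:col] + [rhs[k]] + row[col + 1:] for k, row in enumerate(rows)]
        solution.append(det3(replaced) / base)
    return tuple(solution)


def loss(n, d, e, a, alpha, b, beta):
    return e + a / n ** alpha + b / d ** beta


# Synthetic (N, D, L) triples generated from the published parameters.
points = [(5e6, 4.5e9), (25e6, 900e6), (100e6, 225e6)]
runs = [(n, d, loss(n, d, 1.69, 406.4, 0.34, 410.7, 0.28)) for n, d in points]

fit_a = solve_linear_terms(runs, 0.34, 0.28)  # recovers the generating values
fit_b = solve_linear_terms(runs, 0.30, 0.30)  # different, but fits equally well

for name, (alpha, beta), (e, a, b) in [("a", (0.34, 0.28), fit_a), ("b", (0.30, 0.30), fit_b)]:
    worst = max(abs(loss(n, d, e, a, alpha, b, beta) - l) for n, d, l in runs)
    print(f"fit {name}: E={e:.3f} A={a:.1f} B={b:.1f} max residual={worst:.2e}")
```

Both parameter sets interpolate the three points, so the data alone cannot choose between them. This is why the reported CIs depend heavily on the initialization, bounds, and noise model.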

The prediction

Given the fit, predict $D_\text{opt}$ for $N = 1$B:

\[D_\text{opt}(N) = \left( \frac{B \beta N^\alpha}{A \alpha} \right)^{1/\beta}, \qquad N = 10^9\]

(Derivation: minimize $L(N, D)$ at fixed $C = 6ND$. Substituting $D = C/(6N)$ and setting $dL/dN = 0$ gives $\alpha A N^{-\alpha} = \beta B D^{-\beta}$; solving for $D$ gives the IsoFLOP-optimal expression above.)

Substituting the point estimates:

\[D_\text{opt}(10^9) = \left( \frac{290 \cdot 0.26 \cdot (10^9)^{0.32}}{380 \cdot 0.32} \right)^{1/0.26}\]

Compute step by step:

- $(10^9)^{0.32} = 10^{2.88} \approx 760$.
- Numerator: $290 \cdot 0.26 \cdot 760 \approx 57{,}300$.
- Denominator: $380 \cdot 0.32 = 121.6$.
- Ratio: $\approx 471$.
- $471^{1/0.26} = 471^{3.85}$: $\log_{10}(471) \approx 2.673$ and $2.673 \cdot 3.85 \approx 10.29$, so $471^{3.85} \approx 10^{10.29} \approx 1.95 \times 10^{10}$.

So $D_\text{opt}(10^9) \approx 2 \times 10^{10} = 20$B tokens, i.e. $D/N \approx 20$, matching Chinchilla.

(The number you actually get from your fit may be 8B or 80B — anything in [3B, 100B] is within the 3-point CI. Quote the CI honestly.)
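A short sketch for evaluating the prediction from your own fit, using the IsoFLOP-optimality condition $\alpha A N^{-\alpha} = \beta B D^{-\beta}$ solved for $D$. The parameter values are the illustrative point estimates from the example fit.json, not measured results, and the function name `d_opt` is hypothetical.

```python
def d_opt(n: float, a: float, alpha: float, b: float, beta: float) -> float:
    """Compute-optimal tokens for model size n under L = E + A/N^alpha + B/D^beta."""
    return (b * beta * n ** alpha / (a * alpha)) ** (1.0 / beta)


# Illustrative point estimates from the example fit.json.
fit = {"A": 380.0, "alpha": 0.32, "B": 290.0, "beta": 0.26}

n_target = 1e9
d_star = d_opt(n_target, fit["A"], fit["alpha"], fit["B"], fit["beta"])
print(f"D_opt(1B) ~ {d_star / 1e9:.1f}B tokens, D/N ~ {d_star / n_target:.1f}")
```

To get an interval rather than a point, call `d_opt` once per bootstrap draw of the parameters and take the 2.5th and 97.5th percentiles of the results. Note that $E$ does not enter the formula, so its uncertainty does not affect $D_\text{opt}$.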

What this teaches

The headline number (about 20 tokens/param) is plausible even at this scale. The illustrative fit above, from three small-scale points on slightly different data, lands near the Chinchilla number. That suggests the "20" reflects language modeling itself rather than a quirk of MassiveText, although the CI on a three-point fit is too wide to confirm it.
The error bars are huge at small N. Predicting from 5M-100M to 1B extrapolates 10× beyond the largest run and 200× beyond the smallest. The frontier-lab versions use 10-100B-param points and still quote uncertainty.
IsoFLOP curves are fairly flat near the optimum. Run C, with 4× the parameters of Run B, is predicted to be only about a tenth of a nat worse; Run A, with one fifth the parameters, pays a few tenths. The penalty for being a few times off-optimal in $N$ is modest, which is why the field tolerated Kaplan's recipe for two years.
Data quality interacts with the law. FineWeb-Edu is filtered more aggressively than MassiveText, so you might expect your fitted $E$ to come out below Hoffmann's 1.69 by ~0.1-0.3, on the argument that the irreducible entropy of "high-quality educational text" is below that of "all web text." Treat this as a hypothesis: with three points the CI on $E$ is too wide to test it.

Deliverables (DoD check 2)

experiments/X1-pretraining/scaling-law/runs.csv with rows (run_id, N, D, val_loss, $-spent).
experiments/X1-pretraining/scaling-law/fit.json with parameter estimates + CIs.
experiments/X1-pretraining/scaling-law/iso-flop.png showing the three points and the fitted curve.
experiments/X1-pretraining/scaling-law/prediction.md with:
Predicted $D_\text{opt}(1\text{B})$ with 95% CI.
The implied $D/N$ ratio with CI.
One paragraph comparing to the published Chinchilla number.
One paragraph honest about the extrapolation risk.

What can go wrong and how to respond

Run C diverges. 100M on 225M tokens is severely under-trained — the model is too big for the data and the loss can be noisier than the fit assumes. Run anyway; quote the noisy loss; the wide CI absorbs it.
Run A overfits the val set. With 4.5B tokens and only 1k val docs from the same distribution, val loss can dip below train loss. Use a held-out shard (last 10% of token range) — never overlap.
Throughput differs across runs. Smaller models often hit lower MFU because their per-step matmuls are too small to keep the matmul cores busy. For Run A, raise batch_size and lower grad_accum (keeping the effective batch at 512K tokens) to recover throughput, or accept it (the IsoFLOP target is fixed by $C$, not by wall-clock).
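The fit fails to converge. scipy.optimize.curve_fit can fail with bad initial guesses, so pass Hoffmann's published values as p0. With three points and five parameters, its default method (lm) also refuses to run, because it needs at least as many data points as parameters; supplying bounds switches curve_fit to trf, which does not have that restriction.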
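For the held-out shard mentioned above, a minimal sketch of splitting by token range. It only computes index boundaries; how those map onto train.bin (dtype, loader arguments) depends on lab 00's data code, so the function name and return shape are hypothetical.

```python
def split_token_range(n_tokens: int, val_fraction: float = 0.10):
    """Return ((train_start, train_end), (val_start, val_end)) as half-open ranges.

    The validation shard is the tail of the token stream, so it never
    overlaps the training range.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    val_tokens = int(round(n_tokens * val_fraction))
    val_start = n_tokens - val_tokens
    return (0, val_start), (val_start, n_tokens)


train_range, val_range = split_token_range(10_000_000_000)
print("train tokens:", train_range)
print("val tokens:  ", val_range)
```

With a 10B-token file this leaves 9B tokens for training, which still covers Run A's 4.5B without repetition. The training loader must be restricted to the train range for the split to mean anything.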

End of Extension Module X1. After lab 01 completes, write learners/borja/extension-X1/reflections.md per the README DoD.