Lab 01 — Scaling-law experiment

Three small runs (5M, 25M, 100M) under the same fixed FLOP budget. We fit the Hoffmann curve and predict the compute-optimal token count for a hypothetical 1B model. The goal is not to reproduce Chinchilla; it is to measure the principle in miniature with our own data.

Goal

Train three small models (5M, 25M, 100M params) on a fixed FLOP budget. Fit the Hoffmann parametric form \(L(N, D) = E + A/N^\alpha + B/D^\beta\) to the resulting \((N, D, L)\) triples. Use the fit to predict the Chinchilla-optimal token count for a hypothetical 1B-param model.

Why three sizes?

Hoffmann 2022 used >50 IsoFLOP points to fit five parameters. We have three points and want to fit five parameters — the system is underdetermined.

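That's the lesson: scaling-law fits at small scale produce error bars wide enough to drive a truck through. The exercise is to measure that: quote the prediction with an honest confidence interval, and expect its upper bound to be several times its lower bound (the illustrative range in the prediction section runs from 3B to 100B tokens).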

Configuration

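| Run | N (non-embed) | d_model | n_layer | n_head | Wall-clock target | Tokens (D) |
|-----|---------------|---------|---------|--------|-------------------|------------|
| A   | 5M            | 256     | 6       | 4      | 3 h               | 4.5B       |
| B   | 25M           | 512     | 6       | 8      | 4 h               | 3.0B       |
| C   | 100M          | 1024    | 8       | 16     | 5 h               | 1.5B       |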

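Under this initial plan, total wall-clock is ~12 hours on the same A100 80GB, at a total cost of ~$13.

Fixed FLOP budget per run. The initial plan targeted \(C \approx 1.35 \times 10^{18}\) FLOPs per run, with the wall-clock targets above including warmup / val / checkpoint overhead. Checking the table against \(C = 6ND\) shows that it misses that target and is not IsoFLOP:

  • Run A: \(6 \cdot 5 \times 10^6 \cdot 4.5 \times 10^9 = 1.35 \times 10^{17}\), ten times below the target.
  • Run B: \(6 \cdot 2.5 \times 10^7 \cdot 3.0 \times 10^9 = 4.5 \times 10^{17}\).
  • Run C: \(6 \cdot 10^8 \cdot 1.5 \times 10^9 = 9 \times 10^{17}\).

The token counts in the table are therefore superseded by the ones derived below. For the runs to be IsoFLOP at fixed \(C\):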

  • Run A: \(6 \cdot 5 \times 10^6 \cdot D = C \Rightarrow D = C / (3 \times 10^7)\)
  • Run B: \(D = C / (1.5 \times 10^8)\)
  • Run C: \(D = C / (6 \times 10^8)\)

Setting \(C = 1.35 \times 10^{17}\) FLOPs (about 15 min of compute at 150 TFLOP/s):

  • Run A: \(D \approx 4.5 \times 10^9 = 4.5\)B tokens.
  • Run B: \(D \approx 0.9 \times 10^9 = 900\)M tokens.
  • Run C: \(D \approx 0.225 \times 10^9 = 225\)M tokens.

That gives us three points on an IsoFLOP curve. At the IsoFLOP-minimum we expect to see the optimal \(N\) for this \(C\). Run A is under-parameterized (D/N = 900, way over-trained); Run C is over-parameterized (D/N = 2.25, severely under-trained); Run B (D/N = 36) should be near optimal.
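The token budgets and D/N ratios above follow directly from \(C = 6ND\). A minimal sketch that reproduces them; the names `isoflop_tokens` and `RUNS` are illustrative and not part of the lab's codebase:

```python
# IsoFLOP token budgets under the C = 6 * N * D approximation used in this lab.
C = 1.35e17  # fixed FLOP budget per run

# Non-embedding parameter counts for the three runs.
RUNS = {"A": 5e6, "B": 25e6, "C": 100e6}


def isoflop_tokens(n_params, flops=C):
    """Tokens D that spend `flops` on a model with `n_params` parameters."""
    return flops / (6.0 * n_params)


for run_id, n in RUNS.items():
    d = isoflop_tokens(n)
    print(f"Run {run_id}: N={n:.3g}  D={d:.3g} tokens  D/N={d / n:.3g}")
```

Changing `C` in one place rescales all three token budgets together, which is the property that keeps the runs on a single IsoFLOP curve.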

The IsoFLOP profile is approximately a parabola in log-\(N\); the bottom of the parabola is \(N_\text{opt}\).
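With exactly three points, the parabola in \(\log_{10} N\) is determined uniquely, so its vertex can be computed in closed form. The sketch below does that with Newton divided differences. The function name and the loss values are placeholders for illustration, not measurements; substitute your own val losses.

```python
import math


def isoflop_vertex(points):
    """Fit a parabola in x = log10(N) through exactly three (N, loss) points
    and return the N at its minimum."""
    (x0, y0), (x1, y1), (x2, y2) = sorted(
        (math.log10(n), loss) for n, loss in points
    )
    # Newton form: p(x) = y0 + f01*(x - x0) + f012*(x - x0)*(x - x1)
    f01 = (y1 - y0) / (x1 - x0)
    f12 = (y2 - y1) / (x2 - x1)
    f012 = (f12 - f01) / (x2 - x0)  # quadratic coefficient
    if f012 <= 0:
        raise ValueError("points are not convex in log N; no minimum to report")
    # Setting p'(x) = f01 + f012*(2x - x0 - x1) to zero gives the vertex.
    x_min = 0.5 * (x0 + x1) - f01 / (2.0 * f012)
    return 10.0 ** x_min


# Placeholder losses only: replace with the measured val loss of each run.
PLACEHOLDER_POINTS = [(5e6, 4.65), (25e6, 4.21), (100e6, 4.35)]
n_opt = isoflop_vertex(PLACEHOLDER_POINTS)
print(f"N_opt ~ {n_opt:.3g} params")
```

A parabola through three points has zero residual, so this gives a point estimate with no error bar of its own. If the vertex lands outside the range of trained sizes, treat it as an extrapolation rather than a measured optimum.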

Revised wall-clock estimates

At 150 TFLOP/s sustained:

  • \(C = 1.35 \times 10^{17}\) FLOPs → 900 s ≈ 15 min of compute per run.
  • Add ~5 min warmup + ~5 min val + ~5 min checkpoint: ~30 min wall-clock per run.
  • Three runs: ~90 min wall-clock total (~45 min of which is compute).
  • Plus initial setup / data load / final eval: ~2 hours total wall-clock.

Revised cost: 2 h × $1.10/hr ≈ $2.20 for the IsoFLOP triple. (The original $13 estimate was for a larger compute budget per point; we drop it to keep lab 01 cheap.)
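The wall-clock and cost figures are simple arithmetic, and a few lines make it easy to redo them if your sustained throughput or hourly price differs. This is a sketch using the numbers quoted above; `run_estimate` is an illustrative name, and the 2-hour billed total is the lab's own rounding, not something the code derives.

```python
def run_estimate(flops, sustained_flops_per_s=150e12, overhead_min=15.0):
    """Return (compute minutes, wall-clock minutes) for one run.

    overhead_min covers ~5 min warmup + ~5 min val + ~5 min checkpointing.
    """
    compute_min = flops / sustained_flops_per_s / 60.0
    return compute_min, compute_min + overhead_min


compute_min, wall_min = run_estimate(1.35e17)
three_runs_min = 3 * wall_min

billed_hours = 2.0       # three runs plus setup, data load, and final eval
price_per_hour = 1.10    # USD per hour, as quoted for the A100 spot instance
cost_usd = billed_hours * price_per_hour

print(f"per run: {compute_min:.0f} min compute, {wall_min:.0f} min wall-clock")
print(f"three runs: {three_runs_min:.0f} min; billed cost ~ ${cost_usd:.2f}")
```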

Updated budget

| Line | $-cost | Notes |
|------|--------|-------|
| 1× A100 80GB spot, Lambda, 2 h × $1.10/hr | $2.20 | Three IsoFLOP runs |
| Persistent disk (reused from lab 00) | $0 | Already attached |
| Buffer | $0.80 | Restart slack |
| Lab 01 ceiling | $3 | Cheap |

If you ran lab 00 first and re-use the same instance + data + checkpoints dir, lab 01 fits in $3.

The data

Reuse the FineWeb-Edu tokenized binary from lab 00 (/workspace/data/tokenized/train.bin).

Run A is the largest consumer at 4.5B tokens, which is 4.5B / 10B = 0.45 epochs of the 10B-token sample. Every run therefore stays well under one epoch and no data is repeated.

The training command (per run)

Same script as lab 00, different config:

# Run A (5M params, 4.5B tokens)
python -m x1_pretrain.train \
  --config configs/x1-isoflop-A.yaml \
  --total-steps 8800 \
  --eval-every 880 \
  --mlflow-uri file:///workspace/mlruns \
  --mlflow-experiment isoflop-A

# Run B (25M, 900M tokens)
python -m x1_pretrain.train \
  --config configs/x1-isoflop-B.yaml \
  --total-steps 1760 \
  --eval-every 176 \
  --mlflow-uri file:///workspace/mlruns \
  --mlflow-experiment isoflop-B

# Run C (100M, 225M tokens)
python -m x1_pretrain.train \
  --config configs/x1-isoflop-C.yaml \
  --total-steps 440 \
  --eval-every 44 \
  --mlflow-uri file:///workspace/mlruns \
  --mlflow-experiment isoflop-C

Step counts assume effective batch tokens = 512K from lab 00:

  • Run A: 4.5B / 512K ≈ 8800 steps.
  • Run B: 900M / 512K ≈ 1760 steps.
  • Run C: 225M / 512K ≈ 440 steps.
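The step counts can be recomputed from the token budgets if your effective batch differs. The sketch below assumes "512K" means 512,000 tokens per optimizer step, which is the reading that matches the rounded step counts above. If lab 00 actually uses 524,288 (2^19), pass that value instead and the counts drop by about 2.4%. `steps_for` is an illustrative helper, not part of `x1_pretrain`.

```python
BATCH_TOKENS = 512_000  # assumption: effective tokens per optimizer step


def steps_for(tokens, batch_tokens=BATCH_TOKENS):
    """Optimizer steps needed to consume roughly `tokens` training tokens."""
    return round(tokens / batch_tokens)


TOKEN_BUDGETS = {"A": 4.5e9, "B": 900e6, "C": 225e6}
# Rounded values passed to --total-steps in the commands above.
COMMAND_STEPS = {"A": 8800, "B": 1760, "C": 440}

for run_id, tokens in TOKEN_BUDGETS.items():
    exact = steps_for(tokens)
    used = COMMAND_STEPS[run_id] * BATCH_TOKENS
    print(f"Run {run_id}: {exact} steps exact; "
          f"{COMMAND_STEPS[run_id]} steps consume {used:,} tokens")
```

Under the 512,000-token assumption, the rounded step counts overshoot each token budget by well under 1%. Record the tokens actually consumed in runs.csv so the fit uses the true \(D\).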

Expected val losses

Predictions from the Hoffmann fit (theory/01-scaling-laws.md, with \(E=1.69\), \(A=406.4\), \(\alpha=0.34\), \(B=410.7\), \(\beta=0.28\)):

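| Run | N | D | Predicted L | Realistic range |
|-----|------|------|-------------|-----------------|
| A   | 5M   | 4.5B | 4.65        | [4.4, 5.0]      |
| B   | 25M  | 900M | 4.21        | [4.0, 4.4]      |
| C   | 100M | 225M | 4.35        | [4.2, 4.6]      |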

Note that Run B's predicted loss is the lowest of the three, which matches the IsoFLOP-minimum intuition. The realistic ranges are wide because (a) the fit was made at much larger scale and extrapolates worse at small \(N\), and (b) FineWeb-Edu is not the data Hoffmann et al. fit on (they used MassiveText).
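The predicted losses come from plugging each run's \((N, D)\) into the parametric form with the published coefficients. A minimal sketch for checking them or for trying other run sizes (`hoffmann_loss` is an illustrative name):

```python
def hoffmann_loss(n, d, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """L(N, D) = E + A / N**alpha + B / D**beta.

    Defaults are the Hoffmann coefficients quoted in this section.
    """
    return E + A / n ** alpha + B / d ** beta


RUNS = {"A": (5e6, 4.5e9), "B": (25e6, 900e6), "C": (100e6, 225e6)}

for run_id, (n, d) in RUNS.items():
    print(f"Run {run_id}: predicted L = {hoffmann_loss(n, d):.2f}")
```

With these coefficients the form evaluates to roughly 4.65, 4.21, and 4.35 for Runs A, B, and C. These are predictions from someone else's fit on different data, so use them as a sanity range for your val loss, not as a target.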

The fit

After all three runs complete, fit the Hoffmann form to the (N, D, L) triples.

python -m x1_pretrain.fit_scaling_law \
  --input experiments/X1-pretraining/scaling-law/runs.csv \
  --output experiments/X1-pretraining/scaling-law/fit.json \
  --bootstrap 1000

The script:

  1. Reads (N, D, L) from runs.csv.
  2. Fits \(L = E + A/N^\alpha + B/D^\beta\) via scipy.optimize.curve_fit.
  3. Bootstraps 1000 resamples (resamples residuals; with only 3 points the CI is dominated by the residual noise).
  4. Outputs fit.json with point estimates and 95% CIs for each of (E, A, α, B, β).
  5. Renders iso-flop.png showing the three points and the fitted curve.

Expected output (illustrative):

{
  "E":     {"point": 2.1, "ci95": [1.4, 2.9]},
  "A":     {"point": 380, "ci95": [120, 950]},
  "alpha": {"point": 0.32, "ci95": [0.20, 0.46]},
  "B":     {"point": 290, "ci95": [80, 700]},
  "beta":  {"point": 0.26, "ci95": [0.15, 0.42]}
}

The CIs are wide because three points cannot pin down five parameters. That's expected and is the lesson.
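To see the underdetermination concretely, note that once \((\alpha, \beta)\) are held fixed the model is linear in \((E, A, B)\), so three points determine those three values exactly. The sketch below is not what `fit_scaling_law` does. It is a stand-alone illustration, with hypothetical helper names and placeholder losses, showing that two quite different exponent pairs both reproduce the same three points with zero residual.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        tail = sum(a[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (a[i][3] - tail) / a[i][i]
    return x


def fit_with_fixed_exponents(points, alpha, beta):
    """Solve for (E, A, B) exactly, with alpha and beta held fixed."""
    matrix = [[1.0, n ** -alpha, d ** -beta] for n, d, _ in points]
    losses = [loss for _, _, loss in points]
    return tuple(solve3(matrix, losses))


def max_residual(points, params, alpha, beta):
    e, a, b = params
    return max(abs(e + a * n ** -alpha + b * d ** -beta - loss)
               for n, d, loss in points)


# Placeholder (N, D, L) triples: replace L with your measured val losses.
PLACEHOLDER_POINTS = [(5e6, 4.5e9, 4.65), (25e6, 900e6, 4.21), (100e6, 225e6, 4.35)]

FITS = {}
for alpha, beta in [(0.34, 0.28), (0.25, 0.40)]:
    params = fit_with_fixed_exponents(PLACEHOLDER_POINTS, alpha, beta)
    FITS[(alpha, beta)] = params
    resid = max_residual(PLACEHOLDER_POINTS, params, alpha, beta)
    print(f"alpha={alpha}, beta={beta}: E, A, B = {params}; max residual = {resid:.1e}")
```

Both fits pass through the data exactly, so the three points alone cannot prefer one exponent pair over the other. That freedom is what shows up as wide intervals on every parameter.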

The prediction

Given the fit, predict \(D_\text{opt}\) for \(N = 1\)B:

\[D_\text{opt}(N) = \left( \frac{B \beta N^\alpha}{A \alpha} \right)^{1/\beta}\]

(Derivation: minimize \(L(N, D)\) over \(N\) and \(D\) subject to \(C = 6ND\). The stationarity condition is \(\alpha A / N^\alpha = \beta B / D^\beta\); solving for \(D\) gives the expression above, which is already in tokens.)

Substituting the point estimates at \(N = 10^9\):

\[D_\text{opt}(10^9) = \left( \frac{290 \cdot 0.26 \cdot (10^9)^{0.32}}{380 \cdot 0.32} \right)^{1/0.26}\]

Compute step by step:

  • \((10^9)^{0.32} = 10^{2.88} \approx 759\).
  • Numerator: \(290 \cdot 0.26 \cdot 759 \approx 57{,}200\).
  • Denominator: \(380 \cdot 0.32 = 121.6\).
  • Ratio: \(\approx 470\).
  • \(470^{1/0.26}\): \(\log_{10}(470) \approx 2.672\) and \(2.672 / 0.26 \approx 10.28\), so \(10^{10.28} \approx 1.9 \times 10^{10}\).

So \(D_\text{opt}(10^9) \approx 1.9 \times 10^{10} = 19\)B tokens, which gives \(D/N \approx 19\), close to the Chinchilla rule of thumb of ~20.
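The same calculation as a function, so that it can be applied to your own fit.json values and to each bootstrap resample. This is a sketch: `d_opt` is an illustrative name, and the numbers passed in are the illustrative point estimates from the expected output, not real results.

```python
def d_opt(n, A, alpha, B, beta):
    """Compute-optimal token count for a model with n parameters.

    Follows from alpha * A / N**alpha = beta * B / D**beta, the stationarity
    condition for minimizing L(N, D) under C = 6 * N * D.
    """
    return (B * beta * n ** alpha / (A * alpha)) ** (1.0 / beta)


n = 1e9
tokens = d_opt(n, A=380, alpha=0.32, B=290, beta=0.26)
print(f"D_opt ~ {tokens:.3g} tokens, D/N ~ {tokens / n:.1f}")
```

Applying `d_opt` to every bootstrap parameter set and taking the 2.5th and 97.5th percentiles of the results is one straightforward way to get the 95% CI on \(D_\text{opt}\). The exponent \(1/\beta\) amplifies small changes in \(\beta\), so expect that interval to be wide.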

(The number you actually get from your fit may be 8B or 80B — anything in [3B, 100B] is within the 3-point CI. Quote the CI honestly.)

What this teaches

  1. The headline number (~20 tokens/param) is a reasonable prior. If your point estimate lands near it despite three small-scale points and different data, that is consistent with the ratio reflecting something structural about language modeling rather than a quirk of MassiveText. Three points cannot prove it, so the CI remains the honest statement.
  2. The error bars are huge at small N. Predicting from 5M-100M to 1B extrapolates one order of magnitude beyond the largest run and more than two beyond the smallest. The frontier-lab versions use 10-100B-param points and still quote uncertainty.
  3. IsoFLOP curves are nearly flat near the optimum. Run C is 4× larger than Run B, yet its predicted loss is only ~0.1 nat worse; Run A, 5× smaller, is predicted to be a few tenths of a nat worse. The penalty for being a few × off-optimal is modest, which is why the field tolerated Kaplan's recipe for two years.
  4. Data quality interacts with the law. FineWeb-Edu's filtered tokens are individually more informative than MassiveText's. Your fitted \(E\) should be lower than Hoffmann's 1.69 by ~0.1-0.3, because the irreducible entropy of "high-quality educational text" is below that of "all web text."

Deliverables (DoD check 2)

  • experiments/X1-pretraining/scaling-law/runs.csv with rows (run_id, N, D, val_loss, $-spent).
  • experiments/X1-pretraining/scaling-law/fit.json with parameter estimates + CIs.
  • experiments/X1-pretraining/scaling-law/iso-flop.png showing the three points and the fitted curve.
  • experiments/X1-pretraining/scaling-law/prediction.md with:
      • Predicted \(D_\text{opt}(1\text{B})\) with 95% CI.
      • The implied \(D/N\) ratio with CI.
      • One paragraph comparing to the published Chinchilla number.
      • One paragraph honest about the extrapolation risk.

What can go wrong and how to respond

  • Run C diverges. 100M on 225M tokens is severely under-trained — the model is too big for the data and the loss can be noisier than the fit assumes. Run anyway; quote the noisy loss; the wide CI absorbs it.
  • Run A overfits the val set. With 4.5B tokens and only 1k val docs from the same distribution, val loss can dip below train loss. Use a held-out shard (last 10% of token range) — never overlap.
  • Throughput differs across runs. Smaller models often hit lower MFU because each matmul is too small to keep the tensor cores busy. For Run A, raise the per-device batch_size and lower grad_accum by the same factor (keeping the effective batch at 512K tokens) to recover throughput, or accept it (the IsoFLOP target is fixed by \(C\), not by wall-clock).
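  • The fit fails to converge. scipy.optimize.curve_fit can fail with bad initial guesses. Use Hoffmann's published values as the initialization (p0). Also note that its default unbounded solver (method='lm') does not work with fewer data points than parameters, which is exactly the three-points, five-parameters case here; passing bounds (which selects method='trf') or holding some parameters fixed avoids that error.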
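For the held-out shard mentioned in the list above, the split is easiest to keep honest when it is computed once from the total token count and used by both the train and val loaders. A minimal sketch, assuming the tokenized binary is addressed by token index; `split_token_range` is a hypothetical helper, not part of `x1_pretrain`:

```python
def split_token_range(total_tokens, val_percent=10):
    """Return half-open (start, end) token ranges for train and val.

    The val range is the last `val_percent` percent of the token range,
    so the two ranges never overlap.
    """
    if not 0 < val_percent < 100:
        raise ValueError("val_percent must be strictly between 0 and 100")
    val_tokens = total_tokens * val_percent // 100  # integer math, no float rounding
    val_start = total_tokens - val_tokens
    return (0, val_start), (val_start, total_tokens)


train_range, val_range = split_token_range(10_000_000_000)
print("train:", train_range, "val:", val_range)
```

With a 10B-token sample this leaves 9B tokens for training, which still covers Run A's 4.5B-token budget without repetition.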

End of Extension Module X1. After lab 01 completes, write learners/borja/extension-X1/reflections.md per the README DoD.