01 — Scaling Laws¶
Scaling laws are the only quantitative tool we have for deciding, before spending $1M, what shape the model should have. Chinchilla (Hoffmann 2022) replaced the earlier Kaplan 2020 rule with a simple piece of advice: for each parameter, ~20 training tokens. Under-training is expensive; over-training is expensive; the optimum is narrow.
A scaling law is a fitted curve that predicts model quality (typically validation loss) from compute, parameter count, or training tokens. The field has converged on a small set of canonical fits.
The compute primitive: \(C \approx 6 N D\)¶
For a dense transformer trained with the standard forward+backward+optimizer-update loop:
\[ C \approx 6 N D \]
where \(C\) is total training FLOPs, \(N\) is non-embedding parameter count, \(D\) is training tokens. (Kaplan et al. 2020, eq. 2.1; reproduced by Hoffmann 2022.)
Derivation sketch:
- Forward pass: \(\approx 2ND\) FLOPs (one multiply + one add per parameter per token).
- Backward pass: \(\approx 4ND\) FLOPs (two passes of forward-equivalent work).
- Sum: \(6ND\).
This excludes the attention's \(O(L^2)\) term, which matters for long-context models. For \(L \lesssim 2048\) and the typical \(N \gtrsim 100\text{M}\), the \(6ND\) approximation is within ~5%. For \(L = 32\text{k}+\) at small \(N\), it can be off by 30%.
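A minimal Python sketch of this accounting. The attention term assumes the per-token forward count from Kaplan et al. 2020 (about \(2 \cdot n_\text{layer} \cdot n_\text{ctx} \cdot d_\text{attn}\) FLOPs); the function names and the example shape (32 layers, width 4096) are illustrative assumptions, not values from this lab.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Dense-transformer training compute, C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def attention_overhead(n_layer: int, n_ctx: int, d_attn: int, n_params: float) -> float:
    """Fraction of compute that 6ND misses by ignoring attention over the context.

    Assumes Kaplan-style counting: 2 * n_layer * n_ctx * d_attn forward FLOPs
    per token, versus 2 * N for the weight matmuls. The factor of 3 for
    forward + backward cancels in the ratio.
    """
    return (n_layer * n_ctx * d_attn) / n_params


# Hypothetical shape: 32 layers, d_model = d_attn = 4096, d_ff = 4 * d_model,
# which gives N ~ 12 * n_layer * d_model**2 non-embedding parameters.
n_layer, d_model = 32, 4096
n_params = 12 * n_layer * d_model**2

short_ctx = attention_overhead(n_layer, 2048, d_model, n_params)
print(f"N = {n_params:.2e} params")
print(f"attention overhead at L=2048: {short_ctx:.1%}")
```

Under these assumptions the overhead grows linearly with context length and shrinks as the model gets wider, which is why the approximation degrades for long contexts at small \(N\).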
Kaplan 2020: the first scaling law¶
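Kaplan et al. (OpenAI, 2020, "Scaling Laws for Neural Language Models") fit:
\[ L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D} \]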
with \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), and fitted critical scales \(N_c, D_c\). Their predicted compute-optimal allocation:
\[ N_\text{opt} \propto C^{0.73}, \qquad D_\text{opt} \propto C^{0.27} \]
GPT-3 (175B params) was trained on ~300B tokens (Brown et al. 2020), a ratio of \(D/N \approx 1.7\). This is the Kaplan recipe in practice.
This was wrong. More precisely, Kaplan et al. used a learning-rate schedule whose terminal step was not matched to each run's training length, so their runs at the high-\(N\) end were under-trained and the fit was systematically biased toward "make \(N\) bigger". The error went unnoticed for two years.
Chinchilla (Hoffmann 2022): the correction¶
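Hoffmann et al. (DeepMind, 2022, "Training Compute-Optimal Large Language Models") repeated the experiment with three approaches (IsoFLOP profiles, parametric loss fit, fixed-model-size sweeps) and got consistent answers:
\[ N_\text{opt} \propto C^{a}, \qquad D_\text{opt} \propto C^{b}, \qquad a \approx b \approx 0.5 \]
That is, the compute-optimal model for budget \(C\) has:
\[ N_\text{opt}(C) \propto C^{0.5}, \qquad D_\text{opt}(C) \propto C^{0.5} \]
Because the two exponents are equal, the ratio \(D_\text{opt} / N_\text{opt}\) stays constant as \(C\) grows, and the fitted constant gives \(D_\text{opt} \approx 20 \cdot N_\text{opt}\).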
They demonstrated this by training Chinchilla (70B params, 1.4T tokens) vs Gopher (280B params, 300B tokens). Same compute. Chinchilla beat Gopher across the board. The ratio is 1.4T / 70B = 20.
Hoffmann's parametric fit:
\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]
with fitted constants \(E = 1.69\), \(A = 406.4\), \(B = 410.7\), \(\alpha = 0.34\), \(\beta = 0.28\). The constant \(E\) is the irreducible loss (the entropy of the data, modulo tokenization).
The 20:1 rule and its breakdown¶
The headline number is:
\[ D_\text{opt} \approx 20 \cdot N_\text{opt} \]
Under-trained regime (\(D < 20 N\)):
- Kaplan-style models (GPT-3 at \(D/N \approx 1.7\)) leave performance on the table.
- The marginal token is more valuable than the marginal parameter.
- This is the regime almost all production frontier models were in during 2020–2022.
Over-trained regime (\(D > 20 N\)):
- Llama-2 7B trained on 2T tokens (\(D/N \approx 298\) with \(N = 6.7\text{B}\)).
- Llama-3 8B trained on 15T tokens (\(D/N \approx 1875\)).
- Performance still improves but with diminishing returns.
- Why do it? Because inference cost dominates training cost for any model that ships. Smaller-but-overtrained is cheaper to serve than bigger-but-compute-optimal. The economics flip past ~1B inference tokens served.
This is the "inference-aware" optimum: train a smaller model on more data if you expect heavy inference traffic. Sardana & Frankle 2023 ("Beyond Chinchilla-Optimal") formalize this.
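A back-of-the-envelope version of this trade-off, assuming training costs \(6ND\) FLOPs and inference costs about \(2N\) FLOPs per served token (the forward-pass share of the \(6ND\) accounting). The function names and the two example configurations are hypothetical, and the sketch does not check that the two models reach comparable quality.

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Training (6ND) plus inference (2N per served token) compute."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens


def break_even_inference_tokens(n_small: float, d_small: float,
                                n_large: float, d_large: float) -> float:
    """Served tokens beyond which the smaller model has lower lifetime compute."""
    if n_small >= n_large:
        raise ValueError("n_small must be smaller than n_large")
    extra_training = 6.0 * (n_small * d_small - n_large * d_large)
    saving_per_token = 2.0 * (n_large - n_small)
    # A negative value means the small model is already cheaper to train.
    return max(0.0, extra_training / saving_per_token)


# Hypothetical pair: 7B params on 2T tokens vs 13B params on 1T tokens.
tokens = break_even_inference_tokens(7e9, 2e12, 13e9, 1e12)
print(f"break-even at {tokens:.2e} served tokens")
```

Past the break-even point every additional served token favors the smaller model, so the expected inference volume determines how far past 20:1 it is worth training.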
Data quality dominates past Chinchilla¶
Past the 20:1 ratio, the dominant term in further loss reduction is not more tokens — it is better tokens.
Evidence: FineWeb-Edu (Penedo et al. 2024, HuggingFace) filters CommonCrawl with an LLM-classifier scoring "educational value." A 1.8B-param model trained on 350B tokens of FineWeb-Edu beats the same model trained on 1T tokens of raw CommonCrawl on knowledge benchmarks (MMLU, ARC) by ~5 percentage points.
The data-quality scaling law (informal): doubling data quality is worth ~3-5× more raw data, up to a point. Past that, the LLM-classifier becomes a bias source.
Implication for X1: the lab uses FineWeb-Edu shards (or Pile-CC as a fallback), not raw CommonCrawl. The 50M-param model on 5B FineWeb-Edu tokens reaches lower val-loss than 50M on 5B raw-CommonCrawl tokens; the gap is measurable and reproducible.
Citations: the numbers behind the headline¶
| Model | Params (N) | Tokens (D) | D/N | Source |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Brown 2020, Table 2.1 |
| Gopher | 280B | 300B | 1.1 | Rae 2021, §2 |
| Chinchilla | 70B | 1.4T | 20 | Hoffmann 2022, Table 4 |
| LLaMA-1 7B | 6.7B | 1.0T | 149 | Touvron 2023a, Table 1 |
| Llama-2 7B | 6.7B | 2.0T | 298 | Touvron 2023b, Table 1 |
| Llama-3 8B | 8.0B | 15.0T | 1875 | Meta 2024 (model card) |
| Falcon-180B | 180B | 3.5T | 19 | Almazrouei 2023, Table 1 |
| OLMo-7B | 7B | 2.5T | 357 | Groeneveld 2024, Table 1 |
| Pythia-12B | 12B | 300B | 25 | Biderman 2023, Table 1 |
| Mistral-7B | 7.3B | ~8T (est.) | ~1100 | Jiang 2023 (D unconfirmed) |
The trend: 2020-era models were under-trained (D/N < 5). 2022 Chinchilla showed the optimum (D/N ≈ 20). Post-2023 frontier models are deliberately over-trained (D/N from 150 to 1900) because inference economics matter more than training economics for shipping products.
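The ratios in the table can be recomputed from the parameter and token counts. The sketch below takes those counts as given and also reports \(6ND\) for each row; Mistral-7B is left out because its token count is unconfirmed. The names `MODELS` and `summarize` are illustrative.

```python
# (name, parameters N, training tokens D), copied from the table above.
MODELS = [
    ("GPT-3", 175e9, 300e9),
    ("Gopher", 280e9, 300e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("LLaMA-1 7B", 6.7e9, 1.0e12),
    ("Llama-2 7B", 6.7e9, 2.0e12),
    ("Llama-3 8B", 8.0e9, 15.0e12),
    ("Falcon-180B", 180e9, 3.5e12),
    ("OLMo-7B", 7e9, 2.5e12),
    ("Pythia-12B", 12e9, 300e9),
]


def summarize(models):
    """Map each model name to (tokens per parameter, approximate training FLOPs)."""
    return {name: (d / n, 6.0 * n * d) for name, n, d in models}


for name, (ratio, flops) in summarize(MODELS).items():
    regime = "under-trained" if ratio < 20 else "at or past 20:1"
    print(f"{name:12s} D/N = {ratio:7.1f}   C ~ {flops:.2e} FLOPs   ({regime})")
```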
What to do with the law¶
Given a budget \(C\) (FLOPs you can afford):
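- Compute \(N_\text{opt} \approx 0.22 \cdot (C/6)^{0.5}\), \(D_\text{opt} \approx 4.5 \cdot (C/6)^{0.5}\). These follow from \(C = 6ND\) with \(D = 20N\): \(N_\text{opt} = \sqrt{C/120}\) and \(D_\text{opt} = 20 \cdot N_\text{opt}\).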
- If you plan to serve >\(N_\text{opt} \cdot 10^9\) inference tokens, shrink \(N\) by 2-4× and scale \(D\) to preserve \(C\) — you'll save inference cost.
- If you have a fixed dataset size \(D\) (e.g. all of high-quality Pile = ~825GB ≈ 200B tokens), pick the largest \(N\) with \(D \geq 20 N\): \(N_\text{max} = D / 20\).
- Always reserve ~10% of \(C\) for an ablation re-run if the first launch has a recoverable bug.
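The first and third steps can be sketched as follows, assuming \(C = 6ND\) and \(D = 20N\) hold exactly. The function names are hypothetical, and the 20:1 ratio is exposed as a parameter so an over-trained target can be substituted.

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (N, D) with D = tokens_per_param * N and C = 6ND."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def max_params_for_dataset(n_tokens: float, tokens_per_param: float = 20.0) -> float:
    """Largest N that still satisfies D >= tokens_per_param * N."""
    return n_tokens / tokens_per_param


# Sanity check against Chinchilla's own budget: 6 * 70B * 1.4T FLOPs.
n_opt, d_opt = chinchilla_optimal(6.0 * 70e9 * 1.4e12)
print(f"N_opt ~ {n_opt:.2e} params, D_opt ~ {d_opt:.2e} tokens")

# Fixed dataset of ~200B tokens.
print(f"N_max ~ {max_params_for_dataset(200e9):.2e} params")
```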
Worked example: the X1 lab numbers¶
Lab 00 trains 50M params for 24h on 1× A100 80GB at ~150 TFLOP/s sustained (bf16, MFU ~0.48 of A100's 312 TFLOP/s bf16 peak).
- Total compute: \(1.5 \times 10^{14}\) FLOP/s × \(8.64 \times 10^4\) s ≈ \(1.3 \times 10^{19}\) FLOP per 24 h.
- With \(C = 6 N D\), given \(N = 5 \times 10^7\): \(D = C / (6 N) = 1.3 \times 10^{19} / (6 \times 5 \times 10^7) \approx 4.3 \times 10^{10}\) tokens ≈ 43B tokens.
- That is \(D/N \approx 860\) — deliberately over-trained, matching the modern small-model recipe.
- The Chinchilla-optimal for \(C = 1.3 \times 10^{19}\) would be \(N \approx 330M\), \(D \approx 6.6B\) (from \(N = \sqrt{C/120}\) and \(D = 20N\)). We instead train a 50M model for 24h, which puts \(D/N\) at roughly 43× the Chinchilla ratio, the same order of magnitude as Llama-3-8B's 1875 at toy scale.
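The budget arithmetic above can be reproduced in a few lines. This is a sketch using the lab's stated inputs (150 TFLOP/s sustained, 24 h, 50M parameters); the function name and dictionary keys are illustrative.

```python
def lab_budget(sustained_flops_per_s: float, hours: float, n_params: float) -> dict:
    """Turn sustained throughput and wall-clock time into a token budget via C = 6ND."""
    seconds = hours * 3600.0
    compute = sustained_flops_per_s * seconds
    tokens = compute / (6.0 * n_params)
    return {
        "compute_flops": compute,
        "tokens": tokens,
        "tokens_per_param": tokens / n_params,
    }


budget = lab_budget(sustained_flops_per_s=150e12, hours=24, n_params=50e6)
for key, value in budget.items():
    print(f"{key:17s} {value:.3e}")
```

Changing `hours` or `sustained_flops_per_s` shows how sensitive the token budget is to throughput: at fixed \(N\), tokens scale linearly with both.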
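The lab then checks the measured val-loss against the Hoffmann fit with \(E=1.69\), \(A=406.4\), \(\alpha=0.34\), \(B=410.7\), \(\beta=0.28\), evaluated at \(N = 5 \times 10^7\) and \(D = 4.3 \times 10^{10}\):
\[ L(N, D) = 1.69 + \frac{406.4}{(5 \times 10^{7})^{0.34}} + \frac{410.7}{(4.3 \times 10^{10})^{0.28}} \]
≈ \(1.69 + 0.98 + 0.43 \approx 3.10\) nats/token.
This is the prediction the lab compares against. A run that lands within ±0.15 of 3.10 is a successful fit. A deviation greater than 0.4 indicates a bug.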
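A sketch of how the comparison might be automated: it evaluates the three terms of the Hoffmann fit separately and classifies a measured loss with the two thresholds stated above. The function names and the middle "investigate" band are assumptions; the text only defines the success and bug thresholds.

```python
def loss_terms(n_params: float, n_tokens: float,
               E: float = 1.69, A: float = 406.4, B: float = 410.7,
               alpha: float = 0.34, beta: float = 0.28):
    """Return (irreducible, parameter-limited, data-limited) loss terms in nats/token."""
    return E, A / n_params**alpha, B / n_tokens**beta


def classify_run(measured: float, predicted: float,
                 ok_tol: float = 0.15, bug_tol: float = 0.4) -> str:
    """Compare a measured val-loss with the fit's prediction."""
    gap = abs(measured - predicted)
    if gap <= ok_tol:
        return "successful fit"
    if gap > bug_tol:
        return "likely bug"
    return "investigate"  # assumed label for the band the text leaves undefined


terms = loss_terms(n_params=5e7, n_tokens=4.3e10)
predicted = sum(terms)
print("terms:", ", ".join(f"{t:.3f}" for t in terms))
print(f"predicted val-loss: {predicted:.3f} nats/token")
```

The Hoffmann constants were fitted on a different corpus and tokenizer, so the prediction is a reference point for this lab, not a guarantee.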
Next: theory/02-cluster-economics.md.