01 — Scaling Laws

Scaling laws are the only quantitative tool we have for deciding, before spending $1M, what shape the model should have. Chinchilla (Hoffmann 2022) replaced the earlier Kaplan 2020 rule with a simple prescription: roughly 20 training tokens per parameter. Under-training is expensive; over-training is expensive; the optimum is narrow.

A scaling law is a fitted curve that predicts model quality (typically validation loss) from compute, parameter count, or training tokens. The field has converged on a small set of canonical fits.

The compute primitive: $C \approx 6 N D$

For a dense transformer trained with the standard forward+backward+optimizer-update loop:

\[C \;\approx\; 6 \cdot N \cdot D\]

where $C$ is total training FLOPs, $N$ is non-embedding parameter count, $D$ is training tokens. (Kaplan et al. 2020, eq. 2.1; reproduced by Hoffmann 2022.)

Derivation sketch:

- Forward pass: $\approx 2ND$ FLOPs (one multiply + one add per parameter per token).
- Backward pass: $\approx 4ND$ FLOPs (two passes of forward-equivalent work).
- Sum: $6ND$.

This excludes attention's $O(L^2)$ term, where $L$ here is the context length (not the loss), which matters for long-context models. For $L \lesssim 2048$ and the typical $N \gtrsim 100\text{M}$, the $6ND$ approximation is within ~5%. For $L = 32\text{k}+$ at small $N$, it can be off by 30%.
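
The forward/backward accounting above can be sketched in a few lines of Python. The function name `training_flops` is illustrative and not part of any library; the attention term is ignored, as in the approximation itself.

```python
def training_flops(n_params: float, n_tokens: float) -> dict:
    """Approximate dense-transformer training FLOPs, ignoring attention's O(L^2) term."""
    forward = 2.0 * n_params * n_tokens   # one multiply + one add per parameter per token
    backward = 4.0 * n_params * n_tokens  # roughly two forward-equivalents of work
    return {"forward": forward, "backward": backward, "total": forward + backward}


# GPT-3 scale: N = 175B parameters, D = 300B tokens.
gpt3 = training_flops(175e9, 300e9)
print(f"{gpt3['total']:.2e} FLOPs")  # 3.15e+23 FLOPs
```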

Kaplan 2020: the first scaling law

Kaplan et al. (OpenAI, 2020, "Scaling Laws for Neural Language Models") fit:

\[L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}\]

with $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and fitted critical scales $N_c, D_c$. Their predicted compute-optimal ratio:

\[\frac{D_\text{opt}}{N_\text{opt}} \;\approx\; 1.7 \text{ tokens/param}\]

GPT-3 (175B params) was trained on ~300B tokens (Brown et al. 2020) — ratio ≈ 1.7. The Kaplan recipe.

This was wrong. More precisely: Kaplan's runs used a learning-rate schedule whose end point was not matched to each run's token budget, so the runs at the high-$N$ end were under-trained and the fit was systematically biased toward "make $N$ bigger". The error went unnoticed for two years.

Chinchilla (Hoffmann 2022): the correction

Hoffmann et al. (DeepMind, 2022, "Training Compute-Optimal Large Language Models") repeated the experiment with three approaches (IsoFLOP profiles, parametric loss fit, fixed-model-size sweeps) and got consistent answers:

\[\frac{D_\text{opt}}{N_\text{opt}} \;\approx\; 20 \text{ tokens/param}\]

That is, the compute-optimal model for budget $C$ has:

\[N_\text{opt} \;\approx\; 0.22 \cdot \left( \frac{C}{6} \right)^{0.5} \qquad D_\text{opt} \;\approx\; 4.5 \cdot \left( \frac{C}{6} \right)^{0.5}\]

with the consequence $D_\text{opt} \approx 20 \cdot N_\text{opt}$.
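
As a quick check, the allocation can be derived directly from the two constraints $C = 6ND$ and $D = 20N$, which give $N = \sqrt{C/120}$. The sketch below assumes exactly those two constraints; `chinchilla_optimal` is a hypothetical helper name, and the 20:1 ratio is a rule of thumb rather than an exact constant.

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6*N*D together with D = tokens_per_param * N for (N, D)."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt


# Chinchilla's own budget: 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs.
n, d = chinchilla_optimal(5.88e23)
print(f"N ~ {n / 1e9:.0f}B params, D ~ {d / 1e12:.2f}T tokens")
# N ~ 70B params, D ~ 1.40T tokens
```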

They demonstrated this by training Chinchilla (70B params, 1.4T tokens) vs Gopher (280B params, 300B tokens). Same compute. Chinchilla beat Gopher across the board. The ratio is 1.4T / 70B = 20.

Hoffmann's parametric fit:

\[L(N, D) \;=\; E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}\]

with fitted constants $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$. The constant $E$ is the irreducible loss (the entropy of the data, modulo tokenization).
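
The parametric fit is easy to evaluate directly. The following is a minimal sketch using only the constants quoted above; `chinchilla_loss` is an illustrative name, and the predictions apply to Hoffmann's data and tokenizer, so treat absolute values on other corpora as indicative.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Hoffmann et al. (2022) parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


chinchilla = chinchilla_loss(70e9, 1.4e12)  # 70B params, 1.4T tokens
gopher = chinchilla_loss(280e9, 300e9)      # 280B params, 300B tokens
print(f"Chinchilla: {chinchilla:.3f}  Gopher: {gopher:.3f}")
```

Under this fit the smaller, longer-trained model comes out with the lower predicted loss (roughly 1.94 vs 1.99), matching the direction of the empirical result.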

The 20:1 rule and its breakdown

The headline number is:

\[\boxed{D_\text{opt} \;\approx\; 20 \cdot N}\]

Under-trained regime ($D < 20 N$):

- Kaplan-style models (GPT-3 at $D/N \approx 1.7$) leave performance on the table.
- The marginal token is more valuable than the marginal parameter.
- This is the regime almost all production frontier models were in during 2020–2022.

Over-trained regime ($D > 20 N$):

- Llama-2 7B trained on 2T tokens ($D/N \approx 286$).
- Llama-3 8B trained on 15T tokens ($D/N \approx 1875$).
- Performance still improves but with diminishing returns.
- Why do it? Because inference cost dominates training cost for any model that ships. Smaller-but-overtrained is cheaper to serve than bigger-but-compute-optimal. The economics flip past ~1B inference tokens served.

This is the "inference-aware" optimum: train a smaller model on more data if you expect heavy inference traffic. Sardana & Frankle 2023 ("Beyond Chinchilla-Optimal") formalize this.
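
A rough way to see the trade-off is to add serving cost to training cost. The sketch below assumes about $2N$ FLOPs per inference token (the forward-pass cost from the $6ND$ derivation) and compares two hypothetical configurations that are simply assumed to reach similar quality; the model sizes and the helper names are illustrative, not taken from Sardana & Frankle.

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Training (6*N*D) plus serving (about 2*N FLOPs per generated or processed token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens


def break_even_inference_tokens(big, small) -> float:
    """Inference volume at which the smaller, longer-trained model becomes cheaper overall.

    `big` and `small` are (n_params, train_tokens) pairs; equal quality is ASSUMED.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    if n_small >= n_big:
        raise ValueError("`small` must have fewer parameters than `big`")
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    return max(extra_training, 0.0) / saving_per_token


# Hypothetical pair: a 13B model at 20 tokens/param vs an over-trained 7B model.
big, small = (13e9, 260e9), (7e9, 1.0e12)
x = break_even_inference_tokens(big, small)
print(f"break-even at ~{x:.2e} inference tokens")
```

Past the break-even volume, every additional served token favours the smaller model, which is the intuition behind the inference-aware optimum.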

Data quality dominates past Chinchilla

Past the 20:1 ratio, the dominant term in further loss reduction is not more tokens — it is better tokens.

Evidence: FineWeb-Edu (Penedo et al. 2024, HuggingFace) filters CommonCrawl with an LLM-classifier scoring "educational value." A 1.8B-param model trained on 350B tokens of FineWeb-Edu beats the same model trained on 1T tokens of raw CommonCrawl on knowledge benchmarks (MMLU, ARC) by ~5 percentage points.

The data-quality scaling law (informal): doubling data quality is worth ~3-5× more raw data, up to a point. Past that, the LLM-classifier becomes a bias source.

Implication for X1: the lab uses FineWeb-Edu shards (or Pile-CC as a fallback), not raw CommonCrawl. The 50M-param model on 5B FineWeb-Edu tokens reaches lower val-loss than 50M on 5B raw-CommonCrawl tokens; the gap is measurable and reproducible.

Citations: the numbers behind the headline

Model	Params (N)	Tokens (D)	D/N	Source
GPT-3	175B	300B	1.7	Brown 2020, Table 2.1
Gopher	280B	300B	1.1	Rae 2021, §2
Chinchilla	70B	1.4T	20	Hoffmann 2022, Table 4
LLaMA-1 7B	6.7B	1.0T	149	Touvron 2023a, Table 1
Llama-2 7B	6.7B	2.0T	298	Touvron 2023b, Table 1
Llama-3 8B	8.0B	15.0T	1875	Meta 2024 (model card)
Falcon-180B	180B	3.5T	19	Almazrouei 2023, Table 1
OLMo-7B	7B	2.5T	357	Groeneveld 2024, Table 1
Pythia-12B	12B	300B	25	Biderman 2023, Table 1
Mistral-7B	7.3B	~8T (est.)	~1100	Jiang 2023 (D unconfirmed)

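The trend: 2020–2021 models were under-trained (D/N < 5). Chinchilla (2022) showed the optimum (D/N ≈ 20). Most models released from 2023 onward that are meant to be served widely are deliberately over-trained (D/N from about 150 to 1900), because inference economics matter more than training economics for shipping products; Falcon-180B and Pythia-12B are the exceptions in the table, sitting near the Chinchilla ratio.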
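
The D/N column can be recomputed from the N and D columns, which is a cheap sanity check on the table. In this sketch the band edges (10 and 40 tokens per parameter) are an illustrative assumption for labelling rows, not thresholds from any of the cited papers; Mistral-7B is left out because its token count is unconfirmed.

```python
MODELS = [
    # (name, params N, tokens D) copied from the table above
    ("GPT-3", 175e9, 300e9),
    ("Gopher", 280e9, 300e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("LLaMA-1 7B", 6.7e9, 1.0e12),
    ("Llama-2 7B", 6.7e9, 2.0e12),
    ("Llama-3 8B", 8.0e9, 15.0e12),
    ("Falcon-180B", 180e9, 3.5e12),
    ("OLMo-7B", 7e9, 2.5e12),
    ("Pythia-12B", 12e9, 300e9),
]


def regime(tokens_per_param: float, low: float = 10.0, high: float = 40.0) -> str:
    """Label a D/N ratio; `low` and `high` are illustrative band edges around 20."""
    if tokens_per_param < low:
        return "under-trained"
    if tokens_per_param > high:
        return "over-trained"
    return "near-optimal"


ratios = {name: d / n for name, n, d in MODELS}
regimes = {name: regime(r) for name, r in ratios.items()}
for name, r in ratios.items():
    print(f"{name:12s} D/N = {r:7.1f}  {regimes[name]}")
```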

What to do with the law

Given a budget $C$ (FLOPs you can afford):

Compute $N_\text{opt} \approx 0.22 \cdot (C/6)^{0.5}$, $D_\text{opt} \approx 4.5 \cdot (C/6)^{0.5}$ (equivalently $N_\text{opt} = \sqrt{C/120}$, $D_\text{opt} = 20 N_\text{opt}$).
If you plan to serve >$N_\text{opt} \cdot 10^9$ inference tokens, shrink $N$ by 2-4× and scale $D$ to preserve $C$ — you'll save inference cost.
If you have a fixed dataset size $D$ (e.g. all of high-quality Pile = ~825GB ≈ 200B tokens), pick the largest $N$ with $D \geq 20 N$: $N_\text{max} = D / 20$.
Always reserve ~10% of $C$ for a re-run in case the first launch hits a recoverable bug.
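
The fixed-dataset case and the reserve rule can be combined into one small planning helper. This is a sketch under stated assumptions: a single pass over the data, the 20:1 rule, and a reserve expressed as a fraction of the total budget; `plan_from_dataset` is a hypothetical name.

```python
def plan_from_dataset(dataset_tokens: float,
                      tokens_per_param: float = 20.0,
                      reserve_fraction: float = 0.10) -> dict:
    """Largest N with D >= tokens_per_param * N, plus the budget to request."""
    n_max = dataset_tokens / tokens_per_param
    run_flops = 6.0 * n_max * dataset_tokens          # cost of the main run
    total_flops = run_flops / (1.0 - reserve_fraction)  # main run is 90% of the budget
    return {
        "n_max": n_max,
        "run_flops": run_flops,
        "total_flops": total_flops,
        "reserve_flops": total_flops - run_flops,
    }


plan = plan_from_dataset(200e9)  # the ~200B-token example above
print(f"N_max = {plan['n_max'] / 1e9:.0f}B params, "
      f"budget = {plan['total_flops']:.2e} FLOPs")
```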

Worked example: the X1 lab numbers

Lab 00 trains 50M params for 24h on 1× A100 80GB at ~150 TFLOP/s sustained (bf16, MFU ~0.48 of A100's 312 TFLOP/s bf16 peak).

Total compute: $1.5 \times 10^{14}$ FLOP/s × $8.64 \times 10^4$ s ≈ $1.3 \times 10^{19}$ FLOP per 24 h.
With $C = 6 N D$, given $N = 5 \times 10^7$: $D = C / (6 N) = 1.3 \times 10^{19} / (6 \times 5 \times 10^7) \approx 4.3 \times 10^{10}$ tokens ≈ 43B tokens.
That is $D/N \approx 860$ — deliberately over-trained, matching the modern small-model recipe.
The Chinchilla-optimal allocation for $C = 1.3 \times 10^{19}$ would be $N = \sqrt{C/120} \approx 330\text{M}$, $D = 20N \approx 6.6\text{B}$. We instead train a 50M model for 24h, which puts $D/N$ about 43× above the Chinchilla ratio, the same order of magnitude as Llama-3-8B's $D/N \approx 1875$, at toy scale.
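
The arithmetic in this worked example can be reproduced end to end. The sketch below takes the sustained throughput, run length, and model size as given and assumes $C = 6ND$ with the 20:1 rule for the comparison point; all variable names are illustrative.

```python
import math

SUSTAINED_FLOPS = 1.5e14   # ~150 TFLOP/s sustained on one A100 (from the text)
SECONDS = 24 * 3600        # 24-hour run
N = 50e6                   # 50M parameters

C = SUSTAINED_FLOPS * SECONDS   # total training compute for the run
D = C / (6.0 * N)               # tokens affordable at this N
ratio = D / N                   # tokens per parameter

# Compute-optimal comparison point under C = 6*N*D and D = 20*N.
n_opt = math.sqrt(C / 120.0)
d_opt = 20.0 * n_opt

print(f"C = {C:.2e} FLOP, D = {D / 1e9:.1f}B tokens, D/N = {ratio:.0f}")
print(f"Chinchilla-optimal: N = {n_opt / 1e6:.0f}M, D = {d_opt / 1e9:.1f}B")
print(f"over-training factor vs 20:1 = {ratio / 20.0:.0f}x")
```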

The lab then verifies the predicted val-loss against the Hoffmann fit with $E=1.69$, $A=406.4$, $\alpha=0.34$, $B=410.7$, $\beta=0.28$:

\[L(5 \times 10^7, 4.3 \times 10^{10}) \approx 1.69 + \frac{406.4}{(5 \times 10^7)^{0.34}} + \frac{410.7}{(4.3 \times 10^{10})^{0.28}}\]

≈ $1.69 + 0.98 + 0.43 \approx 3.10$ nats/token.

This is the prediction the lab compares against. A run that lands within ±0.15 of 3.10 is a successful fit; a deviation above 0.4 indicates a bug.
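
The comparison can be automated with a small check. This sketch recomputes the prediction from the Hoffmann constants and applies the two tolerances stated above; the middle band between 0.15 and 0.4 is not defined in the text, so labelling it "investigate" is an assumption, as are the function names.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Hoffmann et al. (2022) parametric loss, in nats per token."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def classify_run(measured: float, predicted: float,
                 ok_tol: float = 0.15, bug_tol: float = 0.4) -> str:
    """Compare a measured val-loss against the predicted one."""
    deviation = abs(measured - predicted)
    if deviation <= ok_tol:
        return "ok"
    if deviation > bug_tol:
        return "bug"
    return "investigate"  # assumed label for the band the text leaves undefined


predicted = chinchilla_loss(5e7, 4.3e10)
print(f"predicted val-loss: {predicted:.2f} nats/token")
print(classify_run(predicted + 0.10, predicted))  # ok
print(classify_run(predicted + 0.50, predicted))  # bug
```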

Next: theory/02-cluster-economics.md.