00 - Motivation: why one cloud run

There is a learning ceiling that only breaks when you pay: the dynamics of pretraining at scale cannot be understood by reading papers. X1 pays that minimum ($50) so that terms like "loss spike," "MFU 0.45," and "Chinchilla-optimal" stop being jargon and become measured numbers.

The 40-phase curriculum trains a ≤500 k-parameter model on a 600-form verb-grammar corpus. That is enough to derive every mechanism — attention, autograd, batching, mixed precision drift, KV cache, RAG — from first principles. It is not enough to know what training at scale feels like.

This module exists to close that gap with the smallest possible spend.

What "at scale" actually means here

"Pretraining at scale" in industry means 1B–70B+ parameters on 1T–15T tokens on 100–10000 GPUs for 10–100 days. That world costs $1M–$100M per run. We are not going there.

We are going to the first rung that still teaches the dynamics:

50M parameters (~100× larger than MiniGPT, ~160× smaller than Llama-3-8B).
~5B tokens. The Chinchilla rule of thumb puts the compute-optimal budget for 50M params at 50M × 20 ≈ 1B tokens, so 5B deliberately over-trains by roughly 5×.
1× A100 80GB, 24 hours, ~$30.
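
The token-budget arithmetic above can be checked in a few lines. This is a minimal sketch: the 20 tokens/param ratio is the rounded rule of thumb quoted in this module, not an exact constant, and the function name `chinchilla_tokens` is hypothetical.

```python
# Rounded Chinchilla rule of thumb as quoted in this module (a fitted value, not exact).
TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal token budget for a model with n_params parameters."""
    return n_params * tokens_per_param


n_params = 50e6        # X1 model size
planned_tokens = 5e9   # X1 data budget

optimal = chinchilla_tokens(n_params)
overtrain_factor = planned_tokens / optimal

print(f"rule-of-thumb optimal tokens: {optimal:.2e}")
print(f"planned tokens:               {planned_tokens:.2e}")
print(f"over-train factor:            {overtrain_factor:.1f}x")
```

Swapping in a different ratio (the fitted value depends on which Chinchilla fit you read) changes the over-train factor, which is part of what theory/01 is for.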

At this scale you can:

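See a real loss curve drop from $\ln V$ (the cross-entropy of a uniform guess over a vocabulary of size $V$, which is roughly where a randomly initialized model starts) to ~3.0 nats/token over 5B tokens. (Bigger models hit ~2.0; we won't.)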
Hit an honest MFU number (model FLOPs utilization). 1× A100 at bf16, GPT-style: 0.35–0.45 MFU is realistic.
Witness or inject a loss spike and run the post-mortem.
Quote Chinchilla and check whether your run matched the prediction.
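
Two of the numbers in this list can be computed before the run starts. The sketch below is illustrative only: the vocabulary size (a GPT-2-sized 50,257) and the throughput figure are assumptions, not values from this module, and the MFU formula uses the common approximation of about 6 × N FLOPs per token for forward plus backward, ignoring attention FLOPs. The 312 TFLOP/s figure is NVIDIA's published dense bf16 peak for the A100.

```python
import math

A100_BF16_PEAK_FLOPS = 312e12  # published dense bf16 peak for one A100


def initial_loss_nats(vocab_size: int) -> float:
    """Cross-entropy of a uniform guess over the vocabulary, in nats/token."""
    return math.log(vocab_size)


def mfu(n_params: float, tokens_per_sec: float,
        peak_flops: float = A100_BF16_PEAK_FLOPS) -> float:
    """Model FLOPs utilization under the ~6 * N FLOPs-per-token approximation."""
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops


# Hypothetical inputs: replace with your tokenizer's vocab size and measured throughput.
vocab_size = 50_257
tokens_per_sec = 400_000.0

print(f"expected initial loss: {initial_loss_nats(vocab_size):.2f} nats/token")
print(f"MFU at {tokens_per_sec:,.0f} tok/s: {mfu(50e6, tokens_per_sec):.2f}")
```

If the first logged loss is far from $\ln V$, that usually points at an initialization or labeling bug rather than at the data.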

You cannot:

Observe the qualitative phase transitions ("emergent capabilities") that appear past ~1B params and ~50B tokens. Those need real money.
Stress-test 3D parallelism, ZeRO-3, or pipeline bubble math experimentally. Phase 35 covers them as theory + reading.
Practice cluster recovery (node failures, NCCL timeouts, checkpoint corruption at scale). Different curriculum (the SRE-shaped one).

The honest gap from `HIRING_PATH.md`

From HIRING_PATH.md line 262:

Pretraining at scale. The curriculum trains a 100k–500k-param model on a 600-form corpus. The dynamics of multi-billion-parameter pretraining (loss-curve idiosyncrasies, OPT-style instabilities, dynamic-loss-scaling subtleties) are read about, not lived. Phase 35 surveys distributed; it does not run a 70B-model job.

X1 makes one targeted move against this gap: it does run a real (small) pretraining job. After X1 you have a safetensors checkpoint of a model you trained from random init on real web data, plus a loss curve, plus a cost ledger.

That is not 70B. It is also not zero. It is the minimum non-zero.

Why not just read more papers?

You should also do that — theory/01-04 exist for that reason. But:

Numbers in papers are quoted as convenient round figures. (Chinchilla's "20 tokens/param" is a rounded summary of a fitted curve, not an exact constant.) Running once shows you the noise around the headline.
The infrastructure tax of pretraining (data pipeline throughput, checkpoint cadence, eval interleaving, killing-and-resuming) is invisible from a paper. The lab forces you through it.
A loss spike at hour 14 of your run is pedagogy you can't buy any other way.

What this module is not

Three failure modes for a "pretraining at scale" learning module to avoid:

1. The benchmark hunt. This is not "beat OLMo-150M on its eval suite." We do not have the data or the budget. We will lose. The eval is "did the run finish, did it produce a checkpoint, did the loss curve match the scaling-law prediction within a stated tolerance."

2. The framework parade. It is tempting to use this as an excuse to learn accelerate, deepspeed, megatron, lightning, all at once. We will use none of them. Single-GPU, plain torch.compile, one Python file. The point is to keep every line of the training loop visible at hour 14 when something goes wrong.

3. The infinite tinker. One configuration, derived from theory. If it diverges, post-mortem, one re-launch. We do not sweep. Sweeps are how you spend $5k by accident.

The cost-discipline thesis

Phase 35 introduced src/distributed/budget_guard.py and a $5 ceiling. X1 raises the ceiling to $50 but keeps the discipline. Every cloud action in this module starts with:

"This will spin up 1× A100 80GB spot on Lambda at $1.10/hr, run for ~24 h including setup tax = $26.40. Budget for lab 00: $35. Confirmed acceptable, $8.60 buffer."

If you cannot write that sentence about a run, the run does not launch. The budget_guard.py wrapper from Phase 35 enforces this — the same module, raised ceiling, same rules.

If hour-by-hour spend exceeds projection by >20% (e.g. the spot price tripled), the guard alerts; you decide kill or continue. There is no silent over-spend mode.
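
The launch sentence and the 20% overrun rule can be sketched as follows. This is not the actual interface of `src/distributed/budget_guard.py`, which is not shown here; `RunPlan` and `overspend_alert` are hypothetical names used only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class RunPlan:
    hourly_rate_usd: float  # quoted price for the instance
    hours: float            # planned wall-clock hours, including setup
    budget_usd: float       # budget for this lab

    @property
    def projected_cost(self) -> float:
        return self.hourly_rate_usd * self.hours

    @property
    def buffer(self) -> float:
        return self.budget_usd - self.projected_cost

    def can_launch(self) -> bool:
        """A run launches only if its projected cost fits inside the budget."""
        return self.projected_cost <= self.budget_usd


def overspend_alert(plan: RunPlan, hours_elapsed: float,
                    spend_so_far: float, tolerance: float = 0.20) -> bool:
    """True when actual spend exceeds the projection to date by more than `tolerance`."""
    projected_so_far = plan.hourly_rate_usd * hours_elapsed
    return spend_so_far > projected_so_far * (1.0 + tolerance)


plan = RunPlan(hourly_rate_usd=1.10, hours=24, budget_usd=35.0)
print(f"projected ${plan.projected_cost:.2f}, "
      f"buffer ${plan.buffer:.2f}, launch={plan.can_launch()}")

# Ten hours in, spend is well ahead of the $11.00 projected so far.
print("alert:", overspend_alert(plan, hours_elapsed=10, spend_so_far=17.60))
```

The alert only returns a flag; in keeping with the rule above, the kill-or-continue decision stays with the person running the job.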

What lands in the wider curriculum

X1 is post-curriculum but feeds back into a few places:

Phase 26 quantization. The capstone artifact of X1 is the 50M base model. You then take it through Phase 26's int8 / int4 paths and measure $/token on inference. This grounds the quantization math in a checkpoint you trained.
Phase 32 agents. The Phase 32 grammar tutor uses an open-weights LLM (Llama-3-8B class or similar). After X1, you have a calibrated intuition for what that model cost the team that pretrained it (you can do the math in theory/02), which sharpens the "use vs train" decision.
HIRING_PATH.md honesty. The line on page 1 — "I have run one pretraining job on one cloud GPU; here is the loss curve, the cost, and the scaling-law fit" — is no longer a disclaimer. It is a link.

One-paragraph recap

X1 closes the "no real pretraining" gap in the curriculum with one targeted, budget-capped, single-GPU run: 50M parameters, ~5B tokens, 24 hours, ~$30, hard $50 ceiling. The goal is lived numbers: a loss curve you can defend, a scaling-law fit you measured, a loss-spike post-mortem you wrote. Not a benchmark win, not a framework parade, not an infinite tinker. One configuration, derived from theory, run once, post-mortemed, and shipped as an artifact.

Next: theory/01-scaling-laws.md.