
Extension X1: Pretraining at Scale

Requires: Phase 35 (Distributed Training & Inference)
Teaches: scaling-laws · chinchilla · mfu · pretraining · cluster-economics
Jump to any chapter from the phase reference index.

Chapter map

Extension track. Authorized by §A15 (extension addendum, parallel). This module sits outside the 40-phase core curriculum and closes the "no actual large pretraining" gap flagged in HIRING_PATH.md §"Honest gaps" (line 262). It is not a scope expansion of the §A13 microscopic universe: the verb-grammar tutor remains the curriculum capstone. X1 is the post-curriculum module where Borja runs one real pretraining job on a real cloud GPU so that the words "loss spike," "Chinchilla-optimal," and "MFU 0.45" map to lived numbers, not blog posts.

X1 closes the real-pretraining gap flagged in HIRING_PATH.md. A single 24 h cloud run, 50M parameters, ~5B tokens, and a hard $50 budget: enough to feel the dynamics (loss curve, spikes, MFU, cost per token) without needing a frontier-lab budget.

Status

Type: post-curriculum extension (not a phase).
Prerequisites: Phase 18 (training loop), Phase 26 (quantization, for the inference-cost epilogue), Phase 35 (distributed parallelism vocabulary). The lab assumes you have already shipped a working mlflow + safetensors + manifest pipeline from Phase 18.
Hardware: 1× cloud GPU (A100 80GB target). No multi-node. Multi-GPU is described in theory only — runs are single-GPU.
Spend ceiling: $50 (hard cap, enforced by budget_guard.py reused from Phase 35).
Duration: ~3 days of hands-on effort spread over ~2 weeks elapsed (theory reading + one ~24 h cloud run + one scaling-law triple).

What this extension teaches

By the end of X1 you can:

State and apply the Chinchilla compute-optimal rule ($D_\text{opt} \approx 20 \cdot N$, where $N$ is parameter count and $D$ is training tokens) and explain when it under- vs over-estimates the token budget worth spending.
Estimate the dollar cost of training an arbitrary $N$-parameter model to $D$ tokens on a specified cluster, accounting for MFU, spot vs on-demand $/GPU-hour, and energy.
Run a 24-hour single-GPU pretraining job on lambda.ai or runpod.io, train a ~50M-parameter decoder-only transformer on ~5B tokens of Pile-CC / FineWeb-Edu, and produce a reproducible loss curve.
Fit a scaling law from a 3-point sweep (5M / 25M / 100M params, fixed compute budget) and predict the compute-optimal token count for a hypothetical 1B-param model.
Diagnose a loss spike post-hoc from a mlflow run: identify the offending batch range, classify the failure mode (rare-token gradient, Adam β₂ explosion, learning-rate-too-high, dataset corruption), and propose the appropriate recovery (skip-batch, restart-from-prior-ckpt, clip-tighter, μP rescale).
Read a real frontier pretraining log (OLMo, Pythia, Falcon, Llama) and name the architectural and data choices that drove each line.
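The first two outcomes reduce to a few lines of arithmetic. The sketch below assumes the common approximation $C \approx 6ND$ for training FLOPs and the 20-tokens-per-parameter heuristic. The function names, the A100 peak of 312 TFLOP/s (dense BF16), and the example MFU and price are illustrative assumptions, not values taken from the X1 lab code. Energy is left out.

```python
"""Back-of-envelope pretraining cost estimate (illustrative sketch)."""

A100_PEAK_FLOPS = 312e12  # assumption: A100 dense BF16 peak, in FLOP/s


def chinchilla_tokens(n_params, tokens_per_param=20.0):
    """Compute-optimal token count under the ~20 tokens/param heuristic."""
    return tokens_per_param * n_params


def training_flops(n_params, n_tokens):
    """Approximate training compute as C ~= 6 * N * D (forward + backward)."""
    return 6.0 * n_params * n_tokens


def estimate_run(n_params, n_tokens, mfu, usd_per_gpu_hour,
                 n_gpus=1, peak_flops=A100_PEAK_FLOPS):
    """Return wall-clock hours, GPU-hours, and dollars for one training run."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    achieved_flops_per_s = peak_flops * mfu * n_gpus
    wall_hours = training_flops(n_params, n_tokens) / achieved_flops_per_s / 3600.0
    gpu_hours = wall_hours * n_gpus
    return {
        "wall_hours": wall_hours,
        "gpu_hours": gpu_hours,
        "usd": gpu_hours * usd_per_gpu_hour,
    }


if __name__ == "__main__":
    # Hypothetical 1B-param model trained to its Chinchilla-optimal token count.
    n = 1e9
    d = chinchilla_tokens(n)
    print(estimate_run(n, d, mfu=0.45, usd_per_gpu_hour=1.10))
```

Treat the result as an order-of-magnitude figure: the measured MFU of a small model on a single GPU can sit well below the value you plug in, which is one of the things the lab is meant to show.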

Read order

Theory (read first, in order)

theory/00-motivation.md — why one cloud run; the gap this closes; what you can and cannot learn from $50 of pretraining.
theory/01-scaling-laws.md — Kaplan 2020, Hoffmann (Chinchilla) 2022; the 20-token/param rule; data-vs-params Pareto; over- and under-training regimes.
theory/02-cluster-economics.md — H100 / A100 / H200 specs; $/GPU-hour spot vs on-demand; bandwidth costs; energy; the 7B-for-$1M math worked out.
theory/03-data-pipelines-at-scale.md — CommonCrawl → quality filter → dedup → tokenize → shard; FineWeb-Edu; throughput math; sharding for streaming.
theory/04-training-stability-at-scale.md — loss spikes, μP, weight decay; recovery procedures; mid-training interventions (LR resets, curriculum, data swaps).

Lab (do after theory)

lab/00-one-day-cloud-pretraining.md — the one real run. ~24 h on 1× A100 80GB. Hard $50 cap.
lab/01-scaling-laws-experiment.md — three smaller runs (5M / 25M / 100M params) on a fixed compute budget. Fit Chinchilla curve. Predict 1B-param optimal.
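One way to turn three fixed-compute runs into a prediction is an IsoFLOP-style fit: put a parabola through validation loss versus log parameter count and read off its minimum. The sketch below is an assumption about how the lab analysis could look, not the lab's actual code. The function names are hypothetical, and the extrapolation step assumes the optimal tokens-per-parameter ratio stays constant with scale, which is a simplification of the Chinchilla result. Note that a 3-point quadratic fit is exact, so it gives no error estimate by itself.

```python
"""IsoFLOP-style parabola fit for a 3-point scaling sweep (illustrative sketch)."""
import numpy as np


def isoflop_optimum(params, val_losses, compute_flops):
    """Fit loss = a*x^2 + b*x + c with x = log10(params); return (N_opt, D_opt).

    D_opt follows from C ~= 6 * N * D at the fitted optimum.
    """
    x = np.log10(np.asarray(params, dtype=float))
    y = np.asarray(val_losses, dtype=float)
    a, b, _ = np.polyfit(x, y, deg=2)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum; sweep may not bracket the optimum")
    n_opt = 10.0 ** (-b / (2.0 * a))
    d_opt = compute_flops / (6.0 * n_opt)
    return n_opt, d_opt


def extrapolate_tokens(n_opt, d_opt, target_params):
    """Assumption: the optimal tokens/param ratio is scale-invariant."""
    return (d_opt / n_opt) * target_params


if __name__ == "__main__":
    # Placeholder numbers for illustration only; replace with your own CSV rows.
    sweep_params = [5e6, 25e6, 100e6]
    sweep_losses = [3.9, 3.6, 3.7]
    budget_flops = 1e18
    n_opt, d_opt = isoflop_optimum(sweep_params, sweep_losses, budget_flops)
    print(n_opt, d_opt, extrapolate_tokens(n_opt, d_opt, 1e9))
```

If the fitted minimum falls outside the 5M to 100M range, the sweep did not bracket the optimum and the prediction should be reported as unreliable.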

Cross-links

Phase 18 — Training loop. The optimizer, scheduler, checkpoint, and mlflow wiring you ship in Phase 18 is the substrate X1 runs on top of. The X1 lab adds a streaming dataloader and a token-budget runner — everything else is reused. (docs/phase-18-training-loop/)
Phase 26 — Quantization. After pretraining, you quantize the 50M checkpoint to int8 / int4 and measure inference $/token. The capstone artifact of X1 is a cost-per-token number for the model you trained versus the open-weights baselines. (docs/phase-26-quantization/)
Phase 35 — Distributed. X1 is single-GPU. The "what changes at >10B params" content of theory/02 and theory/04 is grounded in Phase 35's TP/PP/ZeRO-3 vocabulary. If those words don't yet map to mechanism, finish Phase 35 first. (docs/phase-35-distributed/)
Phase 36 — Frontier architectures. The model-shape choice for the lab (depth, width, head count, GQA on/off) cites Phase 36's design space. The X1 lab uses a mid-2024-default shape (32 layers × 768 width × 12 heads × GQA-4); Phase 36 is where you'd justify deviating from it. (docs/phase-36-frontier-architectures/)

Definition of Done (extension-track DoD)

Extension tracks do not have a PHASE_NN_REPORT.md (they're outside the 40-phase ritual). X1 ships these four binary checks:

One reproducible cloud run. experiments/X1-pretraining/run-cloud/ contains: manifest.json (seed, versions, config, cluster, $-spent), mlflow URI, loss-curve.png, and safetensors checkpoint. The run hit the 24-hour wall-clock and stayed under $50.
Scaling-law triple done. experiments/X1-pretraining/scaling-law/ has three runs (5M, 25M, 100M params), one CSV of (params, tokens, val-loss), one fitted curve plot, and one written prediction for 1B-param compute-optimal token count with a confidence interval.
Loss-spike post-mortem. experiments/X1-pretraining/spike-postmortem.md exists with: timestamp of detected spike (synthetic if no real one occurred — inject one), classification, evidence (gradient norm log, batch-token-histogram), recovery action taken, and outcome.
Quiz. /quiz X1 scores ≥ 80%.
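For the post-mortem, the first step is locating the offending step range in the logged loss. A minimal sketch, assuming the per-step training loss is already available as a plain list: flag any step whose loss exceeds a multiple of the trailing median. The function name, window size, and threshold ratio are hypothetical choices, and NaN losses are not handled.

```python
"""Locate loss spikes in a logged training curve (illustrative sketch)."""
from statistics import median


def find_loss_spikes(losses, window=50, ratio=1.5):
    """Return inclusive (start_step, end_step) ranges where loss > ratio * trailing median.

    The median over the previous `window` steps is robust to short spikes,
    so a spike shorter than window/2 does not shift its own baseline.
    """
    spikes = []
    start = None
    for step, loss in enumerate(losses):
        history = losses[max(0, step - window):step]
        baseline_ready = len(history) >= window // 2
        is_spike = baseline_ready and loss > ratio * median(history)
        if is_spike and start is None:
            start = step                      # spike opens
        elif not is_spike and start is not None:
            spikes.append((start, step - 1))  # spike closes
            start = None
    if start is not None:                     # spike still open at end of log
        spikes.append((start, len(losses) - 1))
    return spikes
```

With mlflow, the loss series can be pulled via `MlflowClient().get_metric_history(run_id, key)`, which returns entries carrying `step` and `value`; the metric key depends on what your Phase 18 loop logged. The returned step ranges then map back to batch indices for the gradient-norm and token-histogram evidence.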

A short reflections.md in learners/borja/extension-X1/ covers: what surprised me about MFU at this scale; how Chinchilla-optimal differed from my pre-run intuition; which cost line dominated (compute / bandwidth / storage / human-hours).

Cost estimate (concrete)

| Line item | Detail | Cost (USD) |
|---|---|---|
| Lab 00 (one-day pretraining) | 1× A100 80GB spot @ ~$1.10/hr × 24 h | $26 |
| Lab 00 storage egress | 200 GB tokenized data, hot for 24 h | $3 |
| Lab 01 scaling-law triple | 3× ~4 h runs on same node = 12 h × $1.10/hr | $13 |
| Buffer for restarts / debug | ~10% padding | $5 |
| Total ceiling | enforced by `budget_guard.py` | $50 |

If the actual price exceeds the ~$1.10/hr assumed above (A100 80GB has been seen at $1.79/hr on Lambda on-demand and $0.79/hr lowest-quartile on RunPod community spot), budget_guard.py refuses the launch and you re-plan. Do not silently exceed the cap.
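The refuse-to-launch behavior amounts to a pre-flight projection against the cap. The sketch below is a hypothetical stand-in written for illustration; it is not the `budget_guard.py` shipped in Phase 35, and the function, exception, and parameter names are assumptions.

```python
"""Pre-flight spend check against a hard cap (illustrative sketch)."""


class BudgetExceeded(RuntimeError):
    """Raised when the projected spend would pass the hard cap."""


def check_budget(usd_per_hour, planned_hours, fixed_usd=0.0,
                 spent_usd=0.0, cap_usd=50.0):
    """Return projected total spend, or raise BudgetExceeded before launch."""
    projected = spent_usd + fixed_usd + usd_per_hour * planned_hours
    if projected > cap_usd:
        raise BudgetExceeded(
            f"projected ${projected:.2f} exceeds cap ${cap_usd:.2f}; re-plan"
        )
    return projected


if __name__ == "__main__":
    # 24 h (Lab 00) + 12 h (Lab 01) of GPU time, plus the $3 storage line.
    print(check_budget(usd_per_hour=1.10, planned_hours=36, fixed_usd=3.0))
    try:
        check_budget(usd_per_hour=1.79, planned_hours=36, fixed_usd=3.0)
    except BudgetExceeded as err:
        print("launch refused:", err)
```

Raising an exception rather than returning a flag makes it hard for a launch script to ignore the result by accident.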

What this extension intentionally does NOT cover

Multi-node training. Single GPU only. Inter-node InfiniBand / NVLink topologies are covered in theory/02-cluster-economics.md for vocabulary, not run.
A frontier-scale model. 50M params is the run size. We extrapolate to 1B / 7B / 70B in theory; we do not train them.
Custom CUDA kernels. Phase 24 covers Triton. X1 uses stock torch.compile / FlashAttention-2 from pip.
MoE / sparse models. Phase 36 territory. X1 stays dense.
RLHF / SFT on top of the pretrained model. Phase 31 / hypothetical X3 territory. X1 stops at the base model.
Tokenizer training. We use a pretrained tokenizer (GPT-2 BPE or Llama-3 tokenizer). Tokenizer training at scale is Phase 11 territory; the data-scale of that step is dwarfed by the model-training step we focus on here.
Hyperparameter search. One configuration, derived from theory. If it diverges, we post-mortem and re-launch once; we do not sweep.

Build-before-abstract policy for X1

The core curriculum rule (CLAUDE.md §0.4): no PyTorch before Phase 25, no transformers before Phase 24. By the time you reach X1, both gates are open. X1 uses:

PyTorch 2.4+ with torch.compile for the training loop.
transformers.GPT2Tokenizer (or the Llama-3 tokenizer) — we do not retrain a tokenizer.
datasets for streaming Pile-CC / FineWeb-Edu.
flash-attn for the attention kernel (CUDA-only; gated behind a runtime check).
NO accelerate, NO deepspeed, NO lightning. Single-GPU is single-process; we do not need a framework to launch one Python script.
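The runtime check that gates flash-attn could look roughly like the sketch below. The helper name is hypothetical, and falling back to PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` is an assumption about what the training script would do when the kernel is unavailable.

```python
"""Runtime gate for the flash-attn kernel (illustrative sketch)."""
import importlib.util


def flash_attn_usable() -> bool:
    """True only if torch and flash_attn are installed and a CUDA device is visible."""
    if importlib.util.find_spec("torch") is None:
        return False
    if importlib.util.find_spec("flash_attn") is None:
        return False
    import torch  # imported lazily so this check works on machines without torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    # Assumed fallback: PyTorch's scaled_dot_product_attention ("sdpa").
    backend = "flash_attn" if flash_attn_usable() else "sdpa"
    print(f"attention backend: {backend}")
```

Checking with `find_spec` avoids importing a CUDA-only package on a laptop just to discover it is missing.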

The training script itself is a single ~400-line nanoGPT-shaped file in src/x1_pretrain/train.py. The point is to keep every variable visible — when the loss spikes at hour 14, you can read the loop top-to-bottom without leaving the file.
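Because the loop is a single file, an MFU readout can live next to the loss logging. A minimal sketch, assuming the $6N$ FLOPs-per-token approximation (attention FLOPs ignored) and an A100 dense BF16 peak of 312 TFLOP/s; the helper names and the example numbers are hypothetical.

```python
"""MFU from measured throughput (illustrative sketch)."""

A100_PEAK_FLOPS = 312e12  # assumption: A100 dense BF16 peak, in FLOP/s


def tokens_per_second(tokens_per_step, step_seconds):
    """Throughput from one timed optimizer step."""
    return tokens_per_step / step_seconds


def mfu(n_params, tok_per_s, peak_flops=A100_PEAK_FLOPS):
    """Model FLOPs utilization: achieved training FLOP/s over hardware peak.

    Uses ~6 * N FLOPs per token (forward + backward), ignoring attention.
    """
    return 6.0 * n_params * tok_per_s / peak_flops


if __name__ == "__main__":
    # Hypothetical measurement: 65,536 tokens per step, 0.5 s per step.
    tps = tokens_per_second(65_536, 0.5)
    print(f"{tps:.0f} tok/s, MFU = {mfu(50e6, tps):.3f}")
```

Time the step after `torch.compile` warm-up and average over many steps; a single early step will understate throughput.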

Next: theory/00-motivation.md.