Extension X1: Pretraining at Scale
Requires: 35 (Distributed Training & Inference)
Teaches: scaling-laws · chinchilla · mfu · pretraining · cluster-economics
Jump to any chapter from the phase reference index.
Chapter map
Extension track. Authorized by §A15 (extension addendum, parallel). This module sits outside the 40-phase core curriculum and closes the "no actual large pretraining" gap flagged in HIRING_PATH.md § "Honest gaps" (line 262). It is not a scope expansion of the §A13 microscopic universe: the verb-grammar tutor remains the curriculum capstone. X1 is the post-curriculum module where Borja runs one real pretraining job on a real cloud GPU so that the words "loss spike," "Chinchilla-optimal," and "MFU 0.45" map to lived numbers, not blog posts.

In short: X1 closes the real-pretraining gap flagged in HIRING_PATH.md with a single 24 h cloud run (50M parameters, ~5B tokens, hard $50 budget), enough to feel the dynamics (loss curve, spikes, MFU, cost per token) without needing a frontier-lab budget.
Status
- Type: post-curriculum extension (not a phase).
- Prerequisites: Phase 18 (training loop), Phase 26 (quantization, for the inference-cost epilogue), Phase 35 (distributed parallelism vocabulary). The lab assumes you have already shipped a working mlflow + safetensors + manifest pipeline from Phase 18.
- Hardware: 1× cloud GPU (A100 80GB target). No multi-node. Multi-GPU is described in theory only; runs are single-GPU.
- Spend ceiling: $50 (hard cap, enforced by budget_guard.py, reused from Phase 35).
- Duration: ~3 days of wall-clock effort over ~2 weeks elapsed (theory reading + one ~24 h cloud run + one scaling-law triple).
What this extension teaches
By the end of X1 you can:
- State and apply the Chinchilla compute-optimal rule (\(D_\text{opt} \approx 20 \cdot N\), where \(N\) is parameter count and \(D\) is training tokens) and explain when it under- vs over-estimates.
- Estimate the dollar cost of training an arbitrary \(N\)-parameter model to \(D\) tokens on a specified cluster, including MFU, $/GPU-hour spot/on-demand, and energy.
- Run a 24-hour single-GPU pretraining job on lambda.ai or runpod.io, train a ~50M-parameter decoder-only transformer on ~5B tokens of Pile-CC / FineWeb-Edu, and produce a reproducible loss curve.
- Fit a scaling law from a 3-point sweep (5M / 25M / 100M params, fixed compute budget) and predict the compute-optimal token count for a hypothetical 1B-param model.
- Diagnose a loss spike post-hoc from an mlflow run: identify the offending batch range, classify the failure mode (rare-token gradient, Adam β₂ explosion, learning-rate-too-high, dataset corruption), and propose the appropriate recovery (skip-batch, restart-from-prior-ckpt, clip-tighter, μP rescale).
- Read a real frontier pretraining log (OLMo, Pythia, Falcon, Llama) and name the architectural and data choices that drove each line.
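The first two outcomes above reduce to a few lines of arithmetic. The following is a rough sketch, not the X1 lab code: it assumes the common approximation that training a dense transformer costs about 6·N·D FLOPs, uses 312 TFLOP/s as the A100's dense BF16 peak, and ignores energy and storage. The function names and the example MFU and price are illustrative assumptions.

```python
A100_BF16_PEAK_FLOPS = 312e12  # dense BF16 peak of one A100, in FLOP/s


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token count D for N parameters (D ~= 20 * N)."""
    return tokens_per_param * n_params


def mfu_from_throughput(
    n_params: float,
    tokens_per_second: float,
    peak_flops: float = A100_BF16_PEAK_FLOPS,
) -> float:
    """Model FLOPs utilization: achieved training FLOP/s divided by hardware peak."""
    return 6.0 * n_params * tokens_per_second / peak_flops


def training_cost_usd(
    n_params: float,
    n_tokens: float,
    mfu: float,
    usd_per_gpu_hour: float,
    peak_flops: float = A100_BF16_PEAK_FLOPS,
    n_gpus: int = 1,
) -> tuple[float, float]:
    """Return (gpu_hours, dollars) using the C ~= 6 * N * D approximation."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    total_flops = 6.0 * n_params * n_tokens
    achieved_flops_per_second = peak_flops * mfu * n_gpus
    wall_clock_hours = total_flops / achieved_flops_per_second / 3600.0
    gpu_hours = wall_clock_hours * n_gpus
    return gpu_hours, gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    # Hypothetical 1B-parameter model trained to its Chinchilla-optimal token count.
    n = 1e9
    d = chinchilla_tokens(n)
    hours, dollars = training_cost_usd(n, d, mfu=0.40, usd_per_gpu_hour=1.10)
    print(f"D = {d:.2e} tokens, {hours:.0f} GPU-hours, ${dollars:.0f}")
```

MFU dominates the estimate: halving it doubles both the GPU-hours and the bill, which is why it is worth measuring on the real run rather than assuming.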
Read order
Theory (read first, in order)
- theory/00-motivation.md: why one cloud run; the gap this closes; what you can and cannot learn from $50 of pretraining.
- theory/01-scaling-laws.md: Kaplan 2020, Hoffmann (Chinchilla) 2022; the 20-token/param rule; data-vs-params Pareto; over- and under-training regimes.
- theory/02-cluster-economics.md: H100 / A100 / H200 specs; $/GPU-hour spot vs on-demand; bandwidth costs; energy; the 7B-for-$1M math worked out.
- theory/03-data-pipelines-at-scale.md: CommonCrawl → quality filter → dedup → tokenize → shard; FineWeb-Edu; throughput math; sharding for streaming.
- theory/04-training-stability-at-scale.md: loss spikes, μP, weight decay; recovery procedures; mid-training interventions (LR resets, curriculum, data swaps).
Lab (do after theory)
- lab/00-one-day-cloud-pretraining.md: the one real run. ~24 h on 1× A100 80GB. Hard $50 cap.
- lab/01-scaling-laws-experiment.md: three smaller runs (5M / 25M / 100M params) on a fixed compute budget. Fit Chinchilla curve. Predict 1B-param optimal.
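One way to turn the three fixed-compute runs of lab/01 into a compute-optimal point is an IsoFLOP-style fit in the spirit of the Chinchilla paper: fit a parabola to validation loss against log10(params) and read off its minimum. The sketch below assumes that approach and the C ≈ 6·N·D approximation; the function name and the loss values are placeholders, not measured results. With only three points the parabola fits exactly, so this locates the optimum at the measured budget but says nothing about uncertainty or about how the optimum moves with compute.

```python
import numpy as np


def isoflop_optimum(params, val_losses, compute_flops):
    """Fit loss = a*x^2 + b*x + c with x = log10(params); return (N_opt, D_opt)."""
    x = np.log10(np.asarray(params, dtype=float))
    y = np.asarray(val_losses, dtype=float)
    a, b, _c = np.polyfit(x, y, deg=2)
    if a <= 0:
        raise ValueError("no interior minimum: loss is not convex in log10(params)")
    n_opt = 10.0 ** (-b / (2.0 * a))  # vertex of the parabola
    d_opt = compute_flops / (6.0 * n_opt)  # tokens implied by C ~= 6 * N * D
    return n_opt, d_opt


if __name__ == "__main__":
    # Placeholder numbers for illustration only, NOT results from the lab.
    params = [5e6, 25e6, 100e6]
    val_losses = [4.10, 3.70, 3.85]
    n_opt, d_opt = isoflop_optimum(params, val_losses, compute_flops=1e18)
    print(f"N_opt ~ {n_opt:.3g} params, D_opt ~ {d_opt:.3g} tokens")
```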
Cross-links
- Phase 18: Training loop. The optimizer, scheduler, checkpoint, and mlflow wiring you ship in Phase 18 are the substrate X1 runs on top of. The X1 lab adds a streaming dataloader and a token-budget runner; everything else is reused. (docs/phase-18-training-loop/)
- Phase 26: Quantization. After pretraining, you quantize the 50M checkpoint to int8 / int4 and measure inference $/token. The capstone artifact of X1 is a cost-per-token number for the model you trained versus the open-weights baselines. (docs/phase-26-quantization/)
- Phase 35: Distributed. X1 is single-GPU. The "what changes at >10B params" content of theory/02 and theory/04 is grounded in Phase 35's TP/PP/ZeRO-3 vocabulary. If those words don't yet map to mechanism, finish Phase 35 first. (docs/phase-35-distributed/)
- Phase 36: Frontier architectures. The model-shape choice for the lab (depth, width, head count, GQA on/off) cites Phase 36's design space. The X1 lab uses a mid-2024-default shape (32 layers × 768 width × 12 heads × GQA-4); Phase 36 is where you'd justify deviating from it. (docs/phase-36-frontier-architectures/)
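The inference $/token figure mentioned under Phase 26 is a unit conversion once decode throughput has been measured. A minimal sketch, assuming a single GPU billed by the hour and a steady throughput; the function name is hypothetical and not part of the Phase 26 code:

```python
def usd_per_million_tokens(tokens_per_second: float, usd_per_gpu_hour: float) -> float:
    """Convert measured decode throughput and GPU price into $ per 1M generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600.0
    return usd_per_gpu_hour / tokens_per_hour * 1e6
```

Running it once per checkpoint variant (fp16, int8, int4) with the throughput measured for each gives directly comparable numbers.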
Definition of Done (extension-track DoD)
Extension tracks do not have a PHASE_NN_REPORT.md (they're outside the 40-phase ritual). X1 ships these four binary checks:
- One reproducible cloud run. experiments/X1-pretraining/run-cloud/ contains: manifest.json (seed, versions, config, cluster, $-spent), mlflow URI, loss-curve.png, and safetensors checkpoint. The run hit the 24-hour wall-clock and stayed under $50.
- Scaling-law triple done. experiments/X1-pretraining/scaling-law/ has three runs (5M, 25M, 100M params), one CSV of (params, tokens, val-loss), one fitted curve plot, and one written prediction for 1B-param compute-optimal token count with a confidence interval.
- Loss-spike post-mortem. experiments/X1-pretraining/spike-postmortem.md exists with: timestamp of detected spike (synthetic if no real one occurred; inject one), classification, evidence (gradient norm log, batch-token-histogram), recovery action taken, and outcome.
- Quiz. /quiz X1 ≥ 80%.
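For the post-mortem, the first job is locating the offending step range. The sketch below is one simple heuristic, not the lab's prescribed method: it assumes the per-step training loss has already been exported from the mlflow run into a plain list, and flags steps whose loss exceeds a multiple of the trailing median. The function name, window, and ratio are illustrative assumptions; NaN losses are not handled.

```python
from statistics import median


def find_loss_spikes(losses, window=50, ratio=1.5):
    """Return inclusive (start_step, end_step) ranges where loss > ratio * trailing median."""
    spikes = []
    start = None
    for step, loss in enumerate(losses):
        history = losses[max(0, step - window):step]
        # Only judge a step once a full window of history exists.
        is_spike = len(history) >= window and loss > ratio * median(history)
        if is_spike and start is None:
            start = step
        elif not is_spike and start is not None:
            spikes.append((start, step - 1))
            start = None
    if start is not None:  # the series ended while still inside a spike
        spikes.append((start, len(losses) - 1))
    return spikes


if __name__ == "__main__":
    # Synthetic series: flat loss with an injected 5-step spike.
    series = [3.0] * 100 + [9.0] * 5 + [3.0] * 100
    print(find_loss_spikes(series))
```

A median baseline is used rather than a mean so that the spike itself does not drag the baseline up while it is being measured; this holds as long as the spike is shorter than half the window.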
A short reflections.md in learners/borja/extension-X1/ covers: what surprised me about MFU at this scale; how Chinchilla-optimal differed from my pre-run intuition; which cost line dominated (compute / bandwidth / storage / human-hours).
Cost estimate (concrete)
| Line item | Detail | $ |
|---|---|---|
| Lab 00 (one-day pretraining) | 1× A100 80GB spot @ ~$1.10/hr × 24 h | $26 |
| Lab 00 storage egress | 200 GB tokenized data, hot for 24 h | $3 |
| Lab 01 scaling-law triple | 3× ~4 h runs on same node = 12 h × $1.10/hr | $13 |
| Buffer for restarts / debug | ~10% padding | $5 |
| Total ceiling | enforced by budget_guard.py | $50 |
If the actual spot price is higher (A100 80GB has seen $1.79/hr on Lambda on-demand, $0.79/hr lowest-quartile on RunPod community spot), budget_guard.py refuses the launch and you re-plan. Do not silently exceed.
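The refuse-to-launch behaviour can be illustrated with a pre-flight check like the one below. This is a hypothetical stand-in, not the actual budget_guard.py from Phase 35, whose interface is not shown here; the exception and function names are assumptions. The example inputs are the table's line items: 24 h + 12 h of GPU time, plus $3 storage and $5 buffer.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the planned spend would pass the hard cap."""


def check_budget(gpu_hours, usd_per_gpu_hour, fixed_costs_usd=0.0, ceiling_usd=50.0):
    """Return the planned spend, or raise BudgetExceeded if it tops the ceiling."""
    planned = gpu_hours * usd_per_gpu_hour + fixed_costs_usd
    if planned > ceiling_usd:
        raise BudgetExceeded(
            f"planned ${planned:.2f} exceeds ceiling ${ceiling_usd:.2f}; re-plan"
        )
    return planned


if __name__ == "__main__":
    # At the planned spot price the launch is allowed.
    print(f"planned spend: ${check_budget(36, 1.10, fixed_costs_usd=8.0):.2f}")
    # At the on-demand price quoted above, the same plan is refused.
    try:
        check_budget(36, 1.79, fixed_costs_usd=8.0)
    except BudgetExceeded as err:
        print(f"launch refused: {err}")
```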
What this extension intentionally does NOT cover
- Multi-node training. Single GPU only. InfiniBand (inter-node) and NVLink (intra-node) topologies are covered in theory/02-cluster-economics.md for vocabulary, not run.
- A frontier-scale model. 50M params is the run size. We extrapolate to 1B / 7B / 70B in theory; we do not train them.
- Custom CUDA kernels. Phase 24 covers Triton. X1 uses stock torch.compile / FlashAttention-2 from pip.
- MoE / sparse models. Phase 36 territory. X1 stays dense.
- RLHF / SFT on top of the pretrained model. Phase 31 / hypothetical X3 territory. X1 stops at the base model.
- Tokenizer training. We use a pretrained tokenizer (GPT-2 BPE or Llama-3 tokenizer). Tokenizer training at scale is Phase 11 territory; the data-scale of that step is dwarfed by the model-training step we focus on here.
- Hyperparameter search. One configuration, derived from theory. If it diverges, we post-mortem and re-launch once; we do not sweep.
Build-before-abstract policy for X1
The core curriculum rule (CLAUDE.md §0.4): no PyTorch before Phase 25, no transformers before Phase 24. By the time you reach X1, both gates are open. X1 uses:
- PyTorch 2.4+ with torch.compile for the training loop.
- transformers.GPT2Tokenizer (or the Llama-3 tokenizer); we do not retrain a tokenizer.
- datasets for streaming Pile-CC / FineWeb-Edu.
- flash-attn for the attention kernel (CUDA-only; gated behind a runtime check).
- NO accelerate, NO deepspeed, NO lightning. Single-GPU is single-process; we do not need a framework to launch one Python script.
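The runtime check that gates flash-attn could look something like the sketch below. It is an assumption about shape, not the code in train.py: the function name is hypothetical, and the "sdpa" label stands for falling back to PyTorch's built-in scaled-dot-product attention.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flash_attn" only when the package and a CUDA device are both present."""
    if importlib.util.find_spec("flash_attn") is None:
        return "sdpa"  # flash-attn is not installed
    try:
        import torch
    except ImportError:
        return "sdpa"  # no PyTorch at all, so nothing to accelerate
    return "flash_attn" if torch.cuda.is_available() else "sdpa"


if __name__ == "__main__":
    print(f"attention backend: {pick_attention_backend()}")
```

Resolving the backend once at startup and logging it into the run manifest keeps the choice visible when comparing throughput between machines.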
The training script itself is a single ~400-line nanoGPT-shaped file in src/x1_pretrain/train.py. The point is to keep every variable visible — when the loss spikes at hour 14, you can read the loop top-to-bottom without leaving the file.
Next: theory/00-motivation.md.
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Training Compute-Optimal Large Language Models (Chinchilla). Hoffmann et al., 2022. The tokens-vs-params budget you plan against.
- 📄 Scaling Laws for Neural Language Models. Kaplan et al., 2020. The power laws Chinchilla refined.