
Extension X1 — Pretraining at Scale

Requires: 35 — Distributed Training & Inference
Teaches: scaling-laws · chinchilla · mfu · pretraining · cluster-economics

Extension track. Authorized by §A15 (extension addendum, parallel). This module sits outside the 40-phase core curriculum and closes the "no actual large pretraining" gap flagged in HIRING_PATH.md §"Honest gaps" (line 262). It is not a scope expansion of the §A13 microscopic universe: the verb-grammar tutor remains the curriculum capstone. X1 is the post-curriculum module where Borja runs one real pretraining job on a real cloud GPU so that the words "loss spike," "Chinchilla-optimal," and "MFU 0.45" map to lived numbers, not blog posts.

X1 closes the real-pretraining gap flagged in HIRING_PATH.md. A single 24 h cloud run, 50M parameters, ~5B tokens, and a hard $50 budget: enough to feel the dynamics (loss curve, spikes, MFU, cost per token) without needing a frontier-lab budget.


Status

  • Type: post-curriculum extension (not a phase).
  • Prerequisites: Phase 18 (training loop), Phase 26 (quantization, for the inference-cost epilogue), Phase 35 (distributed parallelism vocabulary). The lab assumes you have already shipped a working mlflow + safetensors + manifest pipeline from Phase 18.
  • Hardware: 1× cloud GPU (A100 80GB target). No multi-node. Multi-GPU is described in theory only — runs are single-GPU.
  • Spend ceiling: $50 (hard cap, enforced by budget_guard.py reused from Phase 35).
  • Duration: ~3 days of wall-clock effort over ~2 weeks elapsed (theory reading + one ~24 h cloud run + one scaling-law triple).

What this extension teaches

By the end of X1 you can:

  1. State and apply the Chinchilla compute-optimal rule (\(D_\text{opt} \approx 20 \cdot N\), where \(N\) is parameter count and \(D\) is training tokens) and explain when it under- or over-estimates the token budget.
  2. Estimate the dollar cost of training an arbitrary \(N\)-parameter model to \(D\) tokens on a specified cluster, including MFU, $/GPU-hour spot/on-demand, and energy.
  3. Run a 24-hour single-GPU pretraining job on lambda.ai or runpod.io, train a ~50M-parameter decoder-only transformer on ~5B tokens of Pile-CC / FineWeb-Edu, and produce a reproducible loss curve.
  4. Fit a scaling law from a 3-point sweep (5M / 25M / 100M params, fixed token budget) and predict the compute-optimal token count for a hypothetical 1B-param model.
  5. Diagnose a loss spike post-hoc from a mlflow run: identify the offending batch range, classify the failure mode (rare-token gradient, Adam β₂ explosion, learning-rate-too-high, dataset corruption), and propose the appropriate recovery (skip-batch, restart-from-prior-ckpt, clip-tighter, μP rescale).
  6. Read a real frontier pretraining log (OLMo, Pythia, Falcon, Llama) and name the architectural and data choices that drove each line.
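
Outcomes 1 and 2 come down to a few lines of arithmetic. The sketch below is a minimal illustration, not the lab's code. It assumes the common approximation of about \(6 \cdot N \cdot D\) training FLOPs for a dense transformer and an A100 dense BF16 peak of 312 TFLOP/s. All function names are hypothetical, and the MFU and price in the usage lines are placeholder inputs, not measured results.

```python
A100_PEAK_FLOPS = 312e12  # assumption: A100 dense BF16 peak, in FLOP/s


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token count under the ~20 tokens/param rule of thumb."""
    return tokens_per_param * n_params


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer: C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def mfu_from_throughput(n_params: float, tokens_per_second: float,
                        peak_flops: float = A100_PEAK_FLOPS) -> float:
    """Model FLOPs utilization: achieved model FLOP/s divided by hardware peak."""
    return 6.0 * n_params * tokens_per_second / peak_flops


def train_gpu_hours(n_params: float, n_tokens: float, mfu: float,
                    peak_flops: float = A100_PEAK_FLOPS) -> float:
    """GPU-hours needed to process n_tokens at a given MFU."""
    seconds = train_flops(n_params, n_tokens) / (peak_flops * mfu)
    return seconds / 3600.0


def train_cost_usd(n_params: float, n_tokens: float, mfu: float,
                   usd_per_gpu_hour: float,
                   peak_flops: float = A100_PEAK_FLOPS) -> float:
    """Compute-only dollar cost (ignores storage, egress, and energy)."""
    return train_gpu_hours(n_params, n_tokens, mfu, peak_flops) * usd_per_gpu_hour


if __name__ == "__main__":
    n = 1e9                      # hypothetical 1B-parameter model
    d = chinchilla_tokens(n)     # 20 tokens per parameter
    print(f"tokens: {d:.3g}, FLOPs: {train_flops(n, d):.3g}")
    print(f"GPU-hours at MFU 0.45: {train_gpu_hours(n, d, 0.45):.1f}")
    print(f"cost at $1.10/hr: ${train_cost_usd(n, d, 0.45, 1.10):.0f}")
```

The same functions work in reverse: log tokens per second during a run, pass it to `mfu_from_throughput`, and compare the result against the MFU you assumed when budgeting.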

Read order

Theory (read first, in order)

  1. theory/00-motivation.md — why one cloud run; the gap this closes; what you can and cannot learn from $50 of pretraining.
  2. theory/01-scaling-laws.md — Kaplan 2020, Hoffmann (Chinchilla) 2022; the 20-token/param rule; data-vs-params Pareto; over- and under-training regimes.
  3. theory/02-cluster-economics.md — H100 / A100 / H200 specs; $/GPU-hour spot vs on-demand; bandwidth costs; energy; the 7B-for-$1M math worked out.
  4. theory/03-data-pipelines-at-scale.md — CommonCrawl → quality filter → dedup → tokenize → shard; FineWeb-Edu; throughput math; sharding for streaming.
  5. theory/04-training-stability-at-scale.md — loss spikes, μP, weight decay; recovery procedures; mid-training interventions (LR resets, curriculum, data swaps).
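
Locating a loss spike in a logged run can be sketched with a trailing-median test. This is a minimal illustration that assumes the per-step training loss has been exported as a plain list; the function name, window size, and ratio threshold are hypothetical choices, not values taken from the theory chapters.

```python
import math
from statistics import median


def find_loss_spikes(losses, window=50, ratio=1.5):
    """Return inclusive (start_step, end_step) ranges where the loss is
    non-finite or exceeds ratio * median of the preceding `window` steps."""
    spikes = []
    start = None
    for step, loss in enumerate(losses):
        if step < window:
            continue  # not enough history for a baseline yet
        history = [x for x in losses[step - window:step] if math.isfinite(x)]
        if not history:
            continue  # no usable baseline in this window
        baseline = median(history)
        is_spike = (not math.isfinite(loss)) or loss > ratio * baseline
        if is_spike and start is None:
            start = step
        elif not is_spike and start is not None:
            spikes.append((start, step - 1))
            start = None
    if start is not None:  # the series ended while still inside a spike
        spikes.append((start, len(losses) - 1))
    return spikes


if __name__ == "__main__":
    # Synthetic series for illustration only: flat loss with a 3-step bump.
    series = [3.0] * 100
    series[60:63] = [6.0, 7.0, 5.0]
    print(find_loss_spikes(series))
```

The median is used instead of the mean so that the spike itself does not drag the baseline upward, which holds as long as the spike is shorter than half the window. The returned step range is the starting point for pulling the matching batches and gradient-norm logs.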

Lab (do after theory)

  1. lab/00-one-day-cloud-pretraining.md — the one real run. ~24 h on 1× A100 80GB. Hard $50 cap.
  2. lab/01-scaling-laws-experiment.md — three smaller runs (5M / 25M / 100M params) on a fixed compute budget. Fit Chinchilla curve. Predict 1B-param optimal.

Connections to core phases

  • Phase 18 — Training loop. The optimizer, scheduler, checkpointing, and mlflow wiring you ship in Phase 18 are the substrate X1 runs on. The X1 lab adds a streaming dataloader and a token-budget runner; everything else is reused. (docs/phase-18-training-loop/)
  • Phase 26 — Quantization. After pretraining, you quantize the 50M checkpoint to int8 / int4 and measure inference $/token. The capstone artifact of X1 is a cost-per-token number for the model you trained versus the open-weights baselines. (docs/phase-26-quantization/)
  • Phase 35 — Distributed. X1 is single-GPU. The "what changes at >10B params" content of theory/02 and theory/04 is grounded in Phase 35's TP/PP/ZeRO-3 vocabulary. If those words don't yet map to mechanism, finish Phase 35 first. (docs/phase-35-distributed/)
  • Phase 36 — Frontier architectures. The model-shape choice for the lab (depth, width, head count, GQA on/off) cites Phase 36's design space. The X1 lab uses a mid-2024-default shape (32 layers × 768 width × 12 heads × GQA-4); Phase 36 is where you'd justify deviating from it. (docs/phase-36-frontier-architectures/)

Definition of Done (extension-track DoD)

Extension tracks do not have a PHASE_NN_REPORT.md (they're outside the 40-phase ritual). X1 ships these four binary checks:

  1. One reproducible cloud run. experiments/X1-pretraining/run-cloud/ contains: manifest.json (seed, versions, config, cluster, $-spent), mlflow URI, loss-curve.png, and safetensors checkpoint. The run hit the 24-hour wall-clock and stayed under $50.
  2. Scaling-law triple done. experiments/X1-pretraining/scaling-law/ has three runs (5M, 25M, 100M params), one CSV of (params, tokens, val-loss), one fitted curve plot, and one written prediction for 1B-param compute-optimal token count with a confidence interval.
  3. Loss-spike post-mortem. experiments/X1-pretraining/spike-postmortem.md exists with: timestamp of the detected spike (if no real spike occurred, inject a synthetic one), classification, evidence (gradient norm log, batch-token-histogram), recovery action taken, and outcome.
  4. Quiz. /quiz X1 scores ≥ 80%.
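
One building block of check 2 is fitting a power law to the (params, val-loss) rows of the CSV. The sketch below uses a plain least-squares fit of \(L(N) = a \cdot N^{-\alpha}\) in log-log space. This is a deliberate simplification: it is not the full Chinchilla parametric form (which also has an irreducible-loss term and a data term), and three points cannot constrain that richer form. The function names are hypothetical, and the data points are synthetic placeholders generated from a known curve, not measured results.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space.

    Returns (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, alpha, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-alpha)


if __name__ == "__main__":
    # Synthetic placeholder data: generated from a = 10, alpha = 0.07.
    params = [5e6, 25e6, 100e6]
    val_loss = [10.0 * n ** -0.07 for n in params]
    a, alpha = fit_power_law(params, val_loss)
    print(f"a = {a:.3f}, alpha = {alpha:.3f}")
    print(f"extrapolated loss at 1B params: {predict_loss(a, alpha, 1e9):.3f}")
```

With only three points the fit has a single residual degree of freedom, so any confidence interval on the 1B extrapolation will be wide. Turning this into a compute-optimal token count additionally requires the tokens column and a model of how loss depends on \(D\).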

A short reflections.md in learners/borja/extension-X1/ covers: what surprised me about MFU at this scale; how Chinchilla-optimal differed from my pre-run intuition; which cost line dominated (compute / bandwidth / storage / human-hours).

Cost estimate (concrete)

| Line item | Detail | Cost |
|---|---|---|
| Lab 00 (one-day pretraining) | 1× A100 80GB spot @ ~$1.10/hr × 24 h | $26 |
| Lab 00 storage egress | 200 GB tokenized data, hot for 24 h | $3 |
| Lab 01 scaling-law triple | 3× ~4 h runs on same node = 12 h × $1.10/hr | $13 |
| Buffer for restarts / debug | ~10% padding | $5 |
| Total ceiling | enforced by budget_guard.py | $50 |

If the actual price is higher than the ~$1.10/hr planned above (A100 80GB prices have ranged from $0.79/hr, the lowest quartile on RunPod community spot, to $1.79/hr on Lambda on-demand), budget_guard.py refuses the launch and you re-plan. Do not silently exceed the cap.
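
The refuse-to-launch behaviour can be illustrated with a short pre-launch check. This is a hypothetical sketch of the idea only; it does not reproduce the real budget_guard.py from Phase 35, and the function name, exception class, and arguments are all assumptions.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a planned launch would push spend past the ceiling."""


def check_launch(planned_gpu_hours, usd_per_gpu_hour, spent_usd=0.0,
                 fixed_costs_usd=0.0, ceiling_usd=50.0):
    """Return the projected total spend, or raise if it exceeds the ceiling.

    projected = already spent + fixed costs (e.g. storage) + planned compute
    """
    projected = spent_usd + fixed_costs_usd + planned_gpu_hours * usd_per_gpu_hour
    if projected > ceiling_usd:
        raise BudgetExceeded(
            f"projected ${projected:.2f} exceeds ceiling ${ceiling_usd:.2f}"
        )
    return projected


if __name__ == "__main__":
    # 24 h (Lab 00) + 12 h (Lab 01) at the planned price, plus $3 storage.
    print(check_launch(36, 1.10, fixed_costs_usd=3.0))
    try:
        check_launch(36, 1.79, fixed_costs_usd=3.0)  # on-demand price
    except BudgetExceeded as err:
        print("launch refused:", err)
```

Checking the projection before the instance starts is what makes the cap hard: a guard that only reports spend after the fact cannot refuse anything.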

What this extension intentionally does NOT cover

  • Multi-node training. Single GPU only. Inter-node InfiniBand / NVLink topologies are covered in theory/02-cluster-economics.md for vocabulary, not run.
  • A frontier-scale model. 50M params is the run size. We extrapolate to 1B / 7B / 70B in theory; we do not train them.
  • Custom CUDA kernels. Phase 24 covers Triton. X1 uses stock torch.compile / FlashAttention-2 from pip.
  • MoE / sparse models. Phase 36 territory. X1 stays dense.
  • RLHF / SFT on top of the pretrained model. Phase 31 / hypothetical X3 territory. X1 stops at the base model.
  • Tokenizer training. We use a pretrained tokenizer (GPT-2 BPE or Llama-3 tokenizer). Tokenizer training at scale is Phase 11 territory; the data-scale of that step is dwarfed by the model-training step we focus on here.
  • Hyperparameter search. One configuration, derived from theory. If it diverges, we post-mortem and re-launch once; we do not sweep.

Build-before-abstract policy for X1

The core curriculum rule (CLAUDE.md §0.4): no PyTorch before Phase 25, no transformers before Phase 24. By the time you reach X1, both gates are open. X1 uses:

  • PyTorch 2.4+ with torch.compile for the training loop.
  • transformers.GPT2Tokenizer (or the Llama-3 tokenizer) — we do not retrain a tokenizer.
  • datasets for streaming Pile-CC / FineWeb-Edu.
  • flash-attn for the attention kernel (CUDA-only; gated behind a runtime check).
  • NO accelerate, NO deepspeed, NO lightning. Single-GPU is single-process; we do not need a framework to launch one Python script.
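
The runtime check that gates flash-attn might look like the sketch below. It is an assumption about how such a gate could be written, not the contents of train.py: the function name and the `ATTENTION_BACKEND` constant are hypothetical, and the fallback label "sdpa" stands for PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`.

```python
import importlib.util


def flash_attn_available() -> bool:
    """True only if the flash_attn package is installed and CUDA is usable."""
    if importlib.util.find_spec("flash_attn") is None:
        return False  # package not installed
    try:
        import torch
    except ImportError:
        return False  # no PyTorch, so no CUDA kernels either
    return torch.cuda.is_available()


# Decide once at startup; the training loop branches on this constant.
ATTENTION_BACKEND = "flash_attn" if flash_attn_available() else "sdpa"

if __name__ == "__main__":
    print("attention backend:", ATTENTION_BACKEND)
```

Using `find_spec` avoids importing the CUDA extension on machines where it cannot load, so the same script still starts on a CPU-only laptop for smoke tests.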

The training script itself is a single ~400-line nanoGPT-shaped file in src/x1_pretrain/train.py. The point is to keep every variable visible — when the loss spikes at hour 14, you can read the loop top-to-bottom without leaving the file.


Next: theory/00-motivation.md.

Further reading

Optional — enrichment, not required to pass the phase.