03 — Data Pipelines at Scale¶
The difference between a mediocre model and a good one, at matched cost, lies in the data. The canonical pipeline (CommonCrawl → filter → deduplication → tokenization → shards) is not optional; each step removes an order of magnitude of junk and gains a percentage point on MMLU. FineWeb-Edu demonstrates this empirically.
A pretraining dataset for a >1B parameter model is not "a folder of text files." It is a pipeline that processes 10-100 TB of raw web crawl through 5-7 stages, each removing a category of noise.
The canonical pipeline¶
CommonCrawl WARC → text extraction → language filter →
quality filter → URL/content dedup → PII removal →
tokenize → shard into binary blobs
Typical per-stage drop rates (FineWeb paper, Penedo 2024):
- WARC → trafilatura text extraction: ~80% drop (boilerplate/HTML).
- → English-only (FastText): ~50% drop (roughly half of the crawl is non-English).
- → URL dedup: ~10% drop.
- → fuzzy text dedup (MinHash): ~20% drop.
- → quality classifier: 20-90% drop (depends on threshold).
- → PII scrub: <1% drop.
Net: multiplying these rates through, a 50 TB raw crawl yields roughly 0.4-3 TB of usable training text, depending on the quality bar.
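The net figure is just the product of the per-stage survival rates. A minimal sketch of that arithmetic, assuming the drop rates quoted above and treating the "<1%" PII figure as exactly 1% (an assumption made only to have a number to multiply):

```python
# Fraction of data DROPPED at each stage, as quoted in the list above.
# Each entry is (stage, lenient drop, strict drop); only the quality
# classifier has a real range. The 1% PII value is an assumed upper bound.
STAGE_DROP = [
    ("text extraction", 0.80, 0.80),
    ("language filter", 0.50, 0.50),
    ("URL dedup", 0.10, 0.10),
    ("MinHash dedup", 0.20, 0.20),
    ("quality classifier", 0.20, 0.90),
    ("PII scrub", 0.01, 0.01),
]


def surviving_fraction(stages):
    """Return (strict, lenient) fraction of raw bytes that survive all stages."""
    strict = lenient = 1.0
    for _name, drop_lenient, drop_strict in stages:
        lenient *= 1.0 - drop_lenient
        strict *= 1.0 - drop_strict
    return strict, lenient


raw_tb = 50
lo, hi = surviving_fraction(STAGE_DROP)
print(f"{raw_tb} TB raw -> {raw_tb * lo:.2f} to {raw_tb * hi:.2f} TB kept")
```

With these inputs the range comes out at about 0.36 to 2.85 TB, the same order of magnitude as the net figure above. The quality threshold accounts for almost all of the spread.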
Source: CommonCrawl¶
CommonCrawl (commoncrawl.org) is the canonical public web crawl. Released ~monthly as WARC archives. Each dump is ~80-100 TB compressed.
- Format: WARC (Web ARChive). Each record is HTTP response headers + body.
- License: crawl metadata is open; content is whatever the source site's license is. (Frontier labs and downstream datasets like Pile-CC, RefinedWeb, FineWeb all use it.)
- Coverage: ~3 billion pages per dump; on the order of 100 dumps accessible in total (FineWeb draws on 96, spanning 2013-2024).
- Limitations: robots.txt-compliant (large sites can opt out — the NYT and Reddit have); skewed toward English and Western web; outdated for live data.
Frontier teams typically train on the union of 10-20 CommonCrawl dumps (~800-2,000 TB of compressed WARC at 80-100 TB per dump) plus high-quality non-crawl sources (Wikipedia, books, arXiv, GitHub).
Quality filters: the modern playbook¶
Heuristic filters (Gopher-style, Rae 2021):
- Document length of 50-100,000 words.
- Mean word length 3-10 chars.
- Symbol-to-word ratio < 0.1.
- Bullet-point ratio < 0.9.
- "Lorem ipsum" / placeholder text detection.
These are fast (regex-class) and drop ~30% of post-language-filter text.
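The rules above translate almost directly into code. The sketch below is illustrative rather than the Gopher implementation: the function name `gopher_keep` is hypothetical, length is measured in whitespace-separated words, and the choice of which characters count as "symbols" or bullet markers is an assumption.

```python
import re


def gopher_keep(text: str) -> bool:
    """Return True if `text` passes the heuristic rules listed above."""
    words = text.split()
    # Length rule: between 50 and 100,000 whitespace-separated words.
    if not 50 <= len(words) <= 100_000:
        return False
    # Mean word length between 3 and 10 characters.
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False
    # Symbol-to-word ratio; "symbol" here means '#' or an ellipsis (assumption).
    symbols = text.count("#") + text.count("...") + text.count("…")
    if symbols / len(words) >= 0.1:
        return False
    # Share of non-empty lines that start with a bullet marker (assumed set).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullets = sum(ln.startswith(("-", "*", "•")) for ln in lines)
    if lines and bullets / len(lines) >= 0.9:
        return False
    # Placeholder text.
    if re.search(r"lorem ipsum", text, re.IGNORECASE):
        return False
    return True
```

Every check is a single pass over the string, which is why this class of filter is cheap enough to run before any model-based scoring.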
Classifier filters (FineWeb-Edu, Penedo 2024):
- Train a small classifier (e.g. on snowflake-arctic-embed embeddings) on (doc, score) pairs, where the score is an LLM's 0-5 rating of "educational value".
- Apply at scale; retain docs scoring ≥3.
- Result: keeps only about a tenth of the heuristic-filtered data (1.3T of 15T tokens), but quality goes way up. A 1.8B model on 350B FineWeb-Edu tokens beats 1T raw-CommonCrawl tokens on MMLU.
Domain filters (Llama-3 style):
- Per-source weighting, e.g. upsampling Wikipedia × 3 and arXiv × 5 relative to "general web" × 1.
- This is the silent magic: published mixture weights are rare.
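To make per-source weighting concrete, the sketch below turns multipliers into sampling probabilities. The corpus sizes are hypothetical placeholders, and the multipliers are the illustrative ones from the text, not published Llama-3 values.

```python
# Hypothetical corpus sizes in billions of tokens, paired with the
# illustrative upsampling multipliers mentioned above.
SOURCES = {
    "general_web": {"tokens_b": 1000.0, "weight": 1.0},
    "wikipedia": {"tokens_b": 5.0, "weight": 3.0},
    "arxiv": {"tokens_b": 30.0, "weight": 5.0},
}


def mixture(sources):
    """Sampling probability per source, proportional to size * weight."""
    mass = {name: s["tokens_b"] * s["weight"] for name, s in sources.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


probs = mixture(SOURCES)
for name, p in probs.items():
    print(f"{name:12s} {p:.3f}")
```

Even with a × 3 multiplier, a small source stays a small share of the mix. The multiplier mainly controls how many times that source is repeated over the course of training.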
Dedup: MinHash + LSH¶
Duplicate documents bias the loss curve and encourage verbatim memorization. Dedup happens at two scales:
Exact (URL-level): trivial. SHA256 of URL.
Fuzzy (content-level): MinHash with locality-sensitive hashing.
- Hash 5-grams of each doc to a 128-int sketch.
- Bucket sketches into 8-band LSH (16 sketch values per band).
- Documents in the same bucket are candidate duplicates → Jaccard similarity check → drop if > 0.8.
Why this works: two documents whose 5-gram sets have a Jaccard similarity of 0.8 are near-certainly duplicates (boilerplate templates, mirror sites). The MinHash sketch can be computed in one pass, and comparing two sketches costs a fixed 128 integer comparisons regardless of document length, i.e. \(O(1)\) per pair within a bucket.
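The scheme is small enough to write out in pure Python. This is a teaching sketch, not how `datasketch` or `text-dedup` implement it. Word-level 5-grams, the affine hash family, the fixed seed, and every function name are assumptions made for illustration. The final check uses the sketch-based Jaccard estimate, not an exact set comparison.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

NUM_PERM, BANDS = 128, 8
ROWS = NUM_PERM // BANDS        # 16 sketch values per band
PRIME = (1 << 61) - 1           # Mersenne prime used by the affine hashes
_rng = random.Random(0)         # fixed seed so all sketches are comparable
COEFFS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(NUM_PERM)]


def shingles(text, n=5):
    """Set of word 5-grams (word-level is an assumption; char n-grams also work)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(text):
    """128-integer sketch: the minimum of each affine hash over all shingles."""
    base = [int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
            for s in shingles(text)]
    return [min((a * x + b) % PRIME for x in base) for a, b in COEFFS]


def estimated_jaccard(sk1, sk2):
    """The fraction of matching sketch positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sk1, sk2)) / NUM_PERM


def near_duplicates(docs, threshold=0.8):
    """Return sorted id pairs whose estimated Jaccard exceeds `threshold`."""
    sketches = {doc_id: minhash(text) for doc_id, text in docs.items()}
    buckets = defaultdict(list)
    for doc_id, sk in sketches.items():
        for band in range(BANDS):
            key = (band, tuple(sk[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    # Only documents sharing at least one band bucket are ever compared.
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return sorted(p for p in candidates
                  if estimated_jaccard(sketches[p[0]], sketches[p[1]]) > threshold)
```

The banding is what avoids the all-pairs comparison. A pair only becomes a candidate if all 16 values in some band agree, which is likely for near-identical documents and vanishingly unlikely for unrelated ones.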
Tools: datasketch (Python, slow at TB scale), text-dedup (HuggingFace, faster), Spark-based at frontier scale.
Throughput: dedup of 1 TB on a 64-core CPU box: ~6 hours. The classic bottleneck.
Tokenization: pretraining-specific concerns¶
The tokenizer choice is downstream of the model architecture, but the pipeline cost of tokenization matters:
- GPT-2 BPE (50,257 vocab): roughly 4 bytes/token on English. Pretokenize at ~5 MB/s/core.
- Llama-3 tokenizer (128,256 vocab): more bytes per token than GPT-2 BPE on English, so the same text yields fewer tokens. Larger vocab → fewer tokens to train on, more model params spent on embedding.
- SentencePiece (Llama-1/2, 32,000 vocab): older, smaller-vocab default. Meta reports that the Llama-3 tokenizer needs up to 15% fewer tokens than this one.
For X1: use the GPT-2 tokenizer. It's free, fast, well-documented, and the published Pile / FineWeb-Edu token counts are quoted in its tokens.
Tokenization throughput at scale: 5 TB text / 5 MB/s/core = \(10^6\) core-seconds ≈ 11.6 core-days. On a 64-core box: ~4 hours. On a 1024-core cluster: ~16 minutes. This is not the bottleneck; dedup is.
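The same estimate as a small helper, assuming perfect scaling across cores (real jobs lose some time to I/O and stragglers). The function name is hypothetical.

```python
def tokenization_wall_clock(corpus_bytes, bytes_per_sec_per_core, cores):
    """Wall-clock seconds, assuming throughput scales linearly with cores."""
    core_seconds = corpus_bytes / bytes_per_sec_per_core
    return core_seconds / cores


TB, MB = 10**12, 10**6
for cores in (1, 64, 1024):
    secs = tokenization_wall_clock(5 * TB, 5 * MB, cores)
    print(f"{cores:5d} cores: {secs / 3600:8.2f} h")
```

Swapping in a measured per-core rate for your own tokenizer is the only change needed to reuse this for a different setup.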
Sharding for streaming¶
Once tokenized, the corpus is stored as binary shards — typically 100 MB to 1 GB each, in formats like:
- mmap uint16 arrays. Simplest. Stack token IDs as numpy.uint16 (works for token IDs ≤65,535). nanoGPT and Pythia use this.
- WebDataset .tar shards. Each shard is a .tar of small files; the reader streams them. Used by Mosaic Streaming.
- mosaicml/streaming MDS format. Optimized for resumable distributed reads.
- Parquet. Slow for random access; used for raw text storage, not training-time reads.
For X1 lab 00, use the nanoGPT format: train.bin and val.bin, each a flat uint16 array. Reads at 500+ MB/s on NVMe, trivially memory-mappable, no tokenization at training time.
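A flat uint16 file needs no library at all to write or read. The sketch below uses only the standard library (`array` and `mmap`) in place of numpy, purely to stay dependency-free; with numpy, `numpy.memmap` would be the usual tool. The helper names and the temporary path are hypothetical.

```python
import mmap
import os
import tempfile
from array import array


def write_shard(path, token_ids):
    """Write token ids as a flat uint16 array in native byte order."""
    arr = array("H", token_ids)  # "H" = unsigned 16-bit; ids above 65535 raise OverflowError
    with open(path, "wb") as f:
        arr.tofile(f)


def read_window(path, start, length):
    """Memory-map the shard and copy out `length` tokens starting at `start`."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as raw, raw.cast("H") as view:
        # Slicing the view touches only the pages that back this window.
        return view[start:start + length].tolist()


path = os.path.join(tempfile.mkdtemp(), "train.bin")  # hypothetical location
write_shard(path, range(1000))
print(read_window(path, 10, 5))
```

Because the file is just consecutive 2-byte integers, a training loop can jump to any token offset without an index. That is what makes random-window sampling over a multi-gigabyte shard cheap.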
Throughput math: can the dataloader feed an A100?¶
A 50M-param model at MFU 0.40 on an A100 80GB (312 TFLOP/s peak dense BF16) processes:
- \(1.25 \times 10^{14}\) FLOP/s sustained / \((6 \cdot 5 \times 10^7)\) FLOP/token ≈ 420k tokens/s.
At 2 bytes/token (uint16): ~0.8 MB/s of data read. NVMe SSDs do 3-7 GB/s. Plenty of headroom for 1× GPU.
At 8× H100 with TP=8 doing a 7B-param model:
- Tokens/sec ≈ \(8 \cdot 0.45 \cdot 989 \times 10^{12} / (6 \cdot 7 \times 10^9)\) ≈ 85,000 tokens/s.
- At 2 bytes/token: 0.17 MB/s per replica. Cluster total: still <1 MB/s.
The model is always compute-bound, not data-bound, once tokenized. The whole "data pipeline at scale" concern is about getting from raw web to tokenized binary, not about feeding the model during training.
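The compute-bound claim can be checked for any configuration with two one-line functions. This sketch reruns the 8× H100 case from above using the 6 FLOPs per parameter per token approximation; the function names are hypothetical, and the NVMe figure is the low end of the range quoted earlier.

```python
def tokens_per_second(n_gpus, mfu, peak_flops, n_params):
    """Throughput under the ~6 FLOPs per parameter per token approximation."""
    return n_gpus * mfu * peak_flops / (6 * n_params)


def read_rate_mb_per_s(tok_per_s, bytes_per_token=2):
    """Data rate the loader must sustain; 2 bytes/token for uint16 ids."""
    return tok_per_s * bytes_per_token / 1e6


tok_s = tokens_per_second(n_gpus=8, mfu=0.45, peak_flops=989e12, n_params=7e9)
need = read_rate_mb_per_s(tok_s)
nvme_mb_per_s = 3000  # low end of the 3-7 GB/s NVMe range
print(f"{tok_s:,.0f} tokens/s needs {need:.2f} MB/s; "
      f"NVMe headroom is about {nvme_mb_per_s / need:,.0f}x")
```

The required read rate falls as the model grows, because FLOPs per token scale with parameter count while bytes per token stay fixed.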
What FineWeb-Edu actually does (case study)¶
FineWeb (Penedo 2024) processes all 96 CommonCrawl dumps from 2013-2024:
1. Extraction: trafilatura.
2. Language: FastText, English-only.
3. Quality (heuristic): Gopher filters tuned per-dump.
4. Dedup: MinHash at the dump level (not cross-dump; the paper documents this choice as a memory/quality tradeoff).
5. PII: regex-based, low-precision but high-recall.
Result: 15T tokens of "FineWeb-base" (high quality, large).
For FineWeb-Edu, they take an additional step:
6. Classifier: LLM-rated subset → train Snowflake-Arctic-embed classifier → apply at scale → retain top ~1.3T tokens.
The 1.3T FineWeb-Edu subset matches the 15T FineWeb-base on MMLU at \(1/10\) the training tokens. This is the data-quality scaling law in numbers.
Cost of building the dataset¶
For the X1 lab we do not build a dataset. We download a pre-built one. But the order of magnitude:
| Stage | Compute | $-cost (cloud spot CPU) |
|---|---|---|
| CommonCrawl download | 50-100 TB egress | $500-2000 (or free from S3 same-region) |
| Extraction + language filter | ~1000 core-hours | $30-100 |
| Quality + dedup | ~10,000 core-hours | $300-1000 |
| Tokenization | ~100 core-hours | $3-10 |
| Storage (1 yr) | 5-15 TB hot | $1500-5000 |
| Total per CC dump processed | — | $2k-10k |
Frontier labs process 20-50 dumps. Their dataset-prep cost is $50k-$500k before any model training.
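As a sanity check on the totals, the per-stage dollar ranges in the table can simply be summed and scaled by the number of dumps. The dictionary below copies the table's figures; the function name is hypothetical, and scaling storage linearly with dump count is a simplifying assumption.

```python
# (low, high) dollar ranges per stage, copied from the table above.
STAGE_COST = {
    "download": (500, 2000),
    "extraction + language filter": (30, 100),
    "quality + dedup": (300, 1000),
    "tokenization": (3, 10),
    "storage (1 yr)": (1500, 5000),
}


def total_range(costs, n_dumps=1):
    """Sum the low and high ends across stages, scaled by number of dumps."""
    low = sum(c[0] for c in costs.values()) * n_dumps
    high = sum(c[1] for c in costs.values()) * n_dumps
    return low, high


per_dump = total_range(STAGE_COST)
print(f"per dump: ${per_dump[0]:,} to ${per_dump[1]:,}")
print(f"20 dumps (low end):  ${total_range(STAGE_COST, 20)[0]:,}")
print(f"50 dumps (high end): ${total_range(STAGE_COST, 50)[1]:,}")
```

Storage and egress dominate both ends of the range. Tokenization is a rounding error, which matches the earlier point that it is not the bottleneck.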
X1 shortcut: download a pre-tokenized FineWeb-Edu sample (~5-10B tokens, ~20 GB) directly from HuggingFace. Total prep time: 10 min. Total prep cost: < $1.
Data quality > data quantity, again¶
The Chinchilla paper assumes data is fungible — a token is a token. Reality: a token from FineWeb-Edu is worth ~5-10 raw-CommonCrawl tokens for downstream accuracy.
This breaks the "20:1" rule in a specific way: for fixed-compute, train on the best data you can get, even if it means D < 20N. A 50M-param model on 3B high-quality tokens beats the same model on 5B mixed-quality tokens.
For X1: choose FineWeb-Edu over raw Pile-CC. Same model, same compute, ~5% lower val loss.
What's intentionally missing¶
- Code data. Code (GitHub) is 10-20% of frontier corpora and has its own pipeline (license filtering, language detection, quality scoring). The Stack v2 (Lozhkov 2024) is the canonical dataset. X1 ignores code — we're training a small text-only model.
- Math data. OpenWebMath, ProofPile. Same — not in X1 scope.
- Books / arXiv / Wikipedia. Frontier mix includes ~10% non-web text. X1 sticks to FineWeb-Edu for simplicity.
- Curriculum learning / staged data. Llama-3 uses a data curriculum (lower-quality early, higher-quality late). X1's 24-hour budget doesn't justify staging.
- Continued pretraining / mid-training data swaps. Phase 19 territory + frontier-lab knowledge. Out of X1 scope.
Next: theory/04-training-stability-at-scale.md.