03 — Data Pipelines at Scale

The difference between a mediocre model and a good one, at equal cost, lies in the data. The canonical pipeline (CommonCrawl → filter → deduplication → tokenization → shards) is not optional; each step removes an order of magnitude of junk and gains a percentage point on MMLU. FineWeb-Edu demonstrates this empirically.

A pretraining dataset for a >1B parameter model is not "a folder of text files." It is a pipeline that processes 10-100 TB of raw web crawl through 5-7 stages, each removing a category of noise.

The canonical pipeline

CommonCrawl WARC → text extraction → language filter →
quality filter → URL/content dedup → PII removal →
tokenize → shard into binary blobs

Per stage, typical drop rates (FineWeb paper, Penedo 2024):

  • WARC → trafilatura text extraction: ~80% drop (boilerplate/HTML).
  • → English-only (FastText): ~50% drop (roughly half of the crawl is non-English).
  • → URL dedup: ~10% drop.
  • → fuzzy text dedup (MinHash): ~20% drop.
  • → quality classifier: 20-90% drop (depends on threshold).
  • → PII scrub: <1% drop.

Net: compounding these rates, a 50 TB raw crawl yields roughly 0.4-3 TB of usable training text, depending on the quality bar.
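As a sanity check on the net figure, the per-stage drops can be compounded directly. The sketch below is illustrative only: the stage rates are copied from the list above, the PII drop is pinned at its 1% upper bound (an assumption), and `surviving_fraction` is a hypothetical helper, not part of any FineWeb tooling.

```python
# Fraction of data DROPPED at each fixed-rate stage, copied from the list above.
# The quality classifier is a range (20-90%), so it is passed in separately.
FIXED_DROPS = {
    "text extraction": 0.80,
    "language filter": 0.50,
    "URL dedup": 0.10,
    "fuzzy dedup": 0.20,
    "PII scrub": 0.01,  # assumption: "<1%" taken at its upper bound
}


def surviving_fraction(quality_drop: float) -> float:
    """Multiply the retention (1 - drop) of every stage together."""
    kept = 1.0 - quality_drop
    for drop in FIXED_DROPS.values():
        kept *= 1.0 - drop
    return kept


RAW_TB = 50.0
strict_tb = RAW_TB * surviving_fraction(0.90)   # aggressive quality threshold
lenient_tb = RAW_TB * surviving_fraction(0.20)  # permissive quality threshold
print(f"strict: {strict_tb:.2f} TB, lenient: {lenient_tb:.2f} TB")
```

With these inputs the yield for a 50 TB crawl spans roughly 0.4 TB (strict classifier) to 2.9 TB (lenient classifier). Real pipelines will land elsewhere if their stage rates differ.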

Source: CommonCrawl

CommonCrawl (commoncrawl.org) is the canonical public web crawl. Released ~monthly as WARC archives. Each dump is ~80-100 TB compressed.

  • Format: WARC (Web ARChive). Each record is HTTP response headers + body.
  • License: crawl metadata is open; content is whatever the source site's license is. (Frontier labs and downstream datasets like Pile-CC, RefinedWeb, FineWeb all use it.)
  • Coverage: ~3 billion pages per dump; on the order of 100 dumps accessible in total.
  • Limitations: robots.txt-compliant (large sites can opt out — the NYT and Reddit have); skewed toward English and Western web; outdated for live data.

Frontier teams typically train on the union of 10-20 CommonCrawl dumps (~500-1000 TB raw) plus high-quality non-crawl sources (Wikipedia, books, arXiv, GitHub).

Quality filters: the modern playbook

Heuristic filters (Gopher-style, Rae 2021):

  • Token length 50-100k per doc.
  • Mean word length 3-10 chars.
  • Symbol-to-word ratio < 0.1.
  • Bullet-point ratio < 0.9.
  • "Lorem ipsum" / placeholder text detection.

These are fast (regex-class) and drop ~30% of post-language-filter text.
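The rules above translate almost directly into a predicate. The following is a minimal sketch rather than the Gopher or FineWeb implementation: it counts whitespace-separated words as a stand-in for tokens, treats `#` and ellipses as the "symbols", and recognises only three bullet markers. All of those choices, and the name `passes_heuristics`, are assumptions made for illustration.

```python
def passes_heuristics(text: str) -> bool:
    """Return True if a document survives the heuristic filters listed above."""
    words = text.split()
    n_words = len(words)

    # Length bound: 50 to 100k units per doc (words used here as a proxy).
    if not 50 <= n_words <= 100_000:
        return False

    # Mean word length between 3 and 10 characters.
    mean_len = sum(len(w) for w in words) / n_words
    if not 3 <= mean_len <= 10:
        return False

    # Symbol-to-word ratio below 0.1 (assumed symbols: '#', '...', and '…').
    symbols = text.count("#") + text.count("...") + text.count("…")
    if symbols / n_words >= 0.1:
        return False

    # Fraction of non-empty lines that start with a bullet marker below 0.9.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    bullets = sum(ln.lstrip().startswith(("-", "*", "•")) for ln in lines)
    if lines and bullets / len(lines) >= 0.9:
        return False

    # Placeholder text.
    if "lorem ipsum" in text.lower():
        return False

    return True
```

Every check is a single pass over the string, which is why this stage is cheap enough to run before any model-based filter.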

Classifier filters (FineWeb-Edu, Penedo 2024):

  • Train a small classifier (e.g. built on snowflake-arctic-embed embeddings) on (doc, score) pairs where score is from an LLM rating "educational value" 0-5.
  • Apply at scale; retain docs scoring ≥3.
  • Result: drops ~10× more data than heuristic-only, but the surviving text trains measurably better models. A 1.8B model trained on 350B FineWeb-Edu tokens beats one trained on 1T raw-CommonCrawl tokens on MMLU.

Domain filters (Llama-3 style):

  • Per-source weighting, e.g. Wikipedia × 3, arXiv × 5, "general web" × 1.
  • This is the silent magic: published filter weights are rare.
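One way to read per-source weighting is as a multiplier on each source's token count before normalising into sampling probabilities. The sketch below assumes that interpretation. The multipliers are the illustrative ones from the text, while the per-source token counts and the helper name `mixture_probs` are hypothetical.

```python
def mixture_probs(tokens: dict[str, float], weight: dict[str, float]) -> dict[str, float]:
    """Sampling probability per source, proportional to tokens * multiplier."""
    effective = {src: n * weight.get(src, 1.0) for src, n in tokens.items()}
    total = sum(effective.values())
    return {src: n / total for src, n in effective.items()}


# Hypothetical corpus sizes in tokens; not taken from any published mix.
corpus_tokens = {"general web": 1000e9, "wikipedia": 4e9, "arxiv": 30e9}
# Multipliers as quoted in the text above.
multipliers = {"general web": 1.0, "wikipedia": 3.0, "arxiv": 5.0}

probs = mixture_probs(corpus_tokens, multipliers)
for source, p in probs.items():
    print(f"{source:12s} {p:.4f}")
```

A multiplier of 3 on a small source means it is seen about three times per pass over the web data, so large multipliers on small sources trade diversity for repetition.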

Dedup: MinHash + LSH

Duplicate documents bias the loss curve and cause the model to memorize them verbatim. Dedup happens at two scales:

Exact (URL-level): trivial. SHA256 of URL.
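A minimal sketch of the URL-level pass, assuming records arrive as `(url, text)` pairs and that the first occurrence is the one to keep. `dedup_by_url` is a hypothetical name, and the sketch hashes the URL string as-is without any normalisation.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_by_url(records: Iterable[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Yield each (url, text) record whose URL has not been seen before."""
    seen: set[bytes] = set()
    for url, text in records:
        # A fixed-size digest keeps the seen-set compact regardless of URL length.
        key = hashlib.sha256(url.encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield url, text
```

At 32 bytes per digest the seen-set stays small relative to the text itself, which is why this stage is considered trivial next to content-level dedup.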

Fuzzy (content-level): MinHash with locality-sensitive hashing.

  • Hash 5-grams of each doc to a 128-int sketch.
  • Bucket sketches into 8-band LSH.
  • Documents in the same bucket are candidate duplicates → Jaccard similarity check → drop if > 0.8.

Why this works: two documents that share 80% of their 5-grams are near-certainly duplicates (boilerplate templates, mirror sites). The MinHash sketch can be computed in one pass and compared in \(O(1)\) per pair within bucket.
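The three steps (sketch, band, verify) can be written out in standard-library Python. This is a toy sketch for intuition, not how datasketch or text-dedup implement it. It assumes word-level 5-grams, uses seeded BLAKE2b hashes in place of random permutations, and verifies candidates with exact Jaccard on the shingle sets. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128            # sketch length, as in the text
BANDS = 8                 # LSH bands, as in the text
ROWS = NUM_PERM // BANDS  # 16 sketch values per band


def shingles(text: str, n: int = 5) -> set[str]:
    """Set of word-level n-grams (assumption: words, not characters)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed: int, shingle: str) -> int:
    data = seed.to_bytes(2, "big") + shingle.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set: set[str]) -> list[int]:
    """One minimum per seeded hash function gives a NUM_PERM-int sketch."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_PERM)]


def candidate_pairs(sketches: dict[str, list[int]]) -> set[tuple[str, str]]:
    """Docs whose sketches agree on every row of at least one band."""
    pairs: set[tuple[str, str]] = set()
    for band in range(BANDS):
        buckets: dict[tuple[int, ...], list[str]] = defaultdict(list)
        for doc_id, sketch in sketches.items():
            buckets[tuple(sketch[band * ROWS:(band + 1) * ROWS])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)


def near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> set[tuple[str, str]]:
    """Candidate pairs from LSH that also pass the Jaccard check."""
    sh = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sk = {doc_id: minhash(s) for doc_id, s in sh.items()}
    return {(a, b) for a, b in candidate_pairs(sk) if jaccard(sh[a], sh[b]) > threshold}
```

LSH is probabilistic: with 8 bands of 16 rows, a pair with Jaccard similarity \(s\) becomes a candidate with probability \(1 - (1 - s^{16})^{8}\), which crosses 50% at a similarity of roughly 0.88. Pairs just above the 0.8 cutoff can therefore be missed, and tuning the band count trades recall against bucket size.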

Tools: datasketch (Python, slow at TB scale), text-dedup (HuggingFace, faster), Spark-based at frontier scale.

Throughput: deduplicating 1 TB on a 64-core CPU box takes ~6 hours. This is the classic bottleneck.

Tokenization: pretraining-specific concerns

The tokenizer choice is downstream of the model architecture, but the pipeline cost of tokenization matters:

  • GPT-2 BPE (50,257 vocab): roughly 4 bytes/token on English. Pretokenize at ~5 MB/s/core.
  • Llama-3 tokenizer (128,256 vocab): compresses English somewhat better than GPT-2 (more bytes per token). Larger vocab → fewer tokens to train on, but more model params spent on embeddings.
  • SentencePiece (Llama-1/2, 32,000 vocab): compresses English somewhat worse than the Llama-3 tokenizer. Older, smaller-vocab default.

For X1: use the GPT-2 tokenizer. It's free, fast, well-documented, and the published Pile / FineWeb-Edu token counts are quoted in its tokens.

Tokenization throughput at scale: 5 TB text / 5 MB/s/core = \(10^6\) core-seconds = 11.6 core-days. On a 64-core box: ~4 hours. On a 1024-core cluster: ~16 minutes. This is not the bottleneck; dedup is.
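The same arithmetic as a reusable function, assuming perfect linear scaling across cores (optimistic, since it ignores I/O and stragglers). `wall_clock_seconds` is a hypothetical helper, and the inputs are the figures quoted above.

```python
def wall_clock_seconds(corpus_bytes: float, bytes_per_core_second: float, cores: int) -> float:
    """Wall-clock time under ideal scaling: total work divided evenly over cores."""
    return corpus_bytes / bytes_per_core_second / cores


CORPUS_BYTES = 5e12  # 5 TB of text
RATE = 5e6           # 5 MB/s per core

core_days = wall_clock_seconds(CORPUS_BYTES, RATE, 1) / 86_400
hours_64 = wall_clock_seconds(CORPUS_BYTES, RATE, 64) / 3_600
minutes_1024 = wall_clock_seconds(CORPUS_BYTES, RATE, 1024) / 60
print(f"{core_days:.1f} core-days, {hours_64:.1f} h on 64 cores, "
      f"{minutes_1024:.1f} min on 1024 cores")
```

Swapping in a measured per-core rate for a different tokenizer is the only change needed to re-run the estimate.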

Sharding for streaming

Once tokenized, the corpus is stored as binary shards — typically 100 MB to 1 GB each, in formats like:

  • mmap uint16 arrays. Simplest. Stack token IDs as numpy.uint16 (works for vocabs of up to 65,536 entries, so GPT-2 fits but the 128,256-entry Llama-3 vocab does not). nanoGPT and Pythia use this.
  • WebDataset .tar shards. Each shard is a .tar of small files; the reader streams them sequentially.
  • mosaicml/streaming MDS format. Optimized for resumable distributed reads.
  • Parquet. Slow for random access; used for raw text storage, not training-time reads.

For X1 lab 00, use the nanoGPT format: train.bin and val.bin, each a flat uint16 array. Reads at 500+ MB/s on NVMe, trivially memory-mappable, no tokenization at training time.
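To make the flat-uint16 layout concrete, here is a dependency-free sketch that writes a shard and reads a window back through a memory map. nanoGPT-style code typically does this with numpy (`np.memmap` with `dtype=np.uint16`). The standard-library `array` and `mmap` modules are used here only so the example runs anywhere, and `write_shard` / `read_window` are hypothetical names.

```python
import mmap
import os
import tempfile
from array import array


def write_shard(path: str, token_ids) -> None:
    """Write token IDs as a flat array of native-endian unsigned 16-bit ints."""
    arr = array("H", token_ids)  # raises OverflowError for IDs above 65,535
    assert arr.itemsize == 2, "platform 'H' is not 16 bits"
    with open(path, "wb") as f:
        arr.tofile(f)


def read_window(path: str, start: int, length: int) -> list[int]:
    """Memory-map the shard and return tokens[start:start+length]."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the raw bytes as uint16 without copying the file.
            with memoryview(mm) as raw, raw.cast("H") as tokens:
                return tokens[start:start + length].tolist()


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "train.bin")
    write_shard(shard, range(1000))
    print(os.path.getsize(shard), read_window(shard, 10, 5))
```

Only the requested window is paged in from disk, so a training loop can sample random offsets from a multi-gigabyte file without loading it.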

Throughput math: can the dataloader feed an A100?

A 50M-param model at MFU 0.40 on an A100 80GB processes:

  • \(0.40 \times 3.12 \times 10^{14}\) FLOP/s (BF16 dense peak) \(\approx 1.25 \times 10^{14}\) FLOP/s sustained / \(6 \cdot 5 \times 10^7\) FLOP/token ≈ 420k tokens/s.

At 2 bytes/token (uint16): ~0.8 MB/s of data read. NVMe SSDs do 3-7 GB/s. Plenty of headroom for 1× GPU.

At 8× H100 with TP=8 training a 7B-param model:

  • Tokens/sec ≈ \(8 \cdot 0.45 \cdot 989 \times 10^{12} / (6 \cdot 7 \times 10^9)\) ≈ 85,000 tokens/s.
  • At 2 bytes/token: 0.17 MB/s per replica. Cluster total: still <1 MB/s.

The model is always compute-bound, not data-bound, once tokenized. The whole "data pipeline at scale" concern is about getting from raw web to tokenized binary, not about feeding the model during training.
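The two estimates above follow the same formula: tokens/s ≈ GPUs × MFU × peak FLOP/s / (6 × params). A small sketch makes it easy to plug in other hardware. The A100 peak of 312 TFLOP/s (dense BF16) is an assumption supplied here. The H100 figure of 989 TFLOP/s and both MFU values come from the text. The function names are hypothetical.

```python
def tokens_per_second(peak_flops: float, mfu: float, n_params: float, n_gpus: int = 1) -> float:
    """Training throughput using the ~6 FLOP per parameter per token rule."""
    return n_gpus * mfu * peak_flops / (6 * n_params)


def read_mb_per_second(tok_per_s: float, bytes_per_token: int = 2) -> float:
    """Disk read rate needed to keep up, for uint16 token IDs by default."""
    return tok_per_s * bytes_per_token / 1e6


a100 = tokens_per_second(peak_flops=312e12, mfu=0.40, n_params=50e6)
h100x8 = tokens_per_second(peak_flops=989e12, mfu=0.45, n_params=7e9, n_gpus=8)

for name, tps in [("1x A100, 50M params", a100), ("8x H100, 7B params", h100x8)]:
    print(f"{name}: {tps:,.0f} tokens/s, {read_mb_per_second(tps):.2f} MB/s")
```

Both cases need under 1 MB/s of token reads, three to four orders of magnitude below what a single NVMe drive sustains.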

What FineWeb-Edu actually does (case study)

FineWeb (Penedo 2024) processes all 96 CommonCrawl dumps from 2013-2024:

  1. Extraction: trafilatura.
  2. Language: FastText, English-only.
  3. Quality (heuristic): Gopher filters tuned per-dump.
  4. Dedup: MinHash at the dump level (not cross-dump — the paper documents this choice as a memory/quality tradeoff).
  5. PII: regex-based, low-precision but high-recall.

Result: 15T tokens of "FineWeb-base" (high quality, large).

For FineWeb-Edu, they take an additional step:

  6. Classifier: LLM-rated subset → train Snowflake-Arctic-embed classifier → apply at scale → retain top ~1.3T tokens.

The 1.3T FineWeb-Edu subset matches the 15T FineWeb-base on MMLU at \(1/10\) the training tokens. This is the data-quality scaling law in numbers.

Cost of building the dataset

For the X1 lab we do not build a dataset. We download a pre-built one. But the order of magnitude:

| Stage | Compute | Cost, USD (cloud spot CPU) |
|---|---|---|
| CommonCrawl download | 50-100 TB egress | $500-2000 (or free from S3 same-region) |
| Extraction + language filter | ~1000 core-hours | $30-100 |
| Quality + dedup | ~10,000 core-hours | $300-1000 |
| Tokenization | ~100 core-hours | $3-10 |
| Storage (1 yr) | 5-15 TB hot | $1500-5000 |
| Total per CC dump processed | | $2k-10k |

Frontier labs process 20-50 dumps. Their dataset-prep cost is $50k-$500k before any model training.

X1 shortcut: download a pre-tokenized FineWeb-Edu sample (~5-10B tokens, ~20 GB) directly from HuggingFace. Total prep time: 10 min. Total prep cost: < $1.

Data quality > data quantity, again

The Chinchilla paper assumes data is fungible — a token is a token. Reality: a token from FineWeb-Edu is worth ~5-10 raw-CommonCrawl tokens for downstream accuracy.

This breaks the "20:1" rule in a specific way: at fixed compute, train on the best data you can get, even if it means D < 20N. A 50M-param model on 3B high-quality tokens beats the same model on 5B mixed-quality tokens.

For X1: choose FineWeb-Edu over raw Pile-CC. Same model, same compute, ~5% lower val loss.

What's intentionally missing

  • Code data. Code (GitHub) is 10-20% of frontier corpora and has its own pipeline (license filtering, language detection, quality scoring). The Stack v2 (Lozhkov 2024) is the canonical dataset. X1 ignores code — we're training a small text-only model.
  • Math data. OpenWebMath, ProofPile. Same — not in X1 scope.
  • Books / arXiv / Wikipedia. Frontier mix includes ~10% non-web text. X1 sticks to FineWeb-Edu for simplicity.
  • Curriculum learning / staged data. Llama-3 uses a data curriculum (lower-quality early, higher-quality late). X1's 24-hour budget doesn't justify staging.
  • Continued pretraining / mid-training data swaps. Phase 19 territory + frontier-lab knowledge. Out of X1 scope.

Next: theory/04-training-stability-at-scale.md.