Extension X2 — Multi-modal Models
Requires: 15 — Attention from Scratch · 17 — Tiny Transformer Block & Mini-GPT · 28 — Fine-Tuning, LoRA, QLoRA
Teaches: vit · clip · whisper · contrastive-learning · multimodal-fusion
Extension track. Authorized by §A15 (extension addendum, parallel). This module sits outside the 40-phase core curriculum and closes the "vision + audio" gap flagged in `HIRING_PATH.md` §"Honest gaps" (the multi-modal disclaimer at line 263). It is not a scope expansion of the §A13 microscopic universe: the model trained in the labs here still operates over the 20-verb × 5-tense × 3-person grammar set. We add modalities (image, audio), not vocabulary.

In short: X2 closes the vision and audio gap from `HIRING_PATH.md` without breaking the microscopic scope of §A13. English grammar is still the universe; what changes is that an icon or an audio clip can now be a valid input. The central question: how do two different modalities (image + text, or audio + text) come to align in a single representation space?
What this extension teaches
By the end of X2 you can:
- Derive patchify → ViT forward pass, including the einsum that turns a `(224, 224, 3)` image into 196 tokens of dim 768.
- Implement a 4-block ViT on top of the Phase 17 transformer block, train it on a synthetic icon-classification task, and reach > 80% top-1 in ≤ 5 min on CPU.
- Implement a CLIP-style contrastive trainer (symmetric InfoNCE) on (icon, verb-form-text) pairs and benchmark top-k retrieval.
- Explain Whisper's log-mel frontend numerically (16 kHz audio → 25 ms window / 10 ms hop → 80 mel bins × 3000 frames → conv-downsampled to 1500), load `whisper-tiny.en`, run inference, and inspect cross-attention and timestamp logits.
- Compare late fusion (CLIP), early fusion (Flamingo gated cross-attention), and unified-token fusion (Chameleon, Gemini), and articulate when each wins.
- Position LLaVA's projector trick (CLIP 1024-dim → LLM 4096-dim via a tiny MLP) on the cost/quality Pareto.
- Read an MM-Bench vs MMMU comparison and say what each score actually measures.
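
The patchify step named in the first goal can be sketched in NumPy. This is a minimal sketch, assuming a 16×16 patch size: the page gives only the `(224, 224, 3)` input and the 196 × 768 output, and 16×16 patches reproduce both numbers (14 · 14 = 196 patches of 16 · 16 · 3 = 768 values each). The names `patch_embed` and `W_patch`, and the choice of a 768-wide projection, are illustrative and not taken from the labs.

```python
import numpy as np

def patch_embed(image, W_patch):
    """Project non-overlapping patches of an (H, W, C) image to tokens.

    W_patch has shape (p, p, C, d): one linear map shared by every p x p patch.
    """
    p = W_patch.shape[0]
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image sides must be multiples of the patch size"
    gh, gw = H // p, W // p
    # Factor each spatial axis into (grid index, offset inside the patch).
    blocks = image.reshape(gh, p, gw, p, C)
    # Contract the in-patch offsets (p, q) and the channels (c) against the weights.
    tokens = np.einsum("hpwqc,pqcd->hwd", blocks, W_patch, optimize=True)
    return tokens.reshape(gh * gw, -1)  # row-major: token index = h * gw + w

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
W_patch = rng.normal(0.0, 0.02, size=(16, 16, 3, 768))  # hypothetical weights
tokens = patch_embed(image, W_patch)
print(tokens.shape)  # (196, 768)
```

A full ViT would still add position embeddings and a class token (or a pooling step) before the transformer blocks; those are left out here to keep the einsum in focus.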
Read order
Theory (read first, in order)
- `theory/00-motivation.md` — multimodality as a paradigm; modality gap; contrastive vs autoregressive.
- `theory/01-vision-transformers.md` — ViT (Dosovitskiy 2020), CLIP (Radford 2021), DINOv2 / SigLIP.
- `theory/02-audio-models.md` — log-mel intuition; Whisper; HuBERT / wav2vec 2.0; streaming caveats.
- `theory/03-fusion-strategies.md` — late / early / unified-token fusion; training-data scaling.
- `theory/04-llava-and-vision-language.md` — LLaVA recipe and eval landscape (MM-Bench, MME, MMMU).
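
To make the LLaVA projector from the learning goals concrete (CLIP 1024-dim → LLM 4096-dim via a tiny MLP), here is a rough NumPy sketch. The two-layer shape with a GELU in between, the hidden width of 4096, and the names `init_projector` and `project` are assumptions for illustration; the page only fixes the input and output widths.

```python
import math
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def init_projector(rng, d_vision=1024, d_hidden=4096, d_llm=4096):
    """Hypothetical initialiser; d_hidden is an assumption, not from the source."""
    f32 = np.float32
    return {
        "W1": 0.02 * rng.standard_normal((d_vision, d_hidden), dtype=f32),
        "b1": np.zeros(d_hidden, dtype=f32),
        "W2": 0.02 * rng.standard_normal((d_hidden, d_llm), dtype=f32),
        "b2": np.zeros(d_llm, dtype=f32),
    }

def project(vision_feats, params):
    """Map (n_patches, d_vision) vision features to (n_patches, d_llm) LLM inputs."""
    hidden = gelu(vision_feats @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
params = init_projector(rng)
vision_feats = rng.standard_normal((16, 1024), dtype=np.float32)  # 16 stand-in patch features
llm_tokens = project(vision_feats, params)
n_params = sum(a.size for a in params.values())
print(llm_tokens.shape, n_params)  # (16, 4096) 20979712
```

Under these assumptions the projector has roughly 21 M parameters, and it is the only new piece that has to be trained to connect the two pretrained models, which is the cost side of the Pareto argument.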
Lab (do after each theory cluster)
- `lab/00-tiny-vit-on-grammar-icons.md` — after theory 01.
- `lab/01-clip-style-grammar-image-pairing.md` — after theory 01.
- `lab/02-whisper-inference-walkthrough.md` — after theory 02.
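
The objective behind lab 01, symmetric InfoNCE over paired embeddings, together with a top-k retrieval metric, might look like the sketch below. It assumes row `i` of each matrix is a matching (icon, text) pair, uses a temperature of 0.07 as an illustrative default, and stands in random vectors for real encoder outputs. The function names are hypothetical and not taken from the lab.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_info_nce(img_emb, txt_emb, temperature=0.07):
    """Mean of the image->text and text->image cross-entropies.

    Row i of img_emb and row i of txt_emb are assumed to be a matching pair.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)

def recall_at_k(img_emb, txt_emb, k):
    """Fraction of images whose paired text is among the k most similar texts."""
    sims = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float((topk == np.arange(sims.shape[0])[:, None]).any(axis=1).mean())

rng = np.random.default_rng(0)
txt = rng.normal(size=(8, 32))
img_aligned = txt + 0.01 * rng.normal(size=(8, 32))  # nearly identical to its text
img_random = rng.normal(size=(8, 32))                # unrelated to any text

loss_aligned = symmetric_info_nce(img_aligned, txt)
loss_random = symmetric_info_nce(img_random, txt)
print(loss_aligned < loss_random, recall_at_k(img_aligned, txt, k=1))
```

The two cross-entropy terms share one similarity matrix and differ only in the softmax axis, which is what makes the loss symmetric across the two modalities.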
Cross-links
- Phase 15 — Attention. Cross-attention is the structural device that makes Flamingo and Whisper work; CLIP, by contrast, keeps two separate encoders and aligns them only through its contrastive loss. If §A13's grammar example showed you why attention exists, X2 shows you what it does when the keys come from a different modality than the queries. (`docs/phase-15-attention/`)
- Phase 17 — mini-GPT. The ViT in lab 00 is literally the Phase 17 transformer block, swapped from causal-LM to bidirectional, with a patchify embedding instead of token embedding. The reuse is the lesson: a transformer block is modality-agnostic. (`docs/phase-17-mini-gpt/`)
- Phase 26 — Quantization. ViT-base and Whisper-tiny are both prime quantization candidates (most of the compute is matmul, most of the bandwidth is weights). Inference-on-CPU is gated on int8 / int4 paths from Phase 26.
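
Two of the ideas above fit in one small function: attention whose keys come from a different sequence than its queries, and the switch from a causal to a bidirectional mask. The following is a single-head sketch under assumed toy sizes; the names (`attention`, `query_src`, `key_src`) are hypothetical and this is not the Phase 15 or Phase 17 code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_src, key_src, Wq, Wk, Wv, causal=False):
    """Single-head attention.

    Self-attention when query_src and key_src are the same sequence,
    cross-attention when they come from different sequences.
    """
    Q, K, V = query_src @ Wq, key_src @ Wk, key_src @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Hide future positions (only meaningful for self-attention).
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(0.0, d ** -0.5, size=(d, d)) for _ in range(3))
text = rng.normal(size=(5, d))    # stand-in for 5 text tokens
audio = rng.normal(size=(12, d))  # stand-in for 12 encoded audio frames

_, w_cross = attention(text, audio, Wq, Wk, Wv)               # queries: text, keys: audio
_, w_causal = attention(text, text, Wq, Wk, Wv, causal=True)  # causal-LM style
_, w_bidir = attention(text, text, Wq, Wk, Wv)                # ViT style
print(w_cross.shape)  # (5, 12)
```

Each row of `w_cross` is a distribution over audio frames for one text token. Dropping `causal=True` is the only change needed to move the self-attention case from the causal-LM setting to the bidirectional one.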
What this extension intentionally does NOT cover
- Pretraining a ViT or CLIP from scratch on real ImageNet / LAION. We train on the synthetic grammar-icon dataset — 1000 images, generated by a Pillow script, CPU-trainable in 5 minutes. The mechanics are the goal; the data scale is not.
- Training Whisper or HuBERT. Lab 02 is inference only on a pretrained checkpoint. Audio pretraining at the data scale Whisper used (680 k hours) is out of scope for any non-frontier-lab setup.
- Video models (V-JEPA, VideoMAE, Sora-style diffusion-transformers). One modality at a time; video is a follow-up extension (X3 if it ever exists).
- Robotics / embodied multimodal (RT-2, OpenVLA, π0). Out of scope; flagged as a separate hiring-path branch in `HIRING_PATH.md`.
- Diffusion models for image generation. Different paradigm (score matching, not next-token / contrastive). Belongs to a hypothetical X1 generative-models extension.
- Multimodal RLHF / RLAIF (e.g. Llama-3-Vision instruction tuning). Phase 31 covers single-modality RLHF; multimodal extension deferred.
Build-before-abstract policy for X2
The core curriculum's rule is: no PyTorch before Phase 25, no transformers before Phase 24 (CLAUDE.md §0.4).
For X2:
- Lab 00 (ViT) — NumPy + Phase 17's NumPy transformer block. No framework. Same rule as the core curriculum.
- Lab 01 (CLIP-style) — NumPy + Phase 17 block. No framework. Same rule.
- Lab 02 (Whisper) — uses `transformers.WhisperForConditionalGeneration`. This is the exception. Justification: (a) lab 02 is inference-only — no training to demystify; (b) writing log-mel + conv stem + 4-layer encoder-decoder from scratch is a ~3-week project that adds zero new conceptual insight beyond what `theory/02-audio-models.md` already derives; (c) the extension track is explicitly post-Phase-24 territory in spirit. The exception is local to lab 02. If you find yourself reaching for `transformers` in lab 00 or 01, stop — the labs are designed to be doable in NumPy.
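
Even with the framework doing the work in lab 02, the frontend shapes are easy to check by hand. The arithmetic below reproduces the chain quoted in the learning goals (16 kHz → 25 ms / 10 ms → 80-mel × 3000 frames → 1500). The 30-second chunk length and the convolution kernel, stride, and padding values are assumptions drawn from Whisper's published architecture and are not stated on this page. Frames are counted as one per hop, which glosses over STFT edge handling.

```python
SAMPLE_RATE = 16_000        # Hz
WINDOW_MS, HOP_MS = 25, 10  # analysis window and hop
N_MELS = 80
CHUNK_SECONDS = 30          # assumption: implied by 3000 frames at a 10 ms hop

n_samples = SAMPLE_RATE * CHUNK_SECONDS        # 480000 samples per chunk
win_length = SAMPLE_RATE * WINDOW_MS // 1000   # 400 samples per window
hop_length = SAMPLE_RATE * HOP_MS // 1000      # 160 samples per hop
n_frames = n_samples // hop_length             # 3000 mel frames

def conv1d_out_len(n, kernel, stride, padding):
    """Output length of a 1-D convolution over n input positions."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumed conv stem: kernel 3 / stride 1, then kernel 3 / stride 2, both with padding 1.
after_conv1 = conv1d_out_len(n_frames, kernel=3, stride=1, padding=1)
encoder_positions = conv1d_out_len(after_conv1, kernel=3, stride=2, padding=1)

print((N_MELS, n_frames), encoder_positions)  # (80, 3000) 1500
```

So each encoder position covers 20 ms of audio under these assumptions, a useful number to keep in mind when reading cross-attention maps or timestamp logits.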
Definition of Done (extension-track DoD)
Extension tracks do not have a PHASE_NN_REPORT.md (they're outside the 40-phase ritual). They produce:
- `experiments/X2-multimodal/lab-{00,01,02}/` directories with manifests + reproducible outputs.
- A short `reflections.md` in `learners/borja/extension-X2/` covering: which fusion strategy I'd pick for which use case; what surprised me about Whisper's cross-attention pattern; how I'd extend the grammar tutor (Phase 32) to accept image input.
- Quiz `/quiz X2` passed at ≥ 80%.
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 An Image is Worth 16x16 Words (ViT) — Dosovitskiy et al. · 2020. The transformer applied to image patches.
- 📄 Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al. · 2021. Contrastive image-text alignment.
- 📄 Visual Instruction Tuning (LLaVA) — Liu et al. · 2023. Wiring a vision encoder into an LLM.