00 — Motivation: multimodality as a separate paradigm

Multimodality is not "concatenate image tokens and text tokens and train a transformer on top". The central problem is that each modality lives in a different representation space (the modality gap), and forcing them into the same space costs data, parameters, and careful design. This section explains why.

This file answers: why is multi-modal modeling a distinct paradigm, and not just "tokens with extra steps"?


The naive view (and why it's wrong)

A first reading of the CLIP or Flamingo papers suggests: "we tokenize images into patch-tokens, audio into frame-tokens, text into BPE tokens, dump them all into a transformer, and train next-token prediction. Done."

The naive view is mechanically correct (you can in fact do this — Chameleon and Gemini 1.5 essentially do, see theory/03-fusion-strategies.md) but practically misses the central engineering problem:

Different modalities have wildly different statistical structure, information density, and scale. Putting them into the same model without thinking about these mismatches produces a model that ignores one modality.

Concretely:

  • Information density. One 224×224 image, naively patchified at 16×16, is (224/16)² = 196 tokens. One spoken English sentence of 5 s, sampled at 16 kHz and framed with a 10 ms hop, is 5 × 100 = 500 mel-frames before downsampling. The text caption "a cat sits on a mat" is about 8 BPE tokens. That is nearly two orders of magnitude of variation in tokens-per-meaning across modalities for the same semantic content.
  • Statistical structure. Text is discrete, low-entropy at the token level (most tokens are highly predictable from prior context), and follows a Zipfian frequency distribution. Image patches are continuous-valued (or quantized to an 8k-entry codebook in some schemes), high-entropy at the patch level, and spatially structured (adjacent patches are correlated). Audio frames are continuous, periodic in spectral structure, and have a strong temporal smoothness prior. A single model architecture cannot naively be optimal for all three.
  • Scale of pretraining. Text LLMs train on ~10 T tokens. CLIP trained on 400 M image-text pairs (≈ 80 B "tokens" if you count patch+text). Whisper trained on 680 k hours of audio (≈ 245 B mel-frames at 100 frames per second, before downsampling). The data volumes differ by orders of magnitude, and paired data (captioned images, transcribed audio) is far harder to obtain than raw text.
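
The counts in this list follow from simple arithmetic. A minimal sketch, assuming non-overlapping square patches and a 10 ms hop (100 mel-frames per second); the function names are illustrative and not taken from any library:

```python
# Back-of-the-envelope token counts per modality.
# Assumptions: non-overlapping square patches, 100 mel-frames per second.

def image_patch_tokens(side_px: int, patch_px: int) -> int:
    """Number of non-overlapping patches in a square image."""
    if side_px % patch_px != 0:
        raise ValueError("image side must be a multiple of the patch size")
    per_side = side_px // patch_px
    return per_side * per_side


def mel_frames(seconds: float, frames_per_second: int = 100) -> int:
    """Number of spectrogram frames before any downsampling."""
    return round(seconds * frames_per_second)


image_tokens = image_patch_tokens(224, 16)    # 14 patches per side
audio_frames = mel_frames(5.0)                # 5 s of speech
corpus_frames = mel_frames(680_000 * 3600)    # 680 k hours of audio, in seconds

print(image_tokens, audio_frames, corpus_frames)
```

Changing the patch size or the hop length shifts these numbers by small integer factors, which is why the mismatch between modalities is a design decision as much as a property of the data.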

The modality gap

A specific, measurable phenomenon. CLIP-style contrastive training learns to map paired (image, text) into the same embedding space. After training, if you look at the embeddings of all images vs. all texts, you find two disjoint clusters.

Specifically:

  • The cosine distance between the centroid of all image embeddings and the centroid of all text embeddings is ~0.5 (out of a maximum of 2 for opposite directions).
  • A linear classifier trained on the union of (image embedding, text embedding) → "is this image or text?" achieves > 99% accuracy.

This is the modality gap (Liang et al. 2022, "Mind the Gap"). It's surprising because the contrastive loss explicitly pushes paired image and text together — yet the modalities end up in separate cones of the unit sphere.
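
Both measurements can be reproduced on any set of embeddings. The sketch below is an illustration on synthetic data, not the procedure from the paper: it builds two artificial "cones" of unit vectors (hypothetical stand-ins for real image and text embeddings), computes the cosine distance between their centroids, and uses a nearest-centroid rule as a deliberately simple linear classifier. Plain Python is used so it runs without dependencies.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_gap(image_embs, text_embs):
    """Cosine distance (1 - cosine similarity) between the two centroids."""
    a, b = normalize(centroid(image_embs)), normalize(centroid(text_embs))
    return 1.0 - sum(x * y for x, y in zip(a, b))


def modality_probe_accuracy(image_embs, text_embs):
    """Linear probe: project onto the centroid difference, threshold at the midpoint."""
    c_img, c_txt = centroid(image_embs), centroid(text_embs)
    w = [a - b for a, b in zip(c_img, c_txt)]
    bias = sum(wi * (a + b) / 2 for wi, a, b in zip(w, c_img, c_txt))

    def score(v):
        return sum(wi * vi for wi, vi in zip(w, v)) - bias

    correct = sum(score(v) > 0 for v in image_embs)
    correct += sum(score(v) <= 0 for v in text_embs)
    return correct / (len(image_embs) + len(text_embs))


def synthetic_cone(mean, n, noise, rng):
    """Unit vectors scattered around a mean direction (toy data, not real embeddings)."""
    return [normalize([m + rng.gauss(0.0, noise) for m in mean]) for _ in range(n)]


rng = random.Random(0)
dim = 16
image_mean = [1.0] + [0.0] * (dim - 1)
text_mean = [0.5, math.sqrt(0.75)] + [0.0] * (dim - 2)  # 60 degrees from image_mean
images = synthetic_cone(image_mean, 200, 0.05, rng)
texts = synthetic_cone(text_mean, 200, 0.05, rng)

gap = centroid_gap(images, texts)
accuracy = modality_probe_accuracy(images, texts)
print(f"centroid cosine distance: {gap:.3f}, probe accuracy: {accuracy:.3f}")
```

With real encoders you would replace `images` and `texts` with the L2-normalized outputs of the two towers on a held-out set.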

Why it happens: initialization. The image encoder and text encoder are initialized independently, with different architectures (ViT for image, transformer for text). Their initial outputs land in different regions of \(\mathbb{R}^d\). The contrastive loss pulls paired points together, but the gradient signal isn't strong enough to fully erase the gap — there's an energy barrier between "shrink the gap globally" (which would require restructuring the whole representation space) and "make this specific pair close enough to be top-1" (a local fix).

Why it matters in practice:

  • Zero-shot classification ("which of these 1000 class-label texts is closest to this image?") works because the relative ordering of image-to-text similarities is still informative: the image is closer to the right class-text than to a random class-text. But the absolute cross-modal scores are compressed by the gap and aren't comparable with within-modality scores.
  • Image-to-image and text-to-text retrieval inside the same model often outperform image-to-text retrieval. The model is implicitly running two separate retrieval systems that happen to be mildly aligned.
  • Fusion architectures that bridge the gap explicitly (LLaVA's projector MLP, Flamingo's gated cross-attention) often outperform pure-contrastive at downstream tasks — they get to use the same parameters across modalities, instead of bolting two encoders together.
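
The zero-shot point can be made concrete with a toy example. This is a sketch under stated assumptions: the three-dimensional vectors are invented for illustration (real CLIP embeddings have hundreds of dimensions), and the shared third coordinate on the text side plays the role of the gap. The ranking picks the right label even though the winning cosine similarity is close to zero.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose text embedding is most similar to the image, plus all scores."""
    scores = {
        label: cosine_similarity(image_emb, text_emb)
        for label, text_emb in class_text_embs.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical embeddings: every text vector carries the same large offset
# along the last axis, which mimics a modality gap.
image_emb = [1.0, 0.0, 0.2]
class_text_embs = {
    "cat": [0.3, 0.0, -1.0],
    "dog": [0.0, 0.3, -1.0],
    "car": [-0.3, 0.0, -1.0],
}

best_label, scores = zero_shot_classify(image_emb, class_text_embs)
print(best_label, {k: round(v, 3) for k, v in scores.items()})
```

Only the ordering of `scores` is used; thresholding the raw values would need calibration for each model.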

Two paradigms: contrastive vs autoregressive multimodal

There are two large families of multimodal models. The distinction is which loss they minimize.

Contrastive multimodal (CLIP, SigLIP, ALIGN, BLIP-1)

  • Architecture: two encoders (image, text). No decoder. Optionally a projection head per modality.
  • Loss: contrastive — pull paired (image, text) embeddings together; push non-paired apart. Concretely: symmetric InfoNCE (theory/01-vision-transformers.md §"CLIP loss").
  • Output: an embedding per modality. To do anything downstream (classify, retrieve, caption), you need a separate head.
  • Strength: fast inference, great for retrieval and zero-shot classification, no generation cost.
  • Weakness: cannot generate text describing an image. Cannot answer questions about an image. The model "knows" the image-text relationship implicitly, but cannot articulate it.
  • Compute: the largest CLIP ViT (ViT-L/14) trained on 400 M pairs for about 12 days on 256 V100s. Fully data-scaling-bound.
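
The symmetric InfoNCE loss named in the list above can be sketched in a few lines. This is a minimal dependency-free illustration, assuming L2-normalized embeddings where row i of each list is a matched pair; the fixed temperature of 0.07 is an assumption for the example (in CLIP it is a learned parameter), and a real implementation would use a tensor library with batched matrix products.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    image_embs[i] and text_embs[i] are assumed to be a matched, L2-normalized pair.
    """
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image -> text: each row should concentrate its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same objective over the columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(aligned_images, aligned_texts)
loss_swapped = symmetric_info_nce(aligned_images, swapped_texts)
print(loss_aligned, loss_swapped)
```

The loss only asks that the matched pair win within its row and column. It places no constraint on where the two modalities sit globally.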

Autoregressive multimodal (Flamingo, LLaVA, GPT-4V, Gemini, Chameleon)

  • Architecture: a base LLM (decoder transformer) + an image encoder + a fusion mechanism that injects image information into the LLM's residual stream. Optionally an image decoder for image generation.
  • Loss: standard next-token cross-entropy on text, with image tokens (or projected image features) prepended/interleaved into the context.
  • Output: generated text, conditioned on image input.
  • Strength: can do anything you can describe in text. Captioning, VQA, OCR, chart-reading, multi-turn vision-language dialogue.
  • Weakness: generation is expensive; the autoregressive bottleneck applies. The image encoder is often frozen during fusion training (LLaVA), which means the image representation is whatever a contrastive model already produced.
  • Compute: LLaVA-1.5 trained on about 1.2 M samples (caption pairs for projector pretraining plus instruction data) in about one day on a single 8×A100 node. Cheap if you reuse a pretrained encoder + LLM.
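
The fusion step and the loss described in the list above can be sketched as follows. This is a simplified illustration, not LLaVA's code: it uses a single linear projection where LLaVA-1.5 uses a small MLP, the function names are hypothetical, and `IGNORE_INDEX = -100` mirrors the default `ignore_index` of PyTorch's cross-entropy loss so that image positions contribute nothing to the text loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def project(features, weight):
    """Linear map: (n_patches, d_vision) x (d_vision, d_llm) -> (n_patches, d_llm)."""
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
        for row in features
    ]


def build_multimodal_sequence(image_features, weight, text_embeddings, text_token_ids):
    """Prepend projected image features to the text embeddings and mask their labels."""
    image_tokens = project(image_features, weight)
    inputs = image_tokens + text_embeddings
    labels = [IGNORE_INDEX] * len(image_tokens) + list(text_token_ids)
    return inputs, labels


# Toy shapes: 2 patches with d_vision = 2, projected to d_llm = 3.
image_features = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
text_embeddings = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
text_token_ids = [11, 12, 13]

inputs, labels = build_multimodal_sequence(
    image_features, weight, text_embeddings, text_token_ids
)
print(len(inputs), labels)
```

In a real training loop the labels are also shifted by one position for next-token prediction; most frameworks handle that shift internally.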

Hybrid (Chameleon, Gemini, GPT-4o)

Discretize images into VQ-VAE-style tokens, drop them into the same vocabulary as text, and train next-token prediction across the unified token stream. This unifies the two paradigms (same loss, same architecture across modalities) but requires extremely large compute: Gemini 1.5 pretraining is estimated to be on the order of \(10^{25}\) FLOPs. The training-data scaling for unified models is covered in theory/03-fusion-strategies.md §"unified token fusion".
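
Sharing one vocabulary amounts to offsetting the image codebook indices past the text token ids. A minimal sketch, assuming a hypothetical 32,000-entry text vocabulary, an 8,192-entry image codebook, and two invented sentinel tokens that mark image boundaries; none of these values are taken from a specific model:

```python
TEXT_VOCAB_SIZE = 32_000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192   # hypothetical VQ codebook size
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image sentinel
EOI = BOI + 1                                # end-of-image sentinel


def image_codes_to_token_ids(codes):
    """Shift codebook indices into the shared id space and wrap them in sentinels."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"code {code} is outside the codebook")
    return [BOI] + [TEXT_VOCAB_SIZE + code for code in codes] + [EOI]


def build_unified_stream(segments):
    """Flatten ("text", ids) and ("image", codes) segments into one token stream."""
    stream = []
    for kind, values in segments:
        if kind == "text":
            stream.extend(values)
        elif kind == "image":
            stream.extend(image_codes_to_token_ids(values))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return stream


stream = build_unified_stream([
    ("text", [5, 6]),
    ("image", [0, 8191]),
    ("text", [7]),
])
print(stream)
```

Once the stream is built, the model sees a single sequence of integer ids and the usual next-token cross-entropy applies to text and image positions alike.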


What the grammar tutor (Phase 32) becomes with vision

Phase 32 is the capstone agent: read an English sentence, propose conjugation corrections. With X2, the natural extension is:

Multi-modal grammar tutor. The user uploads:

  • A photo of a textbook page with a sentence, OR
  • A 5-second audio clip of someone saying a sentence,

and the tutor (a) extracts the sentence (OCR via vision or ASR via Whisper), (b) parses the verb forms, and (c) flags errors and proposes corrections (the existing Phase 32 pipeline).
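
The routing this implies is small. The sketch below is purely hypothetical: `TutorInput`, `run_tutor`, and the three injected callables are invented names for illustration, and the stubs stand in for a real OCR model, a real ASR model, and the Phase 32 pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TutorInput:
    kind: str       # "image" or "audio"
    payload: bytes  # raw file contents


def run_tutor(
    item: TutorInput,
    ocr: Callable[[bytes], str],
    asr: Callable[[bytes], str],
    check_grammar: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Extract a sentence from the input, then hand it to the text-only checker."""
    if item.kind == "image":
        sentence = ocr(item.payload)
    elif item.kind == "audio":
        sentence = asr(item.payload)
    else:
        raise ValueError(f"unsupported input kind: {item.kind}")
    return sentence, check_grammar(sentence)


# Stub components, for illustration only.
def fake_ocr(_: bytes) -> str:
    return "She go to school."


def fake_asr(_: bytes) -> str:
    return "He goes to school."


def fake_checker(sentence: str) -> List[str]:
    return ["go -> goes"] if " go " in sentence else []


sentence, corrections = run_tutor(TutorInput("image", b""), fake_ocr, fake_asr, fake_checker)
print(sentence, corrections)
```

Only the first step is modality-specific; everything after extraction is the existing text pipeline.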

This is the use case X2 is preparing for. We don't build this capstone in X2 (that would belong to a hypothetical "X4 — Multi-modal capstone"), but the labs verify each component:

  • Lab 00 (ViT) — verifies you can classify tense from a visual icon. A small step toward OCR.
  • Lab 01 (CLIP-style) — verifies image ↔ text alignment on the grammar domain. The retrieval mechanism a tutor would use to match a photographed sentence against its known grammar templates.
  • Lab 02 (Whisper) — verifies you can transcribe spoken verb-form audio into text. The ASR front-end of the audio path.

When to use multimodality (and when not to)

Multimodality adds two costs: engineering complexity (you maintain two pretraining pipelines, two evaluators, often two teams) and inference latency (image encoders typically take 50–500 ms on CPU, audio encoders 1–5 s for 30 s of audio).

Use multi-modal when:

  • The information is genuinely non-text (a photo of a broken bone, a recording of an engine knock, a 3-D point cloud of a fab line). Anything you can OCR or ASR into text is best handled by an OCR/ASR → text-LLM pipeline.
  • The cross-modal grounding is part of the answer (visual question answering, image-conditioned dialogue).
  • You need real-time interaction with a physical scene (robotics, augmented-reality assistants).

Use text-only when:

  • The input was originally text (PDFs, code, transcripts). Convert and don't re-introduce modality.
  • Latency or cost matters more than the marginal % gained from images.
  • You don't have multi-modal training data and your task is niche enough that pretrained CLIP / LLaVA won't transfer.

What's next

Read theory/01-vision-transformers.md for the visual side, theory/02-audio-models.md for the audio side. Then 03-fusion-strategies.md and 04-llava-and-vision-language.md cover how the two modalities are stitched into a single model.