03 - Fusion Strategies: Late, Early, Unified-Token

Fusing modalities is the central problem: how to combine vision, audio, and text signals inside the same model. There are three families: late fusion (CLIP: two encoders aligned by a contrastive loss), early fusion (Flamingo: cross-attention with learnable gates), and unified-token fusion (Chameleon, Gemini: everything is a token in the same vocabulary). This section compares the three on architecture, data, and when to choose which.

This file answers: once we have a vision encoder and a text encoder, what are the ways to combine them into a single model, and what are the trade-offs?

References:
- Radford et al., CLIP (late fusion), 2021. (arXiv:2103.00020)
- Alayrac et al., Flamingo: a Visual Language Model for Few-Shot Learning (early fusion / gated cross-attn), DeepMind, 2022. (arXiv:2204.14198)
- Chameleon Team, Chameleon: Mixed-Modal Early-Fusion Foundation Models (unified-token), Meta, 2024. (arXiv:2405.09818)
- Gemini Team, Gemini: A Family of Highly Capable Multimodal Models, Google, 2023. (arXiv:2312.11805)

The taxonomy

Three families, distinguished by where in the model the modalities are combined:

Family	Where modalities meet	Example	Training data	Inference cost
Late fusion	At the output, via aligned embedding spaces	CLIP, SigLIP	100M+ (image, text) pairs	cheap (encode each modality once)
Early fusion (cross-attn or projection)	Inside the LLM, via cross-attention to a vision encoder's outputs (Flamingo) or projected image tokens in the input sequence (LLaVA)	Flamingo, LLaVA	pretrained encoder + LLM, then from ~1M instruction samples (LLaVA) up to billions of web image-text pairs (Flamingo)	medium (extra cross-attn layers, or hundreds of image tokens in context)
Unified-token fusion	At the input vocabulary	Chameleon, Gemini, AnyGPT, GPT-4o	1B+ multi-modal tokens (trillions for Chameleon), end-to-end	high (image tokens take many text-token slots in context)

Below: each family in detail.

Late fusion (CLIP-style)

Architecture: two independent encoders. After training, you have two functions $f_I: \text{image} \to \mathbb{R}^d$ and $f_T: \text{text} \to \mathbb{R}^d$, with a learned alignment such that paired (image, text) embeddings are close in cosine.

Training: symmetric InfoNCE on (image, text) pairs (see 01-vision-transformers.md).
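As a rough illustration of the training objective, here is a minimal NumPy sketch of a symmetric InfoNCE loss. The function names, the fixed temperature of 0.07, and the random stand-in embeddings are assumptions for the example; a real CLIP-style setup would take the embeddings from $f_I$ and $f_T$ and typically learn the temperature.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, d); row i of each is a matched pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (batch, batch) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same, but normalizing over each column.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Stand-in embeddings (assumption): aligned pairs vs. unrelated pairs.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))
aligned_images = text + 0.01 * rng.normal(size=(8, 32))
random_images = rng.normal(size=(8, 32))

loss_aligned = symmetric_info_nce(aligned_images, text)
loss_random = symmetric_info_nce(random_images, text)
print(f"aligned: {loss_aligned:.4f}  random: {loss_random:.4f}")
```

Every other item in the batch acts as a negative for a given pair, which is why the loss is low only when matched pairs are closer to each other than to the rest of the batch.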

Inference: to compare an image and a text, compute $f_I(\text{image})$ and $f_T(\text{text})$ once each, then take cosine similarity. Encoding cost is paid once per item: cache the embeddings, and each additional comparison is a single $d$-dimensional dot product.

What it's good for:
- Image retrieval ("find me images matching this caption").
- Zero-shot classification.
- As a feature extractor for downstream models (LLaVA freezes CLIP's image encoder; the projector MLP learns to map CLIP features into the LLM's residual stream).

What it cannot do:
- Generation. No decoder. You cannot ask CLIP "describe this image".
- Fine-grained reasoning. A single embedding per modality loses spatial detail. CLIP knows there's "a dog" in the image but cannot answer "how many dogs?" reliably.
- Multi-turn dialogue. No memory mechanism beyond the single forward pass.

Why CLIP is still important: an image encoder pretrained with a contrastive loss on web-scale data is a strong initialization for downstream multi-modal LLMs. As of 2026, many open-source vision-language models start from a CLIP-ViT or SigLIP-ViT image backbone, either frozen or lightly fine-tuned.

Early fusion (cross-attention, Flamingo style)

The defining architecture of "vision-language model" (VLM) circa 2022–2024.

Flamingo's structure

Start with a pretrained, frozen LLM (Chinchilla-70B in the paper) and a pretrained, frozen vision encoder (NFNet). Add new trainable layers that inject vision information into the LLM:

[image] → Vision encoder (frozen) → patch features (N_p × d_v)
                                       ↓
                            Perceiver Resampler (learnable, 64 outputs)
                                       ↓
                            64 fixed-size vision tokens
                                       │
                                       ├── injected as KV at each LLM layer via:
                                       ↓
[text tokens] → LLM layer with GATED CROSS-ATTENTION inserted before self-attention
                ↓
                ...repeat...
                ↓
                next-token logits

The two new components:

- Perceiver Resampler. A small transformer that takes a variable number of patch features ($N_p \approx 200$, depending on the vision encoder) and produces a fixed set of 64 latent tokens. Why: the LLM is expensive per token, so you don't want hundreds of image tokens per image. 64 is enough for Flamingo's task suite.
- Gated cross-attention layer. Inserted between the frozen LLM transformer blocks (at every block in the smallest Flamingo, every few blocks in the larger ones). Reads from the 64 vision tokens, writes into the text residual stream.
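To make the "variable in, fixed out" property of the resampler concrete, here is a heavily simplified sketch: a single single-head cross-attention step in which 64 learned latents query the patch features. The function name, weight shapes, and toy width are assumptions for illustration; the actual Perceiver Resampler stacks several such layers with feed-forward blocks and is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(patch_features, latents, w_q, w_k, w_v):
    """One cross-attention step: latents (64, d) query patch features (N_p, d)."""
    q = latents @ w_q                    # (64, d)
    k = patch_features @ w_k             # (N_p, d)
    v = patch_features @ w_v             # (N_p, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (64, N_p), rows sum to 1
    return latents + attn @ v            # (64, d): output size ignores N_p

rng = np.random.default_rng(0)
d = 16                                   # toy width (assumption)
latents = rng.normal(size=(64, d))       # learned parameters in a real model
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

for n_patches in (196, 257):
    patches = rng.normal(size=(n_patches, d))
    out = resample(patches, latents, w_q, w_k, w_v)
    print(n_patches, "->", out.shape)
```

Whatever the number of patches, the LLM only ever sees 64 vision tokens, which is what keeps the per-image cost on the LLM side constant.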

The gating trick

Each cross-attention output is multiplied by tanh(α), where α is a learnable scalar initialized to 0. So at initialization, tanh(0) = 0 → the cross-attention output is 0 → the model behaves identically to the frozen LLM.

As training progresses, α learns to be nonzero where the cross-attention is useful. This is the "gated" in "gated cross-attention".
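The effect of the zero-initialized gate can be checked numerically. The sketch below is a minimal single-head version with random stand-in weights; the function names and toy dimensions are assumptions, and it omits the gated feed-forward sublayer and multi-head details of a full layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vision_tokens, w_q, w_k, w_v):
    """Text positions (T, d) attend over vision tokens (64, d)."""
    q = text_h @ w_q
    k = vision_tokens @ w_k
    v = vision_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, 64)
    return attn @ v                                   # (T, d)

def gated_cross_attention(text_h, vision_tokens, params, alpha):
    # Residual update scaled by tanh(alpha); alpha is a learnable scalar.
    return text_h + np.tanh(alpha) * cross_attention(text_h, vision_tokens, *params)

rng = np.random.default_rng(0)
d = 16                                                # toy width (assumption)
text_h = rng.normal(size=(5, d))                      # residual stream for 5 text tokens
vision_tokens = rng.normal(size=(64, d))              # resampler output
params = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]

out_init = gated_cross_attention(text_h, vision_tokens, params, alpha=0.0)
out_trained = gated_cross_attention(text_h, vision_tokens, params, alpha=0.5)

print(np.allclose(out_init, text_h))      # gate closed: text stream unchanged
print(np.allclose(out_trained, text_h))   # gate open: vision information mixed in
```

With `alpha=0.0` the layer is an exact identity on the text residual stream, so inserting it into a frozen LLM changes nothing until training moves the gate.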

Why this matters:

- Preserves base LLM capabilities. If the vision branch is poorly trained or absent at start, the model doesn't regress on text-only tasks. The frozen LLM is the floor.
- Cheap to train. Only the resampler + gated cross-attention layers are trainable. For Flamingo-80B, this is ~10B trainable params on top of the 70B frozen LLM + 0.4B frozen vision encoder.
- Recovers gracefully. If the cross-attention learns nothing useful, the gate stays near zero and the model is just an LLM. The failure mode is "no vision capability" rather than "broken model".

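This gating trick is one of the most reusable mechanical ideas in multi-modal model design: a zero-initialized gate lets you bolt new layers onto a pretrained model without disturbing it at step 0. Open reproductions of Flamingo such as OpenFlamingo and IDEFICS keep it.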

Mechanically, why early fusion beats late fusion for VQA

Late fusion (CLIP) gives you one image embedding. To answer "how many dogs?", that embedding has to encode the count. It probably doesn't, because contrastive loss doesn't preferentially preserve count over color/pose/breed.

Early fusion gives the LLM access to all 64 vision tokens via cross-attention, and the attention weights are recomputed for every text token being generated. The model can therefore look at different parts of the image while producing different parts of the answer, instead of relying on one pooled vector to have encoded the count in advance. The capacity is qualitatively higher.

LLaVA: cheap early fusion

LLaVA (theory/04-llava-and-vision-language.md) is early fusion's cheaper cousin:

- No gated cross-attention.
- Instead: image features are projected (small MLP) into the LLM's embedding dim, and prepended to the text input as if they were extra tokens.
- The LLM is fine-tuned (or LoRA-adapted) on instruction-tuning data.

Costs far less to train than Flamingo (LLaVA-1.5 reports about one day on a single 8×A100 node). The trade-off is that there is no dedicated vision pathway: image tokens occupy ordinary context positions (576 per image for LLaVA-1.5) and are processed by the same self-attention layers as text.
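The projection-and-prepend step can be sketched in a few lines. This is an illustrative toy, not LLaVA's code: the function name, the two-layer GELU MLP shape, the toy widths, and the count of 576 patches (what a ViT with 14-pixel patches yields at 336×336 input) are all assumptions chosen for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project_image_features(features, w1, b1, w2, b2):
    """Two-layer MLP projector: (n_patches, d_vision) -> (n_patches, d_llm)."""
    return gelu(features @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n_patches, d_vision, d_llm = 576, 32, 64   # toy widths (assumption)

image_features = rng.normal(size=(n_patches, d_vision))  # frozen vision encoder output
text_embeddings = rng.normal(size=(7, d_llm))            # embedded prompt tokens

w1 = rng.normal(size=(d_vision, d_llm)) * 0.02
b1 = np.zeros(d_llm)
w2 = rng.normal(size=(d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

image_tokens = project_image_features(image_features, w1, b1, w2, b2)

# Prepend: from here on the LLM treats image tokens like any other positions.
llm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (583, 64)
```

The only new parameters are the projector weights; everything after the concatenation is the unmodified LLM forward pass.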

Unified-token fusion

The "everything is a token" view: don't have a separate image encoder; instead, discretize the image into tokens (via VQ-VAE), put those tokens in the same vocabulary as text tokens, and train a single transformer with next-token prediction.

Chameleon (Meta, 2024)

The cleanest published example.

- Vocabulary: a single BPE vocabulary of 65,536 tokens, which already includes the 8192 image codes from the VQ-VAE codebook.
- Image tokenizer: a VQ-VAE that encodes a $512 \times 512$ image into a $32 \times 32 = 1024$ grid of discrete codes. Each code is an integer in $[0, 8191]$, mapped to a learnable embedding exactly as text tokens are.
- Training: next-token prediction on interleaved text + image documents. A sample might be: <text> <image_tokens × 1024> <text> <image_tokens × 1024>.
- Result: a single decoder transformer that can generate both text and images, answer questions about images, and reason across modalities.
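The tokenization step can be illustrated with a toy quantizer: each of the 1024 grid positions is mapped to its nearest codebook entry, and the resulting codes are shifted into a shared id space and spliced between text ids. The random codebook, the `IMAGE_ID_OFFSET` value, and the text ids below are hypothetical; Chameleon's actual id layout and its special image-boundary tokens are not reproduced here.

```python
import numpy as np

CODEBOOK_SIZE = 8192       # codebook size from the text
GRID = 32                  # 32 x 32 = 1024 codes per image
IMAGE_ID_OFFSET = 50_000   # hypothetical start of image codes in the shared id space

def quantize(latents, codebook):
    """Nearest-codebook-entry lookup: (n, d) latents -> (n,) integer codes."""
    # Squared distances via ||a||^2 - 2 a.b + ||b||^2, shape (n, CODEBOOK_SIZE).
    d2 = ((latents**2).sum(axis=1, keepdims=True)
          - 2 * latents @ codebook.T
          + (codebook**2).sum(axis=1))
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, 8)).astype(np.float32)  # stand-in for a trained codebook
latents = rng.normal(size=(GRID * GRID, 8)).astype(np.float32)     # stand-in for encoder output

codes = quantize(latents, codebook)      # integers in [0, 8191]
image_ids = codes + IMAGE_ID_OFFSET      # shifted into the shared vocabulary

text_before = [101, 2057, 318]           # hypothetical text token ids
text_after = [995, 13]
sequence = text_before + image_ids.tolist() + text_after

print(len(sequence))                     # 3 + 1024 + 2 = 1029
```

Once the image is a run of integers in the same id space as text, the transformer needs no modality-specific input path: one embedding table and one next-token loss cover both.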

Gemini 1.5 / 2 (Google, 2023–2024)

Similar philosophy, extended to audio and video: every modality is turned into a token sequence, and a single transformer is trained over the interleaved stream. Because the models are closed-source, the architectural detail is fuzzy (in particular, how inputs are tokenized is not fully published), but Gemini 1.5's 1M-token context window is what makes long video and audio inputs practical under this token-sequence view.

Why unified-token is the dominant frontier paradigm (2025–2026)

- Conceptual elegance. One model, one loss, one architecture. No special components per modality.
- Generation. Can generate images, audio, and text with the same mechanism.
- Cross-modal grounding. Stronger, because every layer sees every modality.
- Scaling laws unified. With one loss over one token stream, text-style scaling laws (performance as a function of compute and token count) can be fit to multi-modal training as well.

Why unified-token is expensive

- Token explosion. A $512 \times 512$ image at $16 \times 16$ patches = 1024 image tokens. A 30-s audio clip discretized at 50 Hz with a 1024-entry codebook = 1500 tokens. Multi-modal context windows grow fast.
- Pretraining data. You need interleaved multi-modal documents. Web-scraped text is plentiful; web-scraped interleaved multi-modal data is rare and noisy. The data engineering is the bottleneck.
- VQ-VAE training. The image tokenizer is itself a model that must be trained well; codebook collapse is a known failure mode (codes go unused and the effective vocabulary shrinks).
- Compute. Gemini-class pretraining is estimated at roughly $10^{25}$ FLOPs. Out of reach for non-frontier labs.
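The token-explosion arithmetic is easy to reproduce. The helper names below are made up for this sketch, and the video case (1 frame per second, each frame tokenized like a $512 \times 512$ image) is an assumption used only to show how quickly the budget grows.

```python
def image_tokens(height, width, patch):
    """Number of tokens when an image is cut into patch x patch cells."""
    return (height // patch) * (width // patch)

def audio_tokens(seconds, frame_rate_hz, codebooks=1):
    """Number of tokens for audio discretized at a fixed frame rate."""
    return int(seconds * frame_rate_hz * codebooks)

per_image = image_tokens(512, 512, 16)
per_clip = audio_tokens(30, 50)

# Assumption: a 1-minute video sampled at 1 frame/s, one image's worth of tokens per frame.
video_tokens = 60 * per_image

print(per_image)     # 1024
print(per_clip)      # 1500
print(video_tokens)  # 61440
```

Under these assumptions a single minute of video already costs about as many context positions as a short book chapter of text, which is why long-context support and unified-token models tend to arrive together.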

For an applied AI engineer (X2 is in the "Applied AI" hiring path): you will not pretrain a unified-token model. You will use a pretrained one (GPT-4o, Gemini, Claude, possibly Chameleon if open-weight). Understanding the paradigm is for picking the right model and knowing what it can and cannot do.

Training-data scaling for fusion

A frequently-asked question: how much data does each fusion strategy need?

Strategy	Pretraining data	Fusion / instruction-tune data	Total training cost (approx.)
Late fusion (CLIP)	400M (image, text) pairs	N/A	~$10^{22}$ FLOPs
Early fusion w/ frozen encoder + frozen LLM (Flamingo)	none new (reuses both)	~43M interleaved web pages (M3W) plus ~2B image-text and video-text pairs	~$10^{22}$ FLOPs (mostly amortized in encoder + LLM)
Cheap early fusion (LLaVA)	none new (reuses both)	~558k alignment pairs + ~665k instruction samples (LLaVA-1.5)	~$10^{20}$ FLOPs
Unified-token (Chameleon, Gemini)	trillions of tokens (text-only, image-text, and interleaved)	varies	$10^{24}$–$10^{25}$ FLOPs

Notes:
- Late fusion needs the most paired data per parameter. The contrastive loss is data-hungry: every batch needs negatives, every pair needs a positive.
- Early fusion that reuses pretrained components is the cheapest path to a working VLM. LLaVA-1.5 trains in about a day on a single 8×A100 node.
- Unified-token needs more total data than the others, but most of it can be unpaired (the model trains on a mix of text-only, image-only, and interleaved data). Data engineering matters more than total parameter count.

Picking a strategy for a real project

This is the interview-relevant content.

Q: "I want to build a model that, given a photo, generates a description and answers follow-up questions."

A: Cheap early fusion (LLaVA-style). Pretrained CLIP encoder + pretrained LLaMA + projector MLP + instruction-tuning data. Works on a single GPU. Production-quality.

Q: "I want to build a model that, given a 1-minute video, answers questions about it."

A: Unified-token, accessed via API (Gemini 1.5, GPT-4o, or Claude with vision). You will not train this yourself: the infrastructure to train a 1M-token-context unified-token model exists only at frontier labs.

Q: "I want zero-shot image retrieval over a 1M-image catalog."

A: Late fusion (CLIP or SigLIP, off the shelf). Encode each image once at ingest, store the embedding. Query is a text embedding + ANN search. Cheap and fast.
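A minimal sketch of the retrieval path, using exact brute-force search over cached embeddings. The random vectors stand in for real $f_I$ and $f_T$ outputs, and the function names and the 10,000-item catalog are assumptions; at 1M images you would likely swap the matrix product for an ANN index, which is not shown.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(index, query, k=5):
    """index: (n, d) unit-norm image embeddings; query: (d,) text embedding."""
    scores = index @ normalize(query)                    # cosine similarity per image
    candidates = np.argpartition(-scores, k - 1)[:k]     # unordered top-k
    order = candidates[np.argsort(-scores[candidates])]  # sort the k winners
    return order, scores[order]

rng = np.random.default_rng(0)

# Ingest: encode every catalog image once and store the normalized embedding.
catalog = normalize(rng.normal(size=(10_000, 64)))       # stand-in for f_I outputs

# Query: a text embedding that happens to lie near image 42 (stand-in for f_T).
query = catalog[42] + 0.05 * rng.normal(size=64)

ids, scores = top_k(catalog, query, k=5)
print(ids, np.round(scores, 3))
```

The image encoder never runs at query time; only the text encoder and a similarity search do, which is what makes this setup cheap to serve.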

Q: "I want a multi-modal grammar tutor (Phase 32 + X2): user uploads a photo of a textbook sentence; tutor flags verb-form errors."

A: OCR pipeline + text LLM. Don't multi-modally fuse. Use a dedicated OCR model (Tesseract or pretrained TrOCR) to extract the sentence, then the Phase 32 text-only tutor. Modality fusion is overkill — the photo's information content is entirely textual.

(That last answer is the one I'd push on you in an interview. The temptation to use vision-language models everywhere is strong; the right answer is often "OCR + text".)

Summary

- Late fusion (CLIP): two encoders + contrastive loss. Cheap inference. Cannot generate.
- Early fusion (Flamingo / LLaVA): vision encoder + LLM + cross-attention or projection. Can generate. Cheap to train if reusing pretrained components. Gated cross-attention initialized to zero preserves the base LLM.
- Unified-token fusion (Chameleon / Gemini): one vocabulary, one transformer, one loss. Most powerful. Most expensive. Frontier-lab-only training.

Next: theory/04-llava-and-vision-language.md zooms into LLaVA — the most accessible production VLM architecture — in detail.
