
04: LLaVA and the Vision-Language Eval Landscape

LLaVA (Liu et al. 2023) is the most influential open model in the vision-language family. The recipe fits on one page: a pretrained image encoder (CLIP) + a pretrained LLM (Llama / Vicuna) + a small projector (MLP) that maps CLIP features into the LLM's space, trained in two stages. This section covers the recipe, the projector trick, and the benchmark landscape (MM-Bench, MME, MMMU).

This file answers: how does LLaVA work mechanically, and how do you evaluate a multi-modal model?

References:
- Liu et al., Visual Instruction Tuning (LLaVA), NeurIPS 2023. (arXiv:2304.08485)
- Liu et al., Improved Baselines with Visual Instruction Tuning (LLaVA-1.5), CVPR 2024. (arXiv:2310.03744)
- Liu et al., MM-Bench, ECCV 2024. (arXiv:2307.06281)
- Fu et al., MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs, 2023. (arXiv:2306.13394)
- Yue et al., MMMU, CVPR 2024. (arXiv:2311.16502)

The LLaVA recipe

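LLaVA is deliberately simple. The architecture, drawn here with LLaVA-1.0's 224×224 input and the two-layer MLP projector introduced in LLaVA-1.5, is: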

[image] → CLIP-ViT-L/14 (frozen, 300M params) → 256 patch features × 1024 dim
                                                      ↓
                                          PROJECTOR (trainable, ~20M params)
                                          Two-layer MLP: 1024 → 4096 → 4096
                                                      ↓
                                          256 image tokens × 4096 dim
                                                      │
[text prompt] ─────────────────────────────────────┐  │
                                                    ↓ ↓
                                          Vicuna-7B / 13B (LLM)
                                          Concat: [image tokens] [text tokens]
                                                      ↓
                                          standard causal LM
                                                      ↓
                                          generated response

Three components, with three different roles:

CLIP-ViT-L/14 (frozen). Provides the image representation, taken from the penultimate transformer layer as per-patch features rather than the single pooled, post-projection [CLS] embedding that CLIP's contrastive loss optimizes. At 224×224 input there are \(224/14 = 16\) patches per side, so \(16 \times 16 = 256\) patch tokens of 1024 dims each.
The projector (trainable, key innovation). A small MLP that maps each 1024-dim CLIP patch feature to a 4096-dim "LLM token". 256 such tokens are produced per image.
Vicuna-7B (LLM). A pretrained LLaMA that has been instruction-tuned on conversation data. In LLaVA stage 2 it is fully fine-tuned on multi-modal instruction data; LLaVA-1.5 also provides a LoRA variant as a cheaper option.
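
The token counts above follow directly from the patch grid. A minimal sketch of the arithmetic, where the helper names `num_patch_tokens` and `sequence_length` are illustrative (not from the LLaVA codebase) and the 40-token prompt is an arbitrary example:

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT yields for a square image (no [CLS])."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches per side
    return side * side


def sequence_length(image_size: int, num_text_tokens: int) -> int:
    """LLM input length after concatenating image tokens and text tokens."""
    return num_patch_tokens(image_size) + num_text_tokens


print(num_patch_tokens(224))     # 16 * 16 = 256
print(sequence_length(224, 40))  # 256 + 40 = 296
```

The practical consequence is that every image costs a fixed slice of the LLM's context window before any text is read.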

Why the projector is the cheap-but-not-trivial step

Naive question: why not just use CLIP's existing text-aligned image embedding (768-dim for ViT-L/14) directly?

Answer: because CLIP's embedding is one vector per image. The LLM needs many tokens to attend to spatially. The projector takes the 256 patch features (which retain spatial structure) and maps them into the LLM's embedding dim, so each image becomes 256 "tokens" with spatial grounding.

The projector is:

import torch.nn as nn


class Projector(nn.Module):
    """Two-layer MLP mapping CLIP patch features into the LLM embedding space."""

    def __init__(self, d_clip=1024, d_llm=4096):
        super().__init__()
        self.fc1 = nn.Linear(d_clip, d_llm)  # 1024 -> 4096
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_llm, d_llm)   # 4096 -> 4096

    def forward(self, x):  # x: (B, 256, 1024)
        return self.fc2(self.act(self.fc1(x)))  # (B, 256, 4096)

Param count (weights only, ignoring biases): \(1024 \cdot 4096 + 4096 \cdot 4096 \approx 21\)M, about 0.3% of the 7B LLM's parameters.
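
The count can be checked in a few lines, including the bias terms that the formula above leaves out. This is a sketch: the helper names are made up for illustration, and `7e9` is the nominal "7B" figure rather than an exact parameter count for Vicuna-7B.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters in one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def projector_params(d_clip: int = 1024, d_llm: int = 4096, bias: bool = True) -> int:
    """Parameters in the two-layer MLP: d_clip -> d_llm -> d_llm."""
    return linear_params(d_clip, d_llm, bias) + linear_params(d_llm, d_llm, bias)


weights_only = projector_params(bias=False)  # 4_194_304 + 16_777_216
with_bias = projector_params(bias=True)      # adds 4096 + 4096 bias terms
share_of_llm = with_bias / 7e9               # nominal 7B LLM (assumption)

print(weights_only)           # 20971520
print(with_bias)              # 20979712
print(f"{share_of_llm:.2%}")  # 0.30%
```

The biases add only 8,192 parameters, so the "about 21M" figure holds either way.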

Why this is the key step:

It bridges the modality gap (theory/00-motivation.md) — CLIP features live in one cone of \(\mathbb{R}^d\); LLM input embeddings live in another. The projector learns the transformation.
It's the only trainable component in LLaVA stage 1 (everything else is frozen). The model learns multi-modal grounding purely through the projector.
It can be a single linear layer (LLaVA-1.0) or a 2-layer MLP (LLaVA-1.5). The MLP version is meaningfully better — non-linearity matters here, because the modality-bridging transformation is not linear.

Two-stage training

Stage 1 — Feature alignment pretraining.

Trainable: projector only. LLM and vision encoder both frozen.
Data: ~558k (image, caption) pairs drawn from LAION, CC-3M, and SBU in LLaVA-1.5 (LLaVA-1.0 used ~595k pairs filtered from CC-3M).
Task: given image, generate the caption. Standard cross-entropy.
Compute: ~4 hours on 8× A100 for LLaVA-1.0 (roughly 32 GPU-hours).

The goal of stage 1 is just to make the projector output tokens that the frozen LLM can use. The LLM itself learns nothing at this point: only the projector is updated, until the LLM can "describe what the image tokens say".
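
The captioning objective is usually implemented by computing cross-entropy on the caption tokens only, with the image and prompt positions masked out. The source only says "standard cross-entropy", so treat this masking as an assumption about common practice. A minimal sketch, with a hypothetical helper `build_labels` and made-up token ids:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(num_image_tokens, prompt_ids, target_ids):
    """Labels aligned with the input layout [image][prompt][target].

    Image and prompt positions get IGNORE_INDEX so they contribute no loss;
    only the target (caption) positions keep their real token ids. The usual
    one-position shift for next-token prediction is applied later, inside
    the loss computation.
    """
    masked = [IGNORE_INDEX] * (num_image_tokens + len(prompt_ids))
    return masked + list(target_ids)


# 256 image tokens, a 3-token prompt, and a 3-token caption (ids are made up).
labels = build_labels(256, prompt_ids=[11, 12, 13], target_ids=[501, 502, 2])
print(len(labels))   # 262
print(labels[-4:])   # [-100, 501, 502, 2]
```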

Stage 2 — Visual instruction tuning.

Trainable: projector + LLM (full fine-tuning by default; LLaVA-1.5 also provides a LoRA variant). Vision encoder still frozen.
Data: 158k (LLaVA-1.0) or ~665k (LLaVA-1.5) instruction pairs. The LLaVA-1.0 set was generated by feeding image captions + bounding boxes to GPT-4 (text-only at the time) and asking it to produce multi-turn vision-language dialogues. Synthetic but high-quality. LLaVA-1.5 mixes this set with existing academic VQA and OCR datasets.
Task: given image + question, generate answer.

The synthetic-instruction data is the insight that made LLaVA work. Real human-annotated VLM dialogue data is scarce; GPT-4 can produce arbitrary amounts.
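
The two stages differ only in which components receive gradients. The sketch below records that as data and derives the trainable parameter count per stage. It assumes full fine-tuning in stage 2 and uses the rounded sizes quoted in this section; the `Stage` class and component names are illustrative.

```python
from dataclasses import dataclass

# Rounded sizes from this section (assumptions, not exact counts).
PARAMS = {
    "vision_encoder": 300_000_000,
    "projector": 21_000_000,
    "llm": 7_000_000_000,
}


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # component names that receive gradients


STAGE_1 = Stage("feature alignment", frozenset({"projector"}))
STAGE_2 = Stage("visual instruction tuning", frozenset({"projector", "llm"}))


def trainable_params(stage: Stage) -> int:
    """Total parameters updated during the given stage."""
    return sum(PARAMS[name] for name in stage.trainable)


for stage in (STAGE_1, STAGE_2):
    share = trainable_params(stage) / sum(PARAMS.values())
    print(f"{stage.name}: {trainable_params(stage):,} trainable ({share:.2%})")
```

In a PyTorch training script the same plan would typically be applied by setting `requires_grad` to `False` on every parameter of the frozen components.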

LLaVA-1.5 deltas

LLaVA-1.5 (Oct 2023) was a set of small fixes that took LLaVA from "research demo" to state of the art among open models on 11 benchmarks at release:

MLP projector instead of single linear (+1.5 points on most benchmarks).
Higher resolution images. \(336 \times 336\) instead of \(224 \times 224\), giving 576 image tokens instead of 256.
Better instruction data. Added academic-task data (OCR, charts, etc.).
LoRA as an optional, cheaper alternative to full fine-tuning of the LLM.

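LLaVA-1.5-13B clearly outperforms LLaVA-1.0-13B on MM-Bench, yet the paper reports that full training still finishes in about a day on a single 8×A100 node, using far less data than contemporaries such as InstructBLIP or Qwen-VL.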

Evaluation landscape

You can't ship a VLM without evals. The evals don't correlate perfectly — a model can win one and lose another. Knowing what each measures is interview-relevant.

MM-Bench (Liu et al. 2024)

What it measures: perception and reasoning across 20 ability dimensions (object localization, action recognition, OCR, fine-grained classification, spatial reasoning, etc.).
Format: ~3000 multiple-choice questions, each tied to one ability dimension.
Scoring: accuracy per dimension + average. The per-dimension breakdown is what makes MM-Bench useful diagnostically — you see which perceptual ability the model lacks.
Strength: the multiple-choice format is automatic to grade. The per-ability breakdown is actionable.
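Weakness: multiple-choice can be gamed by language priors (the model can sometimes pick the right answer without looking at the image). MM-Bench mitigates this with "CircularEval": each question is asked several times with the answer options circularly shifted, and it counts as correct only if the model is right on every pass.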
Score range typical: strong open VLMs of that generation land in the ~60–70 band (LLaVA-1.5-7B roughly 64, LLaVA-1.5-13B roughly 68). GPT-4V scores ~75.
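
A minimal sketch of the circular evaluation idea: rotate the options, re-ask, and require a correct pick on every rotation. The `answer_fn` callable is a hypothetical stand-in for the VLM (it takes no image here), and the two toy "models" exist only to show the pass/fail behaviour.

```python
from typing import Callable, Sequence


def circular_eval(
    question: str,
    choices: Sequence[str],
    correct: str,
    answer_fn: Callable[[str, Sequence[str]], int],
) -> bool:
    """True only if answer_fn picks `correct` under every circular shift."""
    for shift in range(len(choices)):
        rotated = list(choices[shift:]) + list(choices[:shift])
        picked = answer_fn(question, rotated)  # index into `rotated`
        if rotated[picked] != correct:
            return False
    return True


def always_first(question, choices):
    """Position-biased model: always picks option A."""
    return 0


def knows_answer(question, choices):
    """Model that tracks the answer's content, not its position."""
    return choices.index("cat")


options = ["dog", "cat", "bird", "fish"]
print(circular_eval("What animal is shown?", options, "cat", always_first))  # False
print(circular_eval("What animal is shown?", options, "cat", knows_answer))  # True
```

A position-biased model is right on exactly one of the four rotations, so it fails the question even though single-pass grading would sometimes reward it.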

MME (Fu et al. 2023)

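What it measures: 14 subtasks, split into 10 perception tasks (existence, count, position, color, OCR, etc.) and 4 cognition tasks (commonsense reasoning, numerical calculation, text translation, code reasoning).
Format: ~2000 yes/no questions. Each image has 2 questions: one whose correct answer is "yes", one whose correct answer is "no".
Scoring: per subtask, accuracy (per question) plus accuracy+ (per image, both questions must be right), each out of 100, so 200 per subtask. Perception (max 2000) and Cognition (max 800) are reported separately.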
Strength: yes/no is robust to language priors. The paired-question format eliminates the "model always says yes" failure mode.
Weakness: binary questions can't measure nuance. A model that always correctly identifies "is there a cat?" but cannot describe the cat would score well.
Score range typical: GPT-4V scores Perception ~1500, Cognition ~500. LLaVA-1.5-13B scores ~1530, ~300.
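
The paired-question scoring is easy to get wrong when re-implementing, so here is a small sketch of one subtask's score as accuracy plus accuracy+. The function name and the record layout `(image_id, predicted, expected)` are assumptions for illustration, and the data is made up.

```python
from collections import defaultdict


def mme_subtask_score(records):
    """Score one subtask on a 0-200 scale.

    records: iterable of (image_id, predicted, expected) with "yes"/"no" answers.
    accuracy  = % of questions answered correctly.
    accuracy+ = % of images whose questions are all answered correctly.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    per_image = defaultdict(list)
    for image_id, predicted, expected in records:
        per_image[image_id].append(predicted == expected)
    acc = 100.0 * sum(p == e for _, p, e in records) / len(records)
    acc_plus = 100.0 * sum(all(hits) for hits in per_image.values()) / len(per_image)
    return acc + acc_plus


records = [
    ("img1", "yes", "yes"), ("img1", "no", "no"),   # both questions right
    ("img2", "yes", "yes"), ("img2", "yes", "no"),  # yes-biased on the second
]
print(mme_subtask_score(records))  # 75.0 + 50.0 = 125.0
```

A model that answers "yes" to everything gets 50 accuracy and 0 accuracy+, which is how the pairing penalizes the always-yes failure mode.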

MMMU (Yue et al. 2024)

What it measures: college-level multimodal reasoning across 30 subjects in six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering). Questions taken from college exams, textbooks, and quizzes.
Format: ~11,500 questions, mostly multiple-choice with some open-ended. Images include diagrams, charts, formulas, illustrations.
Scoring: accuracy. Reported overall and per-subject.
Strength: measures reasoning, not just perception. The image is critical: without it, accuracy falls to near chance. Many questions require domain knowledge + visual interpretation.
Weakness: very hard. Even GPT-4V scores only ~55%. Most open VLMs score 30–40%. This is partly the gap that frontier-lab pretraining gives you over open-source.
Why this matters in 2026: MMMU is the canonical "is your model actually intelligent multi-modally" benchmark. MM-Bench tells you about perception; MMMU tells you about reasoning. Top closed models (GPT-4.5, Claude 4, Gemini 2) are pushing this to ~70%; open models still lag at ~50%.
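
Because MMMU is reported both overall and per subject, it helps to compute the two together. A minimal sketch with a hypothetical `accuracy_report` helper and made-up results:

```python
from collections import defaultdict


def accuracy_report(results):
    """results: iterable of (subject, is_correct). Returns (overall, per_subject)."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, is_correct in results:
        totals[subject][0] += int(is_correct)
        totals[subject][1] += 1
    if not totals:
        raise ValueError("need at least one result")
    per_subject = {s: c / n for s, (c, n) in totals.items()}
    overall = sum(c for c, _ in totals.values()) / sum(n for _, n in totals.values())
    return overall, per_subject


# Made-up results: strong on one subject, weak on another.
results = [
    ("Art", True), ("Art", True),
    ("Math", False), ("Math", True), ("Math", False), ("Math", False),
]
overall, per_subject = accuracy_report(results)
print(overall)      # 0.5
print(per_subject)  # {'Art': 1.0, 'Math': 0.25}
```

The overall number of 0.5 hides that the toy model is perfect on one subject and near chance on the other, which is why the per-subject view matters.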

A heuristic for interpreting scores

If your model scores…	…it can probably…	…but it cannot…
MM-Bench < 50	Identify objects	Count, locate, or reason spatially
MM-Bench 50–65	Identify and locate	Multi-step reasoning, OCR on noisy docs
MM-Bench > 65	Most perception tasks	College-level reasoning
MMMU < 40	Basic VQA	Domain-grounded reasoning
MMMU 40–55	Some scientific reasoning	Math-heavy multi-modal problems
MMMU > 55	Frontier-lab class	(frontier territory, 2026)

Benchmarks LLaVA-1.5 was tuned for (and against)

LLaVA-1.5 was tuned heavily on MM-Bench-style data → it scores well there.
It was not tuned for MMMU → it underperforms relative to closed models.

This is a common pattern. The benchmark a model scores best on is often the one it was trained to optimize. When evaluating a VLM for a real use case, run your eval on your data — published numbers are a starting point, not a final answer.
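
Running your own eval does not need much machinery. The sketch below computes exact-match accuracy over your own (image, question, expected answer) triples. Everything here is hypothetical: `model_fn` stands in for whatever VLM call you use, the stub returns a fixed string, and exact match after light normalization is only one possible metric.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    model_fn: Callable[[str, str], str],
) -> float:
    """examples: (image_path, question, expected_answer) triples."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        normalize(model_fn(image_path, question)) == normalize(expected)
        for image_path, question, expected in examples
    )
    return hits / len(examples)


def stub_model(image_path: str, question: str) -> str:
    """Placeholder for a real VLM call; always returns the same answer."""
    return "She walks."


my_examples = [
    ("a.png", "Correct the sentence.", "she walks."),
    ("b.png", "Correct the sentence.", "They walk."),
]
print(exact_match_accuracy(my_examples, stub_model))  # 0.5
```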

How LLaVA connects to the grammar tutor (Phase 32)

The §A13 grammar tutor's natural multi-modal extension (from theory/00-motivation.md): user uploads a photo of a sentence, tutor flags verb-form errors. Using LLaVA-style architecture, this would be:

[photo of sentence] → CLIP-ViT (frozen) → 256 patch features
                                             ↓
                                  Projector (trainable, ~20M params)
                                             ↓
                                  256 image tokens
                                             ↓
                                  Vicuna-7B (or smaller: Phi-3, Llama-3.2-1B-Instruct)
                                  System prompt: "You are an English grammar tutor.
                                  Look at the photo. Find the verb. Identify its tense.
                                  Check that subject and verb agree. Output the
                                  corrected sentence + explanation."
                                             ↓
                                  Generated response

The right answer is actually not to do this — as theory/03-fusion-strategies.md argues, OCR + text-only LLM is the smarter pipeline. But if you wanted to demo a multi-modal capstone, this is the recipe.

The grammar-tutor instruction data for stage 2 would be: take §A13's 20 verbs × 5 tenses × 3 persons = 300 correct sentences and 300 incorrect sentences (with planted errors), render each as a photo of a textbook page (using Pillow), pair with the corrected version + explanation. ~600 instruction pairs. Stage-2-trainable on a CPU? No (too slow). Single GPU-day? Yes.
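
The 600-pair count comes from enumerating the full grid twice, once for correct sentences and once with a planted error. A sketch of that enumeration follows. The verb placeholders and the five tense names are assumptions (the actual §A13 lists are not reproduced here), and sentence generation plus Pillow rendering are deliberately left out.

```python
from itertools import product

# Placeholders: the real verb and tense lists live in §A13 (not shown here).
VERBS = [f"verb_{i:02d}" for i in range(20)]
TENSES = [
    "present_simple", "past_simple", "future_simple",
    "present_perfect", "present_continuous",
]  # assumed names for the 5 tenses
PERSONS = ["first", "second", "third"]


def build_specs():
    """One spec per (verb, tense, person), in a correct and a planted-error version."""
    specs = []
    for verb, tense, person in product(VERBS, TENSES, PERSONS):
        for has_planted_error in (False, True):
            specs.append({
                "verb": verb,
                "tense": tense,
                "person": person,
                "has_planted_error": has_planted_error,
            })
    return specs


specs = build_specs()
print(len(specs))  # 20 * 5 * 3 * 2 = 600
```

Each spec would then be turned into a sentence, rendered to an image, and paired with the corrected sentence and explanation to form one instruction example.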

This is the path X2 prepares for. The labs don't go there (one extension is enough), but the recipe is unambiguous.

Summary

LLaVA recipe: frozen CLIP + projector MLP + LLM, two-stage training (alignment pretraining + visual instruction tuning).
The projector is the cheap-but-not-trivial step. ~20M params bridge a ~300M vision encoder and a ~7B LLM. Non-linearity matters.
Eval landscape: MM-Bench (perception, per-ability) + MME (yes/no perception + cognition) + MMMU (college-level reasoning). Don't pick a VLM based on a single score.
For the grammar tutor: the multi-modal extension exists as a recipe, but text-only + OCR is the engineering-right answer.

Next: lab/00-tiny-vit-on-grammar-icons.md — implement a 4-block ViT on top of the Phase 17 transformer block and train it on synthetic icons.