
01 — Vision Transformers: ViT, CLIP, DINOv2, SigLIP¶

The central idea of ViT (Dosovitskiy 2020) is trivial but striking: an image is a sequence of patches. Apply patchify plus a standard transformer (the same one as in Phase 17) and you get a competitive visual encoder. CLIP (Radford 2021) adds the piece that matters for X2: aligning that visual encoder with a text encoder via a symmetric contrastive loss.

This file derives ViT's patchify math from scratch, walks through CLIP's symmetric InfoNCE, and surveys DINOv2 / SigLIP improvements relevant to lab 00 and lab 01.

References:
- Dosovitskiy et al., An Image is Worth 16×16 Words, ICLR 2021. (arXiv:2010.11929)
- Radford et al., Learning Transferable Visual Models From Natural Language Supervision, ICML 2021. (arXiv:2103.00020)
- Oquab et al., DINOv2, TMLR 2024. (arXiv:2304.07193)
- Zhai et al., Sigmoid Loss for Language Image Pre-training (SigLIP), ICCV 2023. (arXiv:2303.15343)

ViT: patchify is the only new idea¶

The Vision Transformer (ViT) takes a standard transformer identical to the language transformer of Phase 17 and feeds it image patches instead of word tokens. The only architectural novelty is the embedding step: patchify.

Patchify math (memorize this)¶

Input: an RGB image of shape (H, W, C) = (224, 224, 3). We pick a patch size \(P = 16\).

Step 1. Reshape the image into non-overlapping \(P \times P \times C\) patches:

\[ H_p = H / P = 224 / 16 = 14, \quad W_p = W / P = 14 \]

so we get \(H_p \cdot W_p = 14 \cdot 14 = 196\) patches, each of size \((16, 16, 3)\).

Step 2. Flatten each patch to a vector of dimension \(P^2 \cdot C = 16 \cdot 16 \cdot 3 = 768\). Stack the vectors into a matrix \(\mathbf{X}_{\text{patches}} \in \mathbb{R}^{196 \times 768}\).

Step 3. Linear projection to embedding dim \(d_{model}\). For ViT-Base, \(d_{model} = 768\) so this projection is square. For ViT-Large, \(d_{model} = 1024\) so it's an upward projection.

\[ \mathbf{X}_{\text{tokens}} = \mathbf{X}_{\text{patches}} \mathbf{W}_E \quad \text{where} \quad \mathbf{W}_E \in \mathbb{R}^{768 \times d_{model}} \]

Step 4. Prepend the learnable [CLS] token. Now \(\mathbf{X} \in \mathbb{R}^{197 \times d_{model}}\). (The [CLS] token's final representation is what you use for classification, analogous to BERT.)

Step 5. Add learnable position embeddings \(\mathbf{E}_{\text{pos}} \in \mathbb{R}^{197 \times d_{model}}\). Note these are learnable, not sinusoidal as in the original transformer. The 1-D ordering corresponds to a row-major raster scan over patches; the ViT paper shows the model learns the 2-D structure anyway.

Step 6. Forward through N standard transformer encoder blocks (Phase 17, no causal mask). Take the final [CLS] representation.

Step 7. A classification head (linear → softmax over \(K\) classes) on the [CLS] representation.
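Steps 1 through 5 can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: random arrays stand in for learned parameters, and the function name `embed_image` and the 0.02 initialization scale are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_image(x, P, W_E, cls_token, pos_embed):
    """Steps 1-5: (H, W, C) image -> (1 + H_p * W_p, d_model) token matrix."""
    H, W, C = x.shape
    Hp, Wp = H // P, W // P
    # Steps 1-2: cut into non-overlapping P x P x C patches and flatten each.
    patches = x.reshape(Hp, P, Wp, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(Hp * Wp, P * P * C)           # (196, 768)
    # Step 3: linear projection to d_model.
    tokens = patches @ W_E                                   # (196, d_model)
    # Step 4: prepend the [CLS] token.
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)
    # Step 5: add position embeddings.
    return tokens + pos_embed                                # (197, d_model)

H = W = 224
C, P, d_model = 3, 16, 768
x = rng.standard_normal((H, W, C)).astype(np.float32)
W_E = 0.02 * rng.standard_normal((P * P * C, d_model)).astype(np.float32)
cls_token = np.zeros(d_model, dtype=np.float32)
pos_embed = 0.02 * rng.standard_normal((197, d_model)).astype(np.float32)

X = embed_image(x, P, W_E, cls_token, pos_embed)
print(X.shape)  # (197, 768)
```

The output of this function is what Step 6 feeds into the encoder blocks.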

Patchify as einsum¶

The patchify-reshape step is a reshape, an axis permutation (written here as a single einsum), and a final reshape; the implementation is a few lines of NumPy or PyTorch:

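# x: (B, H, W, C); H, W must be divisible by P.
# Produces (B, H_p * W_p, P*P*C).
import numpy as np

def patchify(x: np.ndarray, P: int) -> np.ndarray:
    B, H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    Hp, Wp = H // P, W // P
    # Split each spatial axis into (patch index, within-patch offset):
    # (B, H, W, C) -> (B, Hp, P, Wp, P, C)
    x = x.reshape(B, Hp, P, Wp, P, C)
    # Bring the two patch-index axes together: (B, Hp, Wp, P, P, C)
    x = np.einsum("bhpwqc->bhwpqc", x)
    # Flatten the patch grid (row-major) and each patch's contents.
    x = x.reshape(B, Hp * Wp, P * P * C)
    return x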

This is mechanically all there is to "vision-as-a-transformer". The patchify view of an image is lossy in spatial resolution (a (16, 16, 3) patch is collapsed to a single token, so there is no within-patch attention), but every token attends to every other token from the first transformer block onward, so each block's receptive field is the full image and global context is available immediately.
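A quick sanity check of the patch ordering, as a sketch: the function is repeated so the snippet runs on its own, and the small 32 × 32 input is an arbitrary choice for the example. Token `k` should be the patch at grid position `(k // Wp, k % Wp)`, flattened in (row, column, channel) order.

```python
import numpy as np

def patchify(x: np.ndarray, P: int) -> np.ndarray:
    B, H, W, C = x.shape
    Hp, Wp = H // P, W // P
    x = x.reshape(B, Hp, P, Wp, P, C)
    x = np.einsum("bhpwqc->bhwpqc", x)
    return x.reshape(B, Hp * Wp, P * P * C)

# Distinct values everywhere, so any ordering mistake would show up.
x = np.arange(2 * 32 * 32 * 3, dtype=np.float32).reshape(2, 32, 32, 3)
tokens = patchify(x, P=16)

top_left = x[:, :16, :16, :].reshape(2, -1)      # grid position (0, 0)
top_right = x[:, :16, 16:32, :].reshape(2, -1)   # grid position (0, 1)

print(tokens.shape)                               # (2, 4, 768)
print(np.array_equal(tokens[:, 0], top_left))     # True
print(np.array_equal(tokens[:, 1], top_right))    # True
```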

Parameter count for ViT-Base¶

ViT-Base/16 (the canonical config):

Patch embedding \(\mathbf{W}_E\): \(768 \times 768 = 589,824\) params (+768 bias).
[CLS] token: \(768\).
Position embedding: \(197 \times 768 = 151,296\).
12 transformer blocks. Each block (in Phase 17 form): \(4 \cdot d^2\) for QKV+O attention + \(8 \cdot d^2\) for MLP (4× expansion) = \(12 d^2 = 12 \cdot 768^2 = 7,077,888\) params (plus norms & biases, ~10k each).
Total: \(\approx 0.74 \text{M} + 12 \cdot 7.08 \text{M} \approx 86 \text{M}\) params.

Compare to the §A13 mini-GPT (Phase 17): \(\sim 103,680\) params. ViT-Base is ~830× larger.

For lab 00 we use a tiny ViT: \(d = 64\), 4 blocks, image \(32 \times 32\), patch \(4 \times 4\). Token count = \(64\) patches + 1 [CLS] = \(65\). Total params ~\(200\) k, CPU-trainable.
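The arithmetic above can be reproduced with a small helper. This is a sketch under the same simplifications as the text: it counts \(12 d^2\) per block and ignores per-block norms and biases and the classification head. The function name `vit_param_count` is made up for this example.

```python
def vit_param_count(image_size: int, patch_size: int, channels: int,
                    d_model: int, n_blocks: int) -> int:
    n_patches = (image_size // patch_size) ** 2
    patch_dim = patch_size * patch_size * channels
    patch_embed = patch_dim * d_model + d_model   # W_E plus bias
    cls_token = d_model
    pos_embed = (n_patches + 1) * d_model         # +1 for [CLS]
    per_block = 12 * d_model ** 2                 # 4 d^2 attention + 8 d^2 MLP
    return patch_embed + cls_token + pos_embed + n_blocks * per_block

base = vit_param_count(224, 16, 3, 768, 12)
tiny = vit_param_count(32, 4, 3, 64, 4)
print(f"ViT-Base/16: {base:,}")   # ViT-Base/16: 85,677,312
print(f"lab 00 tiny: {tiny:,}")   # lab 00 tiny: 203,968
```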

Why ViT beats CNNs (eventually)¶

The original ViT paper's headline: ViT matches ResNet on ImageNet only when pretrained on JFT-300M (a 300 M-image proprietary dataset). On ImageNet-1k alone, ViT loses to ResNet.

The mechanistic reason: CNNs encode translation equivariance and a local-receptive-field bias in their architecture. ViT has neither; it has to learn both from data. Given enough data (on the order of 100 M images or more), it does, and then surpasses CNNs because it can route information globally from layer 1.

For the §A13 grammar-icon task in lab 00 the icons are synthetic and small (\(32 \times 32\), 5 tense × 3 person = 15 classes with 60 examples each). ViT will work — the task is easy — but the lesson is mechanistic (patchify + transformer), not performance.

CLIP: contrastive image-text training¶

CLIP (Contrastive Language-Image Pretraining) is the paper that made image-text alignment a commodity. It is two encoders + one loss:

Image encoder. A ViT or ResNet. Outputs an image embedding \(\mathbf{z}_I \in \mathbb{R}^d\) (typically the [CLS] token, L2-normalized).
Text encoder. A transformer (autoregressive on text). Outputs a text embedding \(\mathbf{z}_T \in \mathbb{R}^d\) from the [EOS] token's representation, L2-normalized.
Symmetric InfoNCE loss. Given a batch of \(N\) (image, text) pairs, the model must pick the correct text for each image and the correct image for each text.

CLIP loss in code¶

Given a batch of \(N\) pairs, \(\mathbf{Z}_I \in \mathbb{R}^{N \times d}\) and \(\mathbf{Z}_T \in \mathbb{R}^{N \times d}\) (both row-L2-normalized):

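import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    # Mean softmax cross-entropy over rows. Subtracting the row-wise max
    # keeps the exponentials from overflowing at small tau.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def clip_loss(Z_I: np.ndarray, Z_T: np.ndarray, tau: float = 0.07) -> float:
    # Z_I, Z_T: (N, d), each row L2-normalized.
    N = Z_I.shape[0]
    # Cosine-similarity matrix, scaled by 1/tau. Shape (N, N).
    logits = Z_I @ Z_T.T / tau
    # Ground truth: diagonal is correct match.
    labels = np.arange(N)
    # Symmetric: image-as-query AND text-as-query.
    loss_i2t = cross_entropy(logits, labels)         # rows
    loss_t2i = cross_entropy(logits.T, labels)       # cols
    return (loss_i2t + loss_t2i) / 2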

The two terms are symmetric: the same matrix is used both row-wise and column-wise. The temperature \(\tau\) (CLIP uses a learnable parameter, initialized to \(0.07\)) scales the logits before the softmax. Lower \(\tau\) → sharper softmax → more contrastive pressure.
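To see what the temperature does, the sketch below applies a softmax to one row of similarities at \(\tau = 1\) and \(\tau = 0.07\). The similarity values are hypothetical, picked by hand for illustration; index 0 plays the role of the matching text.

```python
import numpy as np

def softmax(v: np.ndarray) -> np.ndarray:
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical cosine similarities of one image against 4 texts.
sims = np.array([0.30, 0.25, 0.10, -0.05])

p_warm = softmax(sims / 1.0)    # tau = 1: nearly uniform
p_cold = softmax(sims / 0.07)   # tau = 0.07: mass concentrates on the best match
print("tau=1.00:", np.round(p_warm, 3))
print("tau=0.07:", np.round(p_cold, 3))
```

With \(\tau = 1\) the small gaps between cosine similarities barely move the distribution; dividing by \(0.07\) stretches those gaps so the cross-entropy gradient focuses on the closest negatives.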

Why symmetric matters¶

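With only the image-to-text term, the softmax normalizes each row of \(\mathbf{Z}_I \mathbf{Z}_T^\top\) separately: every image must rank its own text above the other texts, but nothing requires a text to rank its own image above the other images. One image can score high against every text (its row still has the diagonal on top) and then win whole columns, so text-to-image retrieval fails even though image-to-text retrieval is perfect. The off-diagonal is low, but only along one direction. The symmetric loss normalizes along both rows and columns, which forces the off-diagonal to be low in both directions.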
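A tiny numeric illustration of the one-direction problem, as a sketch: the 2 × 2 similarity matrix is hand-picked (hypothetical values, not produced by any encoder) so that every row has its diagonal entry on top while one column does not.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Rows = images, columns = texts. Hypothetical similarities.
S = np.array([[0.9, 0.5],
              [0.1, 0.4]])
labels = np.arange(2)
tau = 0.07

i2t_acc = (S.argmax(axis=1) == labels).mean()   # each image picks a text
t2i_acc = (S.argmax(axis=0) == labels).mean()   # each text picks an image
loss_i2t = cross_entropy(S / tau, labels)
loss_t2i = cross_entropy(S.T / tau, labels)

print(i2t_acc, t2i_acc)  # 1.0 0.5
print(loss_i2t < loss_t2i)  # True
```

The row-wise loss is already small here, so training on it alone would leave the failing column largely uncorrected; the column-wise term is where the remaining signal is.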

CLIP's actual training scale¶

400 M (image, text) pairs from the public internet (the WIT dataset, which is internal to OpenAI).
Image encoder: ViT-L/14 (~300 M params).
Text encoder: transformer (~63 M params, vocab 49k, context 77 tokens).
Batch size: 32,768 (= \(N\) in the loss above). The contrastive loss is fundamentally batch-size-bound — larger batch = more negatives = harder, more informative loss.
256 V100 GPUs × 12 days (the figure the CLIP paper reports for its largest ViT). ~\(10^{22}\) FLOPs.

The batch-size requirement is the engineering reason CLIP is hard to reproduce on small hardware. Lab 01 uses \(N = 32\) in a single CPU batch — qualitatively the same loss but with much weaker contrastive pressure. We make up for it with a small, low-diversity dataset (20 verbs × tense icons; the negatives are easy because the task is small).

CLIP's downstream uses¶

Zero-shot classification. "Classify image \(x\) over \(K\) classes whose names you have as text." Embed each class name and pick the class whose text embedding has the highest cosine similarity with the image embedding. Competitive with supervised baselines on many benchmarks without any training on the downstream task.
Retrieval. Image → text and text → image. Production search engines (Pinterest, Unsplash) use CLIP-flavor embeddings.
Feature extractor for downstream multi-modal models. LLaVA's vision encoder is CLIP-ViT-L/14 frozen. (See theory/04-llava-and-vision-language.md.)
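The zero-shot recipe from the list above can be sketched as below. No real encoder is involved: random vectors stand in for the class-name text embeddings, and the "image" embedding is built as a noisy copy of class 2 so the expected answer is known. The function names are illustrative, not part of any CLIP library.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb: np.ndarray, class_text_embs: np.ndarray):
    """Return (predicted class index, cosine similarities to all K classes)."""
    sims = l2_normalize(class_text_embs) @ l2_normalize(image_emb)
    return int(np.argmax(sims)), sims

rng = np.random.default_rng(0)
K, d = 5, 64
# Stand-ins for text_encoder("a photo of a <class name>"); purely synthetic.
class_text_embs = rng.standard_normal((K, d))
# Stand-in for image_encoder(x): class 2's embedding plus small noise.
image_emb = class_text_embs[2] + 0.1 * rng.standard_normal(d)

pred, sims = zero_shot_classify(image_emb, class_text_embs)
print(pred)  # 2
```

Retrieval uses the same similarity computation with the roles swapped: rank a gallery of image embeddings against one text embedding, or the reverse.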

DINOv2: self-supervised vision pretraining without text¶

DINOv2 (Oquab et al. 2024, Meta) trains a ViT without any text labels using a self-distillation loss between two augmented views of the same image (there are no negative pairs, so it is not contrastive in the InfoNCE sense). The output is a vision encoder that:

Beats CLIP-ViT on dense prediction tasks (segmentation, depth) at equal parameter count.
Matches CLIP-ViT on classification/retrieval.
Has cleaner features for downstream LLM fusion — DINOv2 features are increasingly used as the backbone in LLaVA-class models (e.g. some Llama-3-Vision variants).

Why DINOv2 matters for X2: DINOv2 features are what you'd use today (2026) if building a new multi-modal LLM from scratch and wanted the strongest frozen vision encoder. CLIP is what was used in 2022–2024.

The training trick (relevant to Phase 18 / training dynamics): a teacher ViT (EMA of the student weights) and a student ViT. Both see different crops of the same image. The student is trained to predict the teacher's [CLS] representation. There's also a patch-level objective via masked image modeling. The combined objective produces features that excel at both global (image-level) and local (patch-level) tasks.
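The teacher-as-EMA part of this setup can be sketched in a few lines. This shows only the weight update, not the DINOv2 loss or augmentations; the parameter dictionaries, the function name `ema_update`, and the momentum value 0.99 are assumptions chosen for illustration.

```python
import numpy as np

def ema_update(teacher: dict, student: dict, momentum: float = 0.99) -> dict:
    """In-place update: teacher <- m * teacher + (1 - m) * student, per parameter."""
    for name in teacher:
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student[name]
    return teacher

# Toy "weights": the student is held fixed at 1, the teacher starts at 0.
student = {"w": np.ones((2, 2)), "b": np.ones(2)}
teacher = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

for _ in range(100):
    ema_update(teacher, student)

# After k steps toward a fixed student, teacher = 1 - momentum**k.
print(np.allclose(teacher["w"], 1 - 0.99 ** 100))  # True
```

The teacher receives no gradients; it only trails the student, which is what gives the student a slowly moving target to predict.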

We do not implement DINOv2 in the labs. Treat it as the reference point for "what's the current SOTA vision backbone for multi-modal models in 2026".

SigLIP: sigmoid loss as a CLIP improvement¶

SigLIP (Zhai et al. 2023) is one change to CLIP: replace softmax-InfoNCE with per-pair binary cross-entropy (sigmoid).

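Concretely, write \(\mathbf{z}_{I,i}\) for the image embedding of pair \(i\) and \(\mathbf{z}_{T,j}\) for the text embedding of pair \(j\). Instead of the per-image softmax term:

\[ \mathcal{L}_{i2t}^{(i)} = -\log \frac{\exp(\mathbf{z}_{I,i} \cdot \mathbf{z}_{T,i} / \tau)}{\sum_j \exp(\mathbf{z}_{I,i} \cdot \mathbf{z}_{T,j} / \tau)} \]

SigLIP uses:

\[ \mathcal{L}_{\text{sigmoid}} = -\sum_{i, j} \left[ y_{ij} \log \sigma(\mathbf{z}_{I,i} \cdot \mathbf{z}_{T,j} / \tau + b) + (1 - y_{ij}) \log \left(1 - \sigma(\mathbf{z}_{I,i} \cdot \mathbf{z}_{T,j} / \tau + b)\right) \right] \]

where \(y_{ij} = 1\) if \((i, j)\) is a true pair (\(i = j\) within the batch) and \(0\) otherwise, and \(b\) is a learnable bias initialized negative (so off-diagonal pairs start near \(\sigma(\text{negative}) \approx 0\), matching the prior that random pairs don't match).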
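The sigmoid loss above can be sketched in NumPy as follows. It implements the unnormalized sum exactly as written in the equation (dividing by \(N\) is a common normalization). The defaults `tau=0.1` and `b=-10.0` are illustrative assumptions, and in training both would be learnable. `np.logaddexp` is used for a numerically stable log-sigmoid.

```python
import numpy as np

def siglip_loss(Z_I: np.ndarray, Z_T: np.ndarray,
                tau: float = 0.1, b: float = -10.0) -> float:
    # Z_I, Z_T: (N, d), each row L2-normalized.
    N = Z_I.shape[0]
    logits = Z_I @ Z_T.T / tau + b                 # (N, N)
    y = np.eye(N)                                  # 1 on true pairs, 0 elsewhere
    log_sig = -np.logaddexp(0.0, -logits)          # log sigma(x)
    log_one_minus_sig = -np.logaddexp(0.0, logits) # log(1 - sigma(x))
    return float(-(y * log_sig + (1.0 - y) * log_one_minus_sig).sum())

# Orthonormal rows: pair i matches only itself.
Z = np.eye(4, 8)
aligned = siglip_loss(Z, Z)          # texts in the right order
shuffled = siglip_loss(Z, Z[::-1])   # texts reversed, so every pair is wrong
print(aligned < shuffled)  # True
```

Each entry of the \(N \times N\) matrix contributes its own term, which is why the pair matrix can be processed in independent chunks.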

Why this is better:

Memory. The softmax denominator scales with batch size \(N\) (each row sums over all \(N\) columns). Sigmoid is fully decoupled — each \((i, j)\) pair is independent. You can shard the pair matrix across devices without an all-gather.
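Weaker batch-size dependence. SigLIP works well at small batch sizes (≤ 1k), where the SigLIP paper reports it outperforming the softmax loss, and its gains saturate at a batch size of around 32k. CLIP needs batch ≥ 16k to be competitive.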
Slightly better quality. SigLIP-B/16 outperforms CLIP-B/16 on most benchmarks by 1–2 points at equal compute.

Why we still teach CLIP first: the symmetric softmax is conceptually cleaner (it's a probability distribution; ties to Phase 18 cross-entropy directly), and it's still the dominant baseline. SigLIP is the production-grade replacement.

For lab 01 we implement CLIP's symmetric InfoNCE. SigLIP is a one-line change you can do as a stretch goal.

Summary of the visual side¶

A vision transformer is the Phase 17 transformer block + patchify + [CLS] + learnable 1-D position embeddings. Nothing else.
Patchify: (224, 224, 3) → 14 × 14 × (16, 16, 3) → 196 tokens of 768 dim. One reshape, one axis permutation, one reshape.
CLIP: two encoders, contrastive symmetric-softmax loss. Batch-size-bound at scale.
DINOv2: self-supervised; produces stronger features for downstream multi-modal fusion. Reference point, not implemented.
SigLIP: drop-in CLIP replacement with sigmoid loss. Less memory, weaker batch-size dependence. Production today.

Next: theory/02-audio-models.md covers the audio side — the modality that doesn't fit into a "tokens with position embeddings" frame as cleanly as images do.