Lab 01 — CLIP-style Grammar Image-Text Pairing¶
Goal: build a minimal CLIP-style contrastive trainer. Reuse the lab 00 ViT (without classification head) as the image encoder; build a tiny text encoder (Phase 17 transformer in encoder mode) for verb-form text snippets; train with symmetric InfoNCE on (icon, text) pairs; benchmark top-k retrieval.
We implement CLIP's contrastive training loop on our grammar domain. The key metric: given an icon image, is the correct text snippet retrieved in the top-k? And vice versa. This verifies that the two modalities align in the same embedding space.
Estimated time: 4–5 hours.
Prereqs:

- Lab 00 complete (you have a working ViT).
- `docs/extension-track/X2-multimodal/theory/01-vision-transformers.md` §"CLIP" understood, specifically the symmetric InfoNCE loss.
What you produce¶
A directory experiments/X2-multimodal/lab-01-clip-grammar/ containing:
- `BLUEPRINT.md`
- `dataset.py`: pairs each icon from lab 00 with a text snippet (the verb form it represents).
- `image_encoder.py`: lab 00's ViT, classification head removed; outputs the [CLS] representation (64-dim).
- `text_encoder.py`: tiny transformer over word-level tokens of the verb-form snippet; outputs the [EOS] representation (64-dim).
- `clip_model.py`: wraps both encoders + L2-normalize + temperature scalar.
- `loss.py`: `symmetric_infonce`.
- `train.py`
- `eval.py`: top-1, top-5 retrieval in both directions.
- `manifest.json`
- `README.md`
- `loss.png`, `retrieval_curves.png`, `confusion_matrix_top1.png`
The dataset¶
We reuse the 1000 icons from lab 00. For each icon, we generate a small set of equivalent text snippets describing the same (verb, tense, person) triple. Examples for the icon encoding (verb=work, tense=present_simple, person=3sg):
For each icon, sample 1 text snippet per training step (data augmentation). Total pairs per epoch ≈ 1000.
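A minimal sketch of per-step snippet sampling for present-simple triples. The templates here are hypothetical, assembled only from words in this lab's word-level vocabulary; they are not the lab's own snippet list, and `snippets_for` and `sample_snippet` are illustrative names.

```python
import random

# Hypothetical surface templates for one (verb, present_simple, person) triple.
# The lab's own snippet list may differ; these only illustrate the idea.
PRONOUN = {"1sg": "I", "2sg": "you", "3sg": "he"}
ORDINAL = {"1sg": "1st", "2sg": "2nd", "3sg": "3rd"}

def present_simple(verb, person):
    """Regular present-simple inflection: add -s for 3sg only."""
    return verb + "s" if person == "3sg" else verb

def snippets_for(verb, person):
    """All equivalent snippets for (verb, present_simple, person)."""
    form = present_simple(verb, person)
    return [
        f"{PRONOUN[person]} {form}",
        f"{ORDINAL[person]} person singular {form}",
        f"{ORDINAL[person]} person present of {verb}",
        f"present simple {PRONOUN[person]} {form}",
    ]

def sample_snippet(verb, person, rng):
    """Draw one snippet per training step (text-side augmentation)."""
    return rng.choice(snippets_for(verb, person))

rng = random.Random(0)
print(snippets_for("work", "3sg"))
print(sample_snippet("work", "3sg", rng))
```

Every template carries an explicit person marker, because English 1sg and 2sg share the same verb form and a bare "work" would not identify the triple.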
Text vocabulary¶
Don't use BPE here — overkill. Build a small word-level vocabulary from the corpus:
- 20 verbs (`work, play, walk, ...`) → 20 tokens.
- Inflected forms (`works, worked, working, ...`) → ~80 tokens (4-5 forms × 20 verbs).
- Pronouns: `I, you, he, she, it` → 5 tokens.
- Tense words: `present, past, simple, participle, future` → 5 tokens.
- Auxiliary: `will, going, to` → 3 tokens.
- Glue: `of, the, person, singular, 3rd, 2nd, 1st` → 7 tokens.
- Special: `<BOS>, <EOS>, <PAD>, <UNK>` → 4 tokens.
Total: ~124 tokens. Tiny vocabulary, tractable embedding table.
Tokenize snippets: lowercase, split on whitespace, map each word to its vocab ID (out-of-vocabulary words map to `<UNK>`). Wrap the sequence in `<BOS>` and `<EOS>`, then right-pad with `<PAD>` to length 8.
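The tokenization steps can be sketched as follows. This is one possible layout, not the lab's reference code: the ID order (specials first), the truncation of over-long snippets, and the names `build_vocab` and `encode` are all assumptions.

```python
PAD, BOS, EOS, UNK = "<PAD>", "<BOS>", "<EOS>", "<UNK>"
MAX_LEN = 8  # padded length from the lab text

def build_vocab(corpus):
    """Map each lowercase word in the corpus to an integer ID; specials come first."""
    vocab = {tok: i for i, tok in enumerate([PAD, BOS, EOS, UNK])}
    for snippet in corpus:
        for word in snippet.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(snippet, vocab, max_len=MAX_LEN):
    """Lowercase, split on whitespace, wrap in <BOS>/<EOS>, right-pad with <PAD>."""
    # Assumption: over-long snippets are truncated to leave room for <BOS>/<EOS>.
    words = snippet.lower().split()[: max_len - 2]
    ids = [vocab[BOS]] + [vocab.get(w, vocab[UNK]) for w in words] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

corpus = ["he works", "3rd person singular works"]
vocab = build_vocab(corpus)
print(encode("He works", vocab))   # [1, 4, 5, 2, 0, 0, 0, 0]
print(encode("she works", vocab))  # [1, 3, 5, 2, 0, 0, 0, 0] ("she" is unseen, so <UNK>)
```

Note that after right-padding, `<EOS>` is generally not at the final index, which matters when the text encoder pools the [EOS] position.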
Why these snippets¶
- Multiple snippets per (verb, tense, person) → augmentation. Model can't memorize "icon X = snippet Y"; it must learn the structure (color → tense, shape → person, pattern → verb).
- Snippet variety mimics CLIP's real-world setting where captions vary in style.
Architecture: the CLIP model¶

    ┌───────── image branch ─────────┐    ┌──────── text branch ─────────┐
      icon          (32, 32, 3)             tokens       (T,), e.g. 8 tokens
        ↓ patchify (P=4)                      ↓ embedding lookup
      patches       (64, 48)                token emb    (8, 64)
        ↓ linear proj                         ↓ + learned pos emb
      patch tokens  (64, 64)                embedded     (8, 64)
        ↓ prepend [CLS]                       ↓ × 2 transformer blocks
      tokens        (65, 64)                output       (8, 64)
        ↓ + pos emb                           ↓ take [EOS] (last position)
        ↓ × 4 transformer blocks            z_T          (64,)
      output        (65, 64)                  ↓ proj_T (Linear 64 → 64)
        ↓ take [CLS]                        z_T          (64,)
      z_I           (64,)                     ↓ L2 normalize
        ↓ proj_I (Linear 64 → 64)           z_T_norm     (64,)
      z_I           (64,)
        ↓ L2 normalize
      z_I_norm      (64,)
    └────────────────────────────────┘    └──────────────────────────────┘
                    ↓                                     ↓
              logits = z_I_norm @ z_T_norm.T / tau        (N × N)
                                  ↓
        Symmetric InfoNCE: cross-entropy on diagonal, both directions

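Note the L2-normalization + temperature: normalizing both embeddings to unit length turns the dot product into a cosine similarity in \([-1, 1]\), and dividing by \(\tau\) rescales that bounded range so the softmax can become sharp enough for the InfoNCE loss to separate matches from non-matches.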
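A short NumPy check of that claim, as a sketch: after row-wise L2 normalization every logit is a cosine similarity divided by \(\tau\), so it is bounded by \(\pm 1/\tau\). The helper name `l2_normalize` and the `eps` guard are assumptions, not part of the lab spec.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Scale each row to unit length; eps guards against an all-zero row."""
    return z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), eps)

rng = np.random.default_rng(0)
z_I = l2_normalize(rng.normal(size=(4, 64)))
z_T = l2_normalize(rng.normal(size=(4, 64)))
tau = 0.1

logits = z_I @ z_T.T / tau   # (4, 4) cosine similarities scaled by 1/tau

# Cosine similarity lies in [-1, 1], so every logit lies in [-1/tau, 1/tau].
assert np.allclose(np.linalg.norm(z_I, axis=1), 1.0)
assert np.all(np.abs(logits) <= 1.0 / tau + 1e-9)
```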
Param counts¶
| Component | Params (approx) |
|---|---|
| Image encoder (lab 00 ViT) | 150k |
| Image projection head | 4k |
| Text token embedding (124 × 64) | 8k |
| Text position embedding (8 × 64) | 512 |
| Text transformer (2 blocks) | 60k |
| Text projection head | 4k |
| Temperature τ (scalar) | 1 |
| Total | ~230k |
CPU-trainable. ~1 minute per epoch on the i5-8250U at batch size 32.
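The rows of the table that depend only on this lab's stated sizes can be checked by hand. The sketch below assumes the projection heads carry a bias term; the two encoder totals are copied from the table rather than derived, since they depend on your lab 00 and Phase 17 code.

```python
d, vocab_size, max_len = 64, 124, 8

token_emb = vocab_size * d   # 7,936 (table: ~8k)
pos_emb = max_len * d        # 512
proj_head = d * d + d        # 4,160 with bias, an assumption (table: ~4k)

# Encoder totals are taken from the table, not derived here.
image_encoder, text_blocks = 150_000, 60_000
temperature = 1

total = (image_encoder + proj_head + token_emb + pos_emb
         + text_blocks + proj_head + temperature)
print(total)  # 226769, which rounds to the table's ~230k
```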
The loss: symmetric InfoNCE¶
The full implementation in NumPy (or your Phase 08 autograd, depending on where the curriculum is):
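
    import numpy as np

    def cross_entropy(logits, labels):
        """Mean softmax cross-entropy; logits (N, C), labels (N,) integer class IDs."""
        shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    def symmetric_infonce(z_I, z_T, tau):
        """
        z_I, z_T: (N, d), each row L2-normalized.
        tau: float, temperature.
        Returns scalar loss.
        """
        N = z_I.shape[0]
        logits = z_I @ z_T.T / tau                    # (N, N)
        labels = np.arange(N)
        # cross_entropy reduces with mean
        loss_i2t = cross_entropy(logits, labels)      # i-th row → label i
        loss_t2i = cross_entropy(logits.T, labels)    # i-th col → label i
        return (loss_i2t + loss_t2i) / 2
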
Two interpretations of the same matrix:
- Row \(i\): which text matches image \(i\)? Softmax over all \(N\) texts in the batch. Ground truth is index \(i\).
- Column \(j\): which image matches text \(j\)? Softmax over all \(N\) images in the batch. Ground truth is index \(j\).
Symmetric loss = average of both.
Why batch size matters¶
The "negatives" for image \(i\) are all texts \(j \ne i\) in the batch. With batch 32, you have 31 negatives. CLIP-the-paper used batch 32,768 → 32,767 negatives.
For our toy problem (small dataset, easy task), 31 negatives is enough. Do not try to make this lab match CLIP scale.
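To make the role of the negatives concrete, the image-to-text term for row \(i\) can be written out as below, using \(s_{ij}\) (notation introduced here for illustration) for the cosine similarity between image \(i\) and text \(j\):

\[
\ell_i = -\log \frac{\exp(s_{ii}/\tau)}{\exp(s_{ii}/\tau) + \sum_{j \ne i} \exp(s_{ij}/\tau)}
\]

The sum in the denominator has \(N - 1\) terms, one per negative. If all logits are equal (roughly the situation for an untrained model), \(\ell_i = \log N\): about 3.47 for \(N = 32\) and about 10.40 for \(N = 32{,}768\). That value is a useful reference point for the first logged loss.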
The temperature¶
CLIP learns its temperature: it trains a logit scale \(1/\tau\), parameterized in log space and initialized to \(1/0.07 \approx 14.29\) (so logits are scaled up by ~14 and the softmax produces sharper distributions). The scale is clamped so it cannot exceed 100 (\(\ln 100 \approx 4.6\) in log space), which keeps training stable.
For our toy problem, fix \(\tau = 0.1\) (so \(1/\tau = 10\)) and don't make it learnable. One less hyperparameter to debug.
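The two options side by side, as a sketch. The variable names are illustrative; only the constants (0.07, the cap of 100, and this lab's \(\tau = 0.1\)) come from the text above.

```python
import numpy as np

# CLIP-style parameterization: learn log(1/tau), clamp the resulting scale at 100.
log_scale = np.log(1.0 / 0.07)            # the learnable parameter, about 2.659 at init
scale = min(np.exp(log_scale), 100.0)     # about 14.29 at init

# This lab's simpler choice: a fixed temperature with no gradient.
tau = 0.1
fixed_scale = 1.0 / tau                   # 10.0

def logits_from(z_I, z_T, scale):
    """Similarity matrix multiplied by the inverse temperature."""
    return scale * (z_I @ z_T.T)
```

Learning the scale in log space keeps it positive without an explicit constraint; fixing it removes one thing that can go wrong while debugging.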
TODOs¶
Block A — design¶
- `BLUEPRINT.md` first, before code. Include:
    - Architecture diagram with shapes.
    - Param count to within 10% of estimate.
    - Train/test split policy. Critical: hold out 4 of the 20 verbs entirely from training, so test set has unseen verbs. This forces the model to learn compositional structure (color = tense, shape = person, pattern = verb) rather than memorize.
    - Hyperparameters: batch size 32, LR 1e-3 with cosine, n_epochs 30, tau 0.1.
    - Test plan: top-1 and top-5 retrieval in both directions, on held-out verbs.
Pause for approval.
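One way the verb-level hold-out could look, as a sketch for the blueprint. The icon record schema (a dict with a `"verb"` key), the seed, and the function names are hypothetical.

```python
import random

def split_verbs(verbs, n_holdout=4, seed=0):
    """Hold out n_holdout verbs entirely; return (train_verbs, heldout_verbs)."""
    shuffled = sorted(verbs)                  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled[n_holdout:]), sorted(shuffled[:n_holdout])

def split_icons(icons, heldout_verbs):
    """Partition icon records (dicts with a 'verb' key, a hypothetical schema)."""
    held = set(heldout_verbs)
    train = [ic for ic in icons if ic["verb"] not in held]
    test = [ic for ic in icons if ic["verb"] in held]
    return train, test

verbs = [f"verb{i:02d}" for i in range(20)]   # placeholder names for the 20 verbs
train_verbs, heldout_verbs = split_verbs(verbs)
```

Splitting by verb rather than by icon is what guarantees that no held-out verb leaks into training through a different tense or person.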
Block B — implement encoders¶
- `image_encoder.py`. Subclass / wrap your lab 00 ViT. Drop the classification head; expose `forward(image) → z_I (B, d)`.
- `text_encoder.py`. New tiny transformer (2 blocks, d_model=64, 4 heads). Embedding + position + blocks + take [EOS]. Reuse Phase 17 transformer block.
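A sketch of the "take [EOS]" step in NumPy. It assumes sequences are right-padded after `<EOS>`, so the [EOS] position varies per row and has to be looked up rather than read from the last index. The ID layout and the name `pool_eos` are hypothetical.

```python
import numpy as np

def pool_eos(hidden, token_ids, eos_id):
    """
    hidden:    (B, T, d) transformer outputs.
    token_ids: (B, T) integer IDs, right-padded after <EOS>.
    Returns (B, d): the hidden state at each sequence's first <EOS>.
    """
    # argmax on a boolean array returns the first True; it returns 0 if a row
    # has no <EOS>, so callers should make sure every row contains one.
    eos_pos = np.argmax(token_ids == eos_id, axis=1)        # (B,)
    return hidden[np.arange(hidden.shape[0]), eos_pos]      # (B, d)

# IDs follow a hypothetical layout: 0=<PAD>, 1=<BOS>, 2=<EOS>.
token_ids = np.array([[1, 4, 5, 2, 0, 0, 0, 0],
                      [1, 6, 7, 8, 5, 2, 0, 0]])
hidden = np.arange(2 * 8 * 3, dtype=float).reshape(2, 8, 3)
pooled = pool_eos(hidden, token_ids, eos_id=2)   # rows taken from positions 3 and 5
```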
Block C — CLIP model + loss¶
- `clip_model.py`. Wraps both encoders + projection heads + L2-normalize.
- `loss.py`. `symmetric_infonce` with `cross_entropy` helper.
- Unit-test the loss:
    - With \(z_I = z_T\) (perfect alignment, distinct rows), batch 4, \(\tau = 0.1\): loss should equal the mean of \(-\log\) of the diagonal softmax probabilities, which is ≈ 0. (Not exactly 0 because the diagonal isn't \(+\infty\); but very small.)
    - With \(z_I\) random and \(z_T\) random, batch 4: loss should be near or somewhat above \(\log(4) \approx 1.39\). It equals \(\log 4\) exactly only when all logits are equal; random logits at \(\tau = 0.1\) push the expected loss somewhat higher.
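The two loss checks could be written as below. This is a sketch with its own compact copy of the loss so it runs standalone; in the lab you would import from `loss.py` instead. The orthonormal rows in the first test and the loose band in the second are choices made here, not part of the spec.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over rows."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def symmetric_infonce(z_I, z_T, tau):
    logits = z_I @ z_T.T / tau
    labels = np.arange(z_I.shape[0])
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

def unit_rows(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def test_perfect_alignment():
    z = np.eye(4, 64)   # 4 orthonormal rows, used for both z_I and z_T
    loss = symmetric_infonce(z, z, tau=0.1)
    # Diagonal logit is 10, off-diagonal is 0, so loss = log(1 + 3 e^-10).
    assert np.isclose(loss, np.log1p(3 * np.exp(-10.0)))
    assert loss < 1e-3

def test_random_embeddings():
    rng = np.random.default_rng(0)
    losses = [symmetric_infonce(unit_rows(rng.normal(size=(4, 64))),
                                unit_rows(rng.normal(size=(4, 64))), tau=0.1)
              for _ in range(200)]
    mean = float(np.mean(losses))
    # Averaged over many batches the loss sits above log(4); the band is loose.
    assert np.log(4) < mean < np.log(4) + 1.0

test_perfect_alignment()
test_random_embeddings()
```

Averaging over many random batches makes the second test stable; a single batch of 4 can land well away from \(\log 4\).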
Block D — training loop¶
- `train.py`. Standard loop. Log loss + retrieval-accuracy-on-batch each step.
- Save loss curve as `loss.png`.
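Two small helpers the loop is likely to need, sketched in NumPy: the per-batch retrieval accuracy to log each step, and a cosine learning-rate schedule. The function names, the absence of warmup, and `min_lr=0` are assumptions.

```python
import math
import numpy as np

def batch_retrieval_accuracy(logits):
    """Top-1 accuracy in both directions for an (N, N) logit matrix
    whose correct pairs sit on the diagonal."""
    target = np.arange(logits.shape[0])
    i2t = float((logits.argmax(axis=1) == target).mean())   # each image picks a text
    t2i = float((logits.argmax(axis=0) == target).mean())   # each text picks an image
    return i2t, t2i

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (no warmup, which is an assumption)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

If a batch happens to contain two icons with the same (verb, tense, person) triple, the diagonal-only target undercounts correct retrievals; the batch accuracy is a training signal, not the final metric.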
Block E — eval¶
- `eval.py`. For each held-out icon, compute its embedding + embed all held-out text snippets. Rank by cosine; report top-1 and top-5 accuracy.
- Repeat in text → image direction.
- Acceptance:
    - Top-1 retrieval (image → text) ≥ 50% on held-out verbs. (Random would be ~5%.)
    - Top-5 retrieval ≥ 80%.
    - If you hit these, you've verified the contrastive mechanism works on the grammar domain.
- Save `retrieval_curves.png` (top-k vs k) and `confusion_matrix_top1.png` (predicted vs true class).
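A sketch of the ranking step. Because several snippets describe the same triple, this version scores a query as a hit when any of its top-k candidates shares its class ID; that scoring rule, and the name `topk_accuracy`, are assumptions to confirm in your blueprint.

```python
import numpy as np

def topk_accuracy(query_emb, cand_emb, query_cls, cand_cls, k):
    """
    query_emb: (Q, d), cand_emb: (C, d), rows L2-normalized.
    query_cls: (Q,), cand_cls: (C,) integer class IDs, one per (verb, tense, person).
    A query counts as a hit if any of its k nearest candidates shares its class.
    """
    sims = query_emb @ cand_emb.T                      # cosine similarity, (Q, C)
    topk = np.argsort(-sims, axis=1)[:, :k]            # indices of the k best candidates
    hits = (cand_cls[topk] == query_cls[:, None]).any(axis=1)
    return float(hits.mean())
```

For image → text, pass icon embeddings as queries and snippet embeddings as candidates; swap the arguments for text → image. Sweeping `k` gives the data for the top-k vs k curve.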
Block F — stretch goals¶
- Replace symmetric softmax InfoNCE with SigLIP-style sigmoid loss. Compare convergence speed and final retrieval at batch 32. SigLIP should win at this small batch size.
- Probe the modality gap from `theory/00-motivation.md`: compute centroid of all \(z_I\) embeddings, centroid of all \(z_T\) embeddings, measure cosine distance. Report. Compare before and after training. (Expectation: at init it's near 0.5–1.0; after training it's still > 0 but smaller. Confirms that the modality gap is real on a toy problem too.)
- Visualize the learned embeddings via PCA-to-2D. Save `embeddings_pca.png`. Color image embeddings blue, text red. Do they form two clusters (modality gap) or interleave (no gap)?
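Forward-pass sketches for the first two stretch goals. The sigmoid loss follows the SigLIP formulation (independent binary classification per pair, with a scale `t` and bias `b`); here both are fixed at SigLIP's usual initial values instead of being learned, which is a simplification. The function names are illustrative.

```python
import numpy as np

def sigmoid_pair_loss(z_I, z_T, t=10.0, b=-10.0):
    """
    SigLIP-style loss: every (image, text) pair is an independent binary
    classification. Labels are +1 on the diagonal and -1 elsewhere.
    t (scale) and b (bias) are learnable in SigLIP; fixed here for the sketch.
    """
    n = z_I.shape[0]
    logits = t * (z_I @ z_T.T) + b
    signs = 2.0 * np.eye(n) - 1.0
    # -log sigmoid(x) = log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.logaddexp(0.0, -signs * logits).sum() / n)

def modality_gap(z_I, z_T):
    """Cosine distance between the centroids of the two embedding sets."""
    c_I, c_T = z_I.mean(axis=0), z_T.mean(axis=0)
    cos = c_I @ c_T / (np.linalg.norm(c_I) * np.linalg.norm(c_T))
    return float(1.0 - cos)
```

Call `modality_gap` on the full set of embeddings once at initialization and once after training to get the before/after comparison.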
Acceptance criteria¶
- `BLUEPRINT.md` approved.
- Image encoder reused from lab 00 (no copy-paste).
- Text encoder reuses Phase 17 transformer block.
- `symmetric_infonce` matches the spec; unit tests pass.
- Top-1 retrieval ≥ 50%, top-5 ≥ 80% on held-out verbs.
- Training ≤ 30 minutes wall-clock on CPU.
- `README.md`, `manifest.json`, all artifacts committed.
What this lab is intentionally NOT¶
- Not a CLIP reproduction. Real CLIP needs orders of magnitude more data and batch size. We verify the mechanism.
- Not zero-shot ImageNet classification. That requires a model that's seen real-world images, which we don't.
- Not a comparison with non-contrastive losses (e.g. caption-generation cross-entropy). Could be a stretch goal.
Debugging hints¶
If top-1 retrieval is at chance (~5%) after several epochs:
- Check L2-normalization. After normalize, `np.linalg.norm(z_I, axis=1)` (or your autograd's equivalent) should be all 1. If not, you're not on a unit sphere; the cosine interpretation fails.
- Check loss decreasing. If flat, the gradient might not flow. Print the gradient norm of each encoder's first layer. Should be similar magnitude. If image grad norm is 100× text grad norm, you have a scale problem: likely the L2-normalize isn't differentiable, or you forgot to backprop through it.
- Check the diagonal. Print `logits` before softmax for a small batch. The diagonal should grow more positive over training; off-diagonal should grow more negative.
- Held-out verbs are too hard. Try the eval first on in-distribution test split (icons from training verbs but unseen samples). If you hit > 80% there but fail on held-out, the model is memorizing the patterns of training verbs and not generalizing. That's still a valid finding; note it in the report.
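The forward-pass checks above (norms, diagonal vs off-diagonal logits) can be bundled into one helper to call every few hundred steps. This is a sketch; the name `diagnose_batch` and the returned keys are hypothetical, and the gradient-norm check is left out because it depends on your autograd.

```python
import numpy as np

def diagnose_batch(z_I, z_T, tau=0.1):
    """Cheap checks for one batch of (supposedly L2-normalized) embeddings."""
    logits = z_I @ z_T.T / tau
    n = logits.shape[0]
    off_diag = logits[~np.eye(n, dtype=bool)]      # all entries except the diagonal
    return {
        "max_norm_error_I": float(np.abs(np.linalg.norm(z_I, axis=1) - 1).max()),
        "max_norm_error_T": float(np.abs(np.linalg.norm(z_T, axis=1) - 1).max()),
        "diag_mean": float(np.diag(logits).mean()),
        "off_diag_mean": float(off_diag.mean()),
    }
```

A healthy run shows both norm errors near zero and a widening margin between `diag_mean` and `off_diag_mean` as training proceeds.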
If the modality gap (stretch goal) is huge after training (> 0.5 cosine distance between centroids):
- Try training longer.
- Try a smaller \(\tau\) (sharper softmax, more contrastive pressure).
- Try shared projection (the same linear layer projects both modalities). Some CLIP variants do this; it forces alignment in the projection space.
What you'll have learned¶
- Symmetric InfoNCE in practice: the symmetric softmax over a similarity matrix.
- The batch size dependence of contrastive loss: more negatives = harder loss = better representations. At small batch, contrastive learning is weak.
- The modality gap as a measurable phenomenon, not just a paper claim.
- That building a "CLIP-style model" mechanically is easy — three components (encoder, encoder, contrastive loss). The hard part is the data and the scale.
Next lab: lab/02-whisper-inference-walkthrough.md switches to audio — load Whisper-tiny, run inference on prerecorded clips of English verb forms, inspect the internal attention and timestamp logits.