Lab 01 — CLIP-style Grammar Image-Text Pairing

Goal: build a minimal CLIP-style contrastive trainer. Reuse the lab 00 ViT (without classification head) as the image encoder; build a tiny text encoder (Phase 17 transformer in encoder mode) for verb-form text snippets; train with symmetric InfoNCE on (icon, text) pairs; benchmark top-k retrieval.

We implement CLIP's contrastive training loop on our grammar domain. The key metric: given an icon image, is the correct text snippet retrieved in the top-k? And vice versa. This verifies that the two modalities align in the same embedding space.

Estimated time: 4–5 hours.

Prereqs:

  • Lab 00 complete (you have a working ViT).
  • docs/extension-track/X2-multimodal/theory/01-vision-transformers.md §"CLIP" understood, specifically the symmetric InfoNCE loss.


What you produce

A directory experiments/X2-multimodal/lab-01-clip-grammar/ containing:

  • BLUEPRINT.md
  • dataset.py — pairs each icon from lab 00 with a text snippet (the verb form it represents).
  • image_encoder.py — lab 00's ViT, classification head removed; outputs the [CLS] representation (64-dim).
  • text_encoder.py — tiny transformer over BPE/word tokens of the verb-form snippet; outputs the [EOS] representation (64-dim).
  • clip_model.py — wraps both encoders + L2-normalize + temperature scalar.
  • loss.py — symmetric_infonce.
  • train.py.
  • eval.py — top-1, top-5 retrieval in both directions.
  • manifest.json.
  • README.md.
  • loss.png, retrieval_curves.png, confusion_matrix_top1.png.

The dataset

We reuse the 1000 icons from lab 00. For each icon, we generate a small set of equivalent text snippets describing the same (verb, tense, person) triple. Examples for the icon encoding (verb=work, tense=present_simple, person=3sg):

"he works"
"she works"
"it works"
"3rd person singular present simple of work"
"works"

For each icon, sample 1 text snippet per training step (data augmentation). Total pairs per epoch ≈ 1000.

Text vocabulary

Don't use BPE here — overkill. Build a small word-level vocabulary from the corpus:

  • 20 verbs (work, play, walk, ...) → 20 tokens.
  • Inflected forms (works, worked, working, ...) → ~80 tokens (4-5 forms × 20 verbs).
  • Pronouns: I, you, he, she, it → 5 tokens.
  • Tense words: present, past, simple, participle, future → 5 tokens.
  • Auxiliary: will, going, to → 3 tokens.
  • Glue: of, the, person, singular, 3rd, 2nd, 1st → 7 tokens.
  • Special: <BOS>, <EOS>, <PAD>, <UNK> → 4 tokens.

Total: ~124 tokens. Tiny vocabulary, tractable embedding table.

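Tokenize snippets: lowercase, split on whitespace, map each word to its vocab ID (unknown words map to <UNK>), wrap the result in <BOS> and <EOS>, then pad with <PAD> to length 8. Two things to watch. First, the longest example above ("3rd person singular present simple of work") is 7 words, so it needs 9 positions with both markers; truncate it or raise the maximum length (and the position-embedding table) to match. Second, after padding, <EOS> is not always at the last index, so the text encoder has to read the position where <EOS> actually sits.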
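The tokenization steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the lab's reference solution: the helper names `build_vocab` and `tokenize`, the ordering of the special tokens, and the choice to truncate over-long snippets are all assumptions made for the example.

```python
# Hypothetical helpers. Only the four special tokens and the
# lowercase / whitespace-split rule come from the lab text.
SPECIALS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]

def build_vocab(snippets):
    """Map every lowercase word in the corpus to an integer ID."""
    words = sorted({w for s in snippets for w in s.lower().split()})
    return {tok: i for i, tok in enumerate(SPECIALS + words)}

def tokenize(snippet, vocab, max_len=8):
    """Return exactly max_len IDs: <BOS> words... <EOS>, then <PAD>."""
    unk = vocab["<UNK>"]
    ids = [vocab.get(w, unk) for w in snippet.lower().split()]
    ids = ids[: max_len - 2]          # assumption: truncate long snippets
    ids = [vocab["<BOS>"]] + ids + [vocab["<EOS>"]]
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))

corpus = ["he works", "she works", "it works", "works"]
vocab = build_vocab(corpus)
ids = tokenize("He works", vocab)
eos_pos = ids.index(vocab["<EOS>"])   # where to read the text embedding
```

Tracking `eos_pos` per snippet is one way to pick the [EOS] representation when shorter snippets are followed by padding.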

Why these snippets

  • Multiple snippets per (verb, tense, person) → augmentation. Model can't memorize "icon X = snippet Y"; it must learn the structure (color → tense, shape → person, pattern → verb).
  • Snippet variety mimics CLIP's real-world setting where captions vary in style.

Architecture: the CLIP model

┌───────── image branch ─────────┐    ┌──────── text branch ────────┐
icon (32, 32, 3)                       tokens (T,) — e.g. 8 tokens
  ↓ patchify (P=4)                       ↓ embedding lookup
patches (64, 48)                         token emb (8, 64)
  ↓ linear proj                          ↓ + learned pos emb
patch tokens (64, 64)                  embedded (8, 64)
  ↓ prepend [CLS]                        ↓ × 2 transformer blocks
tokens (65, 64)                        output (8, 64)
  ↓ + pos emb                            ↓ take [EOS] (last position)
  ↓ × 4 transformer blocks                z_T (64,)
output (65, 64)                          ↓ proj_T (Linear 64 → 64)
  ↓ take [CLS]                          z_T (64,)
z_I (64,)                                ↓ L2 normalize
  ↓ proj_I (Linear 64 → 64)            z_T_norm (64,)
z_I (64,)
  ↓ L2 normalize
z_I_norm (64,)
└─────────────────────────────┘    └────────────────────────────┘
                  ↓                                 ↓
                   logits = z_I_norm @ z_T_norm.T / tau    (N × N)
                  Symmetric InfoNCE — cross-entropy on diagonal, both directions

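Note the L2-normalization and the temperature. Normalization makes the dot product a cosine similarity, bounded in \([-1, 1]\). Dividing by \(\tau\) then stretches those bounded values into logits with enough range for the softmax inside InfoNCE to become sharp and produce useful gradients.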
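A small NumPy sketch of the last two stages of the diagram, using random stand-ins for the encoder outputs. The helper name `l2_normalize` and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Scale each row of z to unit Euclidean length."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, eps)

rng = np.random.default_rng(0)
z_I = l2_normalize(rng.normal(size=(4, 64)))   # stand-in image embeddings
z_T = l2_normalize(rng.normal(size=(4, 64)))   # stand-in text embeddings

tau = 0.1
cosine = z_I @ z_T.T      # (4, 4), every entry lies in [-1, 1]
logits = cosine / tau     # (4, 4), every entry lies in [-1/tau, 1/tau]
```

With \(\tau = 0.1\) the logits can span at most \([-10, 10]\), which is the range the softmax in the loss gets to work with.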

Param counts

| Component | Params (approx) |
|---|---|
| Image encoder (lab 00 ViT) | 150k |
| Image projection head | 4k |
| Text token embedding (124 × 64) | 8k |
| Text position embedding (8 × 64) | 512 |
| Text transformer (2 blocks) | 60k |
| Text projection head | 4k |
| Temperature τ (scalar) | 1 |
| Total | ~230k |

CPU-trainable. ~1 minute per epoch on the i5-8250U at batch size 32.
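The smaller rows of the param-count table follow directly from the shapes, which makes a quick sanity check for the BLUEPRINT estimate. The `block_params` formula below is an assumption: it models a standard transformer block with four \(d \times d\) attention matrices and a two-layer MLP of hidden width \(r \cdot d\), ignoring biases and LayerNorm parameters. Your Phase 17 block may differ.

```python
vocab_size, max_len, d = 124, 8, 64

token_emb = vocab_size * d     # 7936, the "8k" row
pos_emb = max_len * d          # 512
proj_weight = d * d            # 4096, the "4k" rows (a bias adds d more)

def block_params(d, r):
    """Assumed block: Q, K, V, out projections plus an MLP of width r * d."""
    attention = 4 * d * d
    mlp = 2 * r * d * d
    return attention + mlp

two_blocks_r2 = 2 * block_params(d, 2)   # 65536
two_blocks_r4 = 2 * block_params(d, 4)   # 98304
```

Under that assumption, the ~60k figure for the text transformer is closer to an MLP ratio of 2 than to the common ratio of 4, so check which ratio your block uses before comparing against the 10% tolerance.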


The loss: symmetric InfoNCE

The full implementation in NumPy (or your Phase 08 autograd, depending on where the curriculum is):

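import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy. logits: (N, C); labels: (N,) integer class IDs."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def symmetric_infonce(z_I, z_T, tau):
    """
    z_I, z_T: (N, d), each row L2-normalized.
    tau: float, temperature.
    Returns scalar loss.
    """
    N = z_I.shape[0]
    logits = z_I @ z_T.T / tau          # (N, N)
    labels = np.arange(N)               # matching pairs sit on the diagonal
    # cross_entropy reduces with mean
    loss_i2t = cross_entropy(logits, labels)        # i-th row → label i
    loss_t2i = cross_entropy(logits.T, labels)      # i-th col → label i
    return (loss_i2t + loss_t2i) / 2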

Two interpretations of the same matrix:

  • Row \(i\): which text matches image \(i\)? Softmax over all \(N\) texts in the batch. Ground truth is index \(i\).
  • Column \(j\): which image matches text \(j\)? Softmax over all \(N\) images in the batch. Ground truth is index \(j\).

Symmetric loss = average of both.

Why batch size matters

The "negatives" for image \(i\) are all texts \(j \ne i\) in the batch. With batch 32, you have 31 negatives. CLIP-the-paper used batch 32,768 → 32,767 negatives.

For our toy problem (small dataset, easy task), 31 negatives is enough. Do not try to make this lab match CLIP scale.

The temperature

CLIP learns its temperature. In the paper's notation \(\tau\) is initialized to 0.07, so the logit scale \(1/\tau\) starts at \(1/0.07 \approx 14.29\) (logits are scaled up by ~14, then softmax produces sharper distributions). The scale is learned in log space and clamped so that it can't exceed 100 (\(\ln 100 \approx 4.6\)), which prevents training instability.

For our toy problem, fix \(\tau = 0.1\) (so \(1/\tau = 10\)) and don't make it learnable. One less hyperparameter to debug.
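To see what the temperature does, here is a small sketch that applies a softmax to the same four cosine similarities at three temperatures. The similarity values are made up for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

cos_sims = np.array([0.9, 0.5, 0.1, -0.3])   # one image vs. four texts

p_tau_1 = softmax(cos_sims / 1.0)      # tau = 1: nearly flat
p_tau_01 = softmax(cos_sims / 0.1)     # tau = 0.1: this lab's setting
p_tau_007 = softmax(cos_sims / 0.07)   # tau = 0.07: CLIP's initial value
```

At \(\tau = 1\) the best match gets less than half the probability mass even though its cosine is 0.9. At \(\tau = 0.1\) it gets almost all of it, and a smaller \(\tau\) sharpens the distribution further.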


TODOs

Block A — design

  • BLUEPRINT.md first, before code. Include:
  • Architecture diagram with shapes.
  • Param count to within 10% of estimate.
  • Train/test split policy. Critical: hold out 4 of the 20 verbs entirely from training, so test set has unseen verbs. This forces the model to learn compositional structure (color = tense, shape = person, pattern = verb) rather than memorize.
  • Hyperparameters: batch size 32, LR 1e-3 with cosine, n_epochs 30, tau 0.1.
  • Test plan: top-1 and top-5 retrieval in both directions, on held-out verbs.
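One way to implement the held-out-verb split is to partition by verb name before any batching happens. The record schema (dicts with `verb`, `tense`, `person` keys), the function name `split_by_verb`, and the toy metadata are assumptions for this sketch; adapt them to however lab 00 stores its icon labels.

```python
import random

def split_by_verb(records, n_holdout=4, seed=0):
    """Hold out every record whose verb is in a randomly chosen subset."""
    verbs = sorted({r["verb"] for r in records})
    rng = random.Random(seed)                 # fixed seed: reproducible split
    held_out = set(rng.sample(verbs, n_holdout))
    train = [r for r in records if r["verb"] not in held_out]
    test = [r for r in records if r["verb"] in held_out]
    return train, test, held_out

# Toy stand-in for the icon metadata: 20 verbs x 2 tenses x 3 persons.
verbs = [f"verb{i:02d}" for i in range(20)]
records = [
    {"verb": v, "tense": t, "person": p}
    for v in verbs
    for t in ("present_simple", "past_simple")
    for p in ("1sg", "2sg", "3sg")
]
train, test, held_out = split_by_verb(records)
```

Recording `held_out` and the seed in manifest.json keeps the split reproducible across runs.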

Pause for approval.

Block B — implement encoders

  • image_encoder.py. Subclass / wrap your lab 00 ViT. Drop the classification head; expose forward(image) → z_I (B, d).
  • text_encoder.py. New tiny transformer (2 blocks, d_model=64, 4 heads). Embedding + position + blocks + take [EOS]. Reuse Phase 17 transformer block.

Block C — CLIP model + loss

  • clip_model.py. Wraps both encoders + projection heads + L2-normalize.
  • loss.py. symmetric_infonce with cross_entropy helper.
  • Unit-test the loss:
  • With \(z_I = z_T\) (perfect alignment), batch 4, \(\tau = 0.1\): loss should equal \(-\log(\text{softmax-of-diag})\) ≈ 0. (Not exactly 0 because the diagonal isn't \(+\infty\); but very small.)
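  • With \(z_I\) random and \(z_T\) random, batch 4: loss should be at or somewhat above \(\log(4) \approx 1.39\) (uniform over 4 classes). The match is tight when the logits are near zero (e.g. \(\tau = 1\)); at \(\tau = 0.1\) the random logits spread out and the expected loss rises above \(\log 4\), so use a loose tolerance or average over several draws.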
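The two unit tests can be sketched as follows. The block restates the loss so it runs on its own; `random_unit_rows` is a hypothetical helper, and the thresholds are loose choices for illustration, not part of the spec. The random case uses \(\tau = 1\) and averages over many draws, since that keeps the logits close to zero and the expected loss close to \(\log 4\).

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the labelled column in each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def symmetric_infonce(z_I, z_T, tau):
    logits = z_I @ z_T.T / tau
    labels = np.arange(z_I.shape[0])
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

def random_unit_rows(rng, n, d):
    """Hypothetical helper: n random rows of dimension d, L2-normalized."""
    z = rng.normal(size=(n, d))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Case 1: perfect alignment. Diagonal logits are 1/tau = 10.
z = random_unit_rows(rng, 4, 64)
loss_aligned = symmetric_infonce(z, z, tau=0.1)

# Case 2: unrelated embeddings, averaged over many draws to cut the noise.
losses = [
    symmetric_infonce(random_unit_rows(rng, 4, 64),
                      random_unit_rows(rng, 4, 64), tau=1.0)
    for _ in range(200)
]
loss_random = float(np.mean(losses))
```

A reasonable pair of checks is that `loss_aligned` is positive but well below 0.05, and that `loss_random` lands within a few hundredths of `np.log(4)`.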

Block D — training loop

  • train.py. Standard loop. Log loss + retrieval-accuracy-on-batch each step.
  • Save loss curve as loss.png.
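For the "LR 1e-3 with cosine" setting in the blueprint, a minimal schedule might look like the sketch below. The function name `cosine_lr`, the absence of warmup, and the floor of zero are assumptions; the step count uses the ≈1000 pairs per epoch quoted earlier, which shrinks once verbs are held out.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

n_pairs, batch_size, n_epochs = 1000, 32, 30
steps_per_epoch = math.ceil(n_pairs / batch_size)   # 32, last batch is partial
total_steps = n_epochs * steps_per_epoch            # 960 optimizer steps

schedule = [cosine_lr(s, total_steps) for s in range(total_steps + 1)]
```

Inside the loop you would look up `cosine_lr(step, total_steps)` before each optimizer update and log it next to the loss.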

Block E — eval

  • eval.py. For each held-out icon, compute its embedding + embed all held-out text snippets. Rank by cosine; report top-1 and top-5 accuracy.
  • Repeat in text → image direction.
  • Acceptance:
  • Top-1 retrieval (image → text) ≥ 50% on held-out verbs. (Random would be ~5%.)
  • Top-5 retrieval ≥ 80%.
  • If you hit these, you've verified the contrastive mechanism works on the grammar domain.
  • Save retrieval_curves.png (top-k vs k) and confusion_matrix_top1.png (predicted vs true class).
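The retrieval metric can be sketched as below. Because several text snippets describe the same (verb, tense, person) triple, this version counts a hit when any candidate of the query's class appears in the top k. That scoring rule, the function name `topk_accuracy`, and the integer class IDs are assumptions for the example.

```python
import numpy as np

def topk_accuracy(queries, candidates, query_cls, cand_cls, k):
    """Fraction of queries with a same-class candidate among the top k.

    queries: (Q, d) and candidates: (C, d), rows L2-normalized.
    query_cls: (Q,) and cand_cls: (C,) integer class IDs.
    """
    sims = queries @ candidates.T               # cosine similarities (Q, C)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the best k
    hits = (cand_cls[topk] == query_cls[:, None]).any(axis=1)
    return float(hits.mean())

# Tiny hand-built check: 3 classes along the axes of a 3-d space.
cands = np.eye(3)
cand_cls = np.array([0, 1, 2])
queries = np.array([
    [0.9, 0.1, 0.0],    # closest to class 0, true class 0
    [0.6, 0.8, 0.0],    # closest to class 1, true class 1
    [0.8, 0.0, 0.6],    # closest to class 0, true class 2
])
queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
query_cls = np.array([0, 1, 2])

top1 = topk_accuracy(queries, cands, query_cls, cand_cls, k=1)
top2 = topk_accuracy(queries, cands, query_cls, cand_cls, k=2)
```

In this toy case two of the three queries are right at k=1 and all three at k=2. For the text → image direction, swap the roles of the two embedding sets.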

Block F — stretch goals

  • Replace symmetric softmax InfoNCE with SigLIP-style sigmoid loss. Compare convergence speed and final retrieval at batch 32. SigLIP should win at this small batch size.
  • Probe the modality gap from theory/00-motivation.md: compute centroid of all \(z_I\) embeddings, centroid of all \(z_T\) embeddings, measure cosine distance. Report. Compare before and after training. (Expectation: at init it's near 0.5–1.0; after training it's still > 0 but smaller. Confirms that the modality gap is real on a toy problem too.)
  • Visualize the learned embeddings via PCA-to-2D. Save embeddings_pca.png. Color image embeddings blue, text red. Do they form two clusters (modality gap) or interleave (no gap)?
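The modality-gap probe and the PCA projection need only a few lines of NumPy. The sketch below runs on synthetic embeddings with an artificial offset between the two modalities, so the numbers say nothing about what your trained model will show. The function names are hypothetical.

```python
import numpy as np

def centroid_cosine_distance(z_I, z_T):
    """1 - cosine similarity between the mean image and mean text embedding."""
    c_I, c_T = z_I.mean(axis=0), z_T.mean(axis=0)
    cos = c_I @ c_T / (np.linalg.norm(c_I) * np.linalg.norm(c_T))
    return float(1.0 - cos)

def pca_2d(z):
    """Project rows of z onto their top two principal components via SVD."""
    centered = z - z.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
offset = np.zeros(64)
offset[0] = 3.0                                 # synthetic gap along one axis
z_I = rng.normal(size=(100, 64)) + offset       # stand-in image embeddings
z_T = rng.normal(size=(100, 64)) - offset       # stand-in text embeddings
z_I /= np.linalg.norm(z_I, axis=1, keepdims=True)
z_T /= np.linalg.norm(z_T, axis=1, keepdims=True)

gap = centroid_cosine_distance(z_I, z_T)
coords = pca_2d(np.vstack([z_I, z_T]))          # (200, 2); rows 0..99 are images
```

Scatter `coords[:100]` in blue and `coords[100:]` in red to produce embeddings_pca.png. With this synthetic offset the two colors separate along the first component, which is what a modality gap looks like.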

Acceptance criteria

  1. BLUEPRINT.md approved.
  2. Image encoder reused from lab 00 (no copy-paste).
  3. Text encoder reuses Phase 17 transformer block.
  4. symmetric_infonce matches the spec; unit tests pass.
  5. Top-1 retrieval ≥ 50%, top-5 ≥ 80% on held-out verbs.
  6. Training ≤ 30 minutes wall-clock on CPU.
  7. README.md, manifest.json, all artifacts committed.

What this lab is intentionally NOT

  • Not a CLIP reproduction. Real CLIP needs orders of magnitude more data and batch size. We verify the mechanism.
  • Not zero-shot ImageNet classification. That requires a model that's seen real-world images, which we don't.
  • Not a comparison with non-contrastive losses (e.g. caption-generation cross-entropy). Could be a stretch goal.

Debugging hints

If top-1 retrieval is at chance (~5%) after several epochs:

  1. Check L2-normalization. After normalize, np.linalg.norm(z_I, axis=1) should be all 1 (same for z_T). If not, you're not on a unit sphere; the cosine interpretation fails.
  2. Check loss decreasing. If flat, the gradient might not flow. Print the gradient norm of each encoder's first layer. Should be similar magnitude. If image grad norm is 100× text grad norm, you have a scale problem — likely the L2-normalize isn't differentiable / you forgot to backprop through it.
  3. Check the diagonal. Print logits before softmax for a small batch. The diagonal should grow more positive over training; off-diagonal should grow more negative.
  4. Held-out verbs are too hard. Try the eval first on in-distribution test split (icons from training verbs but unseen samples). If you hit > 80% there but fail on held-out, the model is memorizing the patterns of training verbs and not generalizing. That's still a valid finding — note it in the report.

If the modality gap (stretch goal) is huge after training (> 0.5 cosine distance between centroids):

  • Try training longer.
  • Try a smaller \(\tau\) (sharper softmax, more contrastive pressure).
  • Try shared projection (the same linear layer projects both modalities). Some CLIP variants do this; it forces alignment in the projection space.

What you'll have learned

  • Symmetric InfoNCE in practice: the symmetric softmax over a similarity matrix.
  • The batch size dependence of contrastive loss: more negatives = harder loss = better representations. At small batch, contrastive learning is weak.
  • The modality gap as a measurable phenomenon, not just a paper claim.
  • That building a "CLIP-style model" mechanically is easy — three components (encoder, encoder, contrastive loss). The hard part is the data and the scale.

Next lab: lab/02-whisper-inference-walkthrough.md switches to audio — load Whisper-tiny, run inference on prerecorded clips of English verb forms, inspect the internal attention and timestamp logits.