
Lab 00 — Tiny ViT on Grammar Icons

Goal: build a 4-block Vision Transformer on top of the Phase 17 transformer block, train it on a synthetic dataset of "grammar icons" (each verb-form rendered as a small colored image), and achieve ≥ 80% top-1 accuracy on the held-out test set in ≤ 5 minutes of CPU training.

We apply the Phase 17 transformer block to a new modality, images, to see that a transformer is modality-agnostic: the only thing that changes is the embedding layer. We implement patchify + position embedding + transformer block + classification head.

Estimated time: 3–4 hours.

Prereqs:
- docs/extension-track/X2-multimodal/theory/01-vision-transformers.md read.
- Phase 17 transformer block implementation (src/minimodel/transformer/) complete.
- Python 3.11 + uv env with NumPy, Pillow.

What you produce

A directory experiments/X2-multimodal/lab-00-tiny-vit/ containing:

BLUEPRINT.md — your design before any code. Patchify shape, ViT-tiny config, training loop sketch, test plan. Must be approved by user before code.
generate_icons.py — Pillow script that produces 1000 synthetic icons.
data/icons/ — 1000 PNG files: verb_<verb>_tense_<tense>_person_<person>_<idx>.png.
data/labels.json — flat list of {filename, verb_id, tense_id, person_id, class_id}.
vit.py — your ViT implementation (importing TransformerBlock from src/minimodel/transformer/).
train.py — training script.
eval.py — eval script with confusion-matrix output.
manifest.json — versions, seed, config (see CLAUDE.md §0.5).
README.md — 1–2 paragraphs: what, results, surprises.
loss.png, accuracy.png, confusion_matrix.png — committed artifacts.

The dataset

We render one synthetic icon per (verb, tense, person, sample-index) combination. The icon encodes:

Color → tense.
Infinitive: gray. Present-simple: blue. Past-simple: red. Past-participle: green. Future: purple.
Shape → person.
1st-sg (I): circle. 2nd-sg (you): square. 3rd-sg (he/she/it): triangle.
Pattern inside the shape → verb identity. Use one of 20 distinct internal patterns (alternating stripes, dots, crosshatch, etc.) tied to the 20 §A13 verbs.

The model's task: given a \(32 \times 32\) icon, classify which (verb, tense, person) it depicts. Multi-class with 20 × 5 × 3 = 300 classes, but we'll start with a simpler task: just classify tense (5 classes).

Why this dataset

Synthetic. Reproducible bit-for-bit (seed_everything).
Tied to §A13. Stays inside the grammar universe — this is not a "cats vs. dogs" example.
Easy enough for a tiny ViT to learn on CPU. The color signal alone is enough for the tense task; a tiny ViT should hit 90%+. The "all 300 classes" version is a stretch goal.
Reusable in lab 01. Lab 01 uses the same icons paired with text snippets for CLIP-style training.

Icon generation spec

For each (verb, tense, person) triple in §A13's grammar scope (20 × 5 × 3 = 300 combos):

# Pseudocode — fill in details in generate_icons.py
TENSE_COLOR = {
    "infinitive": (128, 128, 128),
    "present_simple": (50, 100, 200),
    "past_simple": (200, 50, 50),
    "past_participle": (50, 200, 100),
    "future": (160, 80, 200),
}
PERSON_SHAPE = {
    "1sg": "circle",
    "2sg": "square",
    "3sg": "triangle",
}
VERB_PATTERN = {
    "work": "stripes_horizontal",
    "play": "stripes_vertical",
    # ... 18 more
}

for verb in VERBS:
    for tense in TENSES:
        for person in PERSONS:
            for idx in range(N_PER_COMBO):  # N_PER_COMBO = 1000 / 300 ≈ 3-4
                img = render_icon(verb, tense, person, idx, jitter=True)
                img.save(...)

Apply random jitter (position, slight rotation, slight color noise) so the model sees variation per (verb, tense, person), forcing it to learn the underlying features rather than memorize.

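Total: ~1000 icons. ~3-4 per combo. Splits: 80% train (800), 20% test (200), stratified so every class of the task you are training appears in both splits. For the tense-only and 15-class tasks this is easy. For the full 300-class task it is not: 200 test images cannot cover all 300 combos, so expect some combos to be missing from the test split unless you generate more icons per combo.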
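A minimal sketch of the 80/20 stratified split, assuming records shaped like the entries of labels.json. The helper name `stratified_split` and its `key` parameter are hypothetical, and the toy records stand in for the real label file.

```python
import random
from collections import defaultdict

def stratified_split(records, key, test_frac=0.2, seed=0):
    """Split records into (train, test) so each value of `key` keeps ~test_frac in test."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    train, test = [], []
    for label in sorted(groups):              # sorted order keeps the split deterministic
        items = groups[label][:]
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_frac))
        n_test = min(n_test, len(items) - 1)  # never empty the train side of a group
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Toy stand-in for labels.json: 1000 icons, 5 tenses, 200 icons each.
records = [{"filename": f"icon_{i}.png", "tense_id": i % 5} for i in range(1000)]
train, test = stratified_split(records, key="tense_id")
print(len(train), len(test))  # 800 200
```

Passing `key="class_id"` would stratify by whichever label the current task uses, provided that field is present in each record.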

Architecture: tiny ViT

A 4-block ViT, scaled down dramatically from ViT-Base.

Config

Hyperparam	Value	Notes
image_size	32	\(32 \times 32\) RGB
patch_size	4	gives 64 patches
n_classes	5 (tense), or 15 (tense × person), or 300 (full)
d_model	64	tiny
n_heads	4	\(d_k = 16\)
n_blocks	4	reuse Phase 17 block × 4
d_ff	256	4× expansion
dropout	0.1	minimal

Total params: ~150k. Small enough that you can sanity-check by hand-computing layer sizes.
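The hand computation can be sketched as a few lines of arithmetic. This assumes a standard block layout (Q/K/V/output projections and a two-layer feed-forward, all with biases, plus two LayerNorms per block); the Phase 17 block may differ, and `count_params` is a hypothetical helper. Under these assumptions the total lands nearer 208k than 150k, so treat the quoted figure as an order-of-magnitude target and reconcile your own count against what the Phase 17 block actually contains.

```python
def count_params(d_model=64, d_ff=256, n_blocks=4, n_tokens=65,
                 patch_dim=48, n_classes=5):
    """Count trainable parameters for the config table (assumed block layout)."""
    patch_proj = patch_dim * d_model + d_model          # W_E plus bias
    cls_token = d_model                                 # one learnable [CLS] vector
    pos_emb = n_tokens * d_model                        # (65, 64) position table
    attn = 4 * (d_model * d_model + d_model)            # W_Q, W_K, W_V, W_O with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                           # two LayerNorms: scale and shift
    per_block = attn + ffn + norms
    head = d_model * n_classes + n_classes
    total = patch_proj + cls_token + pos_emb + n_blocks * per_block + head
    return {"patch_proj": patch_proj, "cls_token": cls_token, "pos_emb": pos_emb,
            "per_block": per_block, "head": head, "total": total}

counts = count_params()
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Almost all of the parameters sit in the four blocks, so the block's internals (biases, norms, `d_ff`) dominate any difference between your count and the target.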

Forward pass

input image (B, 32, 32, 3)
  ↓ patchify (P=4)
patches (B, 64, 48)   # 4·4·3 = 48
  ↓ linear projection W_E (48 → 64)
patch tokens (B, 64, 64)
  ↓ prepend [CLS] token
tokens (B, 65, 64)
  ↓ + learnable position embedding (65, 64)
tokens (B, 65, 64)
  ↓ × 4 transformer blocks (Phase 17 block, no causal mask)
output (B, 65, 64)
  ↓ take [CLS] (index 0): (B, 64)
  ↓ classification head: Linear(64 → n_classes)
logits (B, n_classes)
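The diagram above can be traced in NumPy to confirm the shapes before any real block is wired in. This is a sketch only: the transformer blocks are replaced by a shape-preserving stand-in, and `forward_shapes` is a hypothetical helper, not part of the lab's required files.

```python
import numpy as np

def forward_shapes(B=2, image_size=32, P=4, d_model=64, n_blocks=4,
                   n_classes=5, seed=0):
    """Trace tensor shapes through the forward pass with stand-in blocks."""
    rng = np.random.default_rng(seed)
    shapes = {}
    x = rng.random((B, image_size, image_size, 3))
    shapes["input"] = x.shape

    g = image_size // P                                   # patches per side (8)
    patches = (x.reshape(B, g, P, g, P, 3)
                 .transpose(0, 1, 3, 2, 4, 5)
                 .reshape(B, g * g, P * P * 3))
    shapes["patches"] = patches.shape

    W_E = rng.normal(0.0, 0.02, (P * P * 3, d_model))     # linear projection 48 -> 64
    tokens = patches @ W_E
    shapes["patch_tokens"] = tokens.shape

    cls = rng.normal(0.0, 0.02, (1, 1, d_model))
    tokens = np.concatenate([np.broadcast_to(cls, (B, 1, d_model)), tokens], axis=1)
    shapes["with_cls"] = tokens.shape

    pos = rng.normal(0.0, 0.02, (g * g + 1, d_model))
    tokens = tokens + pos                                 # (65, 64) broadcasts over batch

    for _ in range(n_blocks):
        tokens = tokens + 0.0                             # stand-in: real blocks keep (B, 65, 64)
    shapes["after_blocks"] = tokens.shape

    cls_out = tokens[:, 0, :]                             # index 0 on the token axis
    shapes["cls"] = cls_out.shape

    W_head = rng.normal(0.0, 0.02, (d_model, n_classes))
    shapes["logits"] = (cls_out @ W_head).shape
    return shapes

shapes = forward_shapes()
for name, shape in shapes.items():
    print(f"{name:>13}: {shape}")
```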

What you reuse from Phase 17

from src.minimodel.transformer.block import TransformerBlock

The Phase 17 block does causal self-attention. For ViT we want bidirectional self-attention (no mask). Two options:

Pass mask=None if your Phase 17 block accepts an optional mask argument.
Add a causal: bool = True flag to the block and override here.

Pick whichever requires fewer changes to Phase 17. If you have to add the flag, commit the change to Phase 17 separately with a note that it's required by X2 lab 00. Do not patch Phase 17 silently.
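To make the second option concrete, here is a sketch of single-head self-attention with a `causal` flag. It is an illustration under assumptions, not the Phase 17 code: the function name, argument order, and single-head layout are all hypothetical.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v, causal=True):
    """Single-head scaled dot-product self-attention on x of shape (B, T, d)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])    # (B, T, T)
    if causal:
        T = x.shape[1]
        future = np.triu(np.ones((T, T), dtype=bool), k=1)      # True above the diagonal
        scores = np.where(future, -np.inf, scores)              # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)        # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

_, w_causal = self_attention(x, W_q, W_k, W_v, causal=True)     # language-model behaviour
_, w_bidir = self_attention(x, W_q, W_k, W_v, causal=False)     # what the ViT needs
print(np.round(w_causal[0], 2))
print(np.round(w_bidir[0], 2))
```

With `causal=False` every patch can attend to every other patch, which matters here because an image has no left-to-right order.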

TODOs

Block A — design (write BLUEPRINT.md first)

In BLUEPRINT.md, before any code:

Sketch the full computational graph with tensor shapes at every step.
List the trainable params + their shape + total count. Sanity-check against ~150k total.
Sketch the training loop: optimizer (SGD with momentum is fine — no Adam needed for 5 minutes of training), LR, batch size, n_epochs.
Sketch the test plan: at least 5 unit tests for vit.py (patchify shape, [CLS] prepending, position emb shape, full forward shape, gradient sanity).
List the alternatives you considered and why you rejected them.

Pause. Get user approval before continuing.

Block B — generate the dataset

In generate_icons.py:

Read src/minimodel/grammar/verbs.py (or wherever §A13's verb list lives) to get the canonical 20 verbs, 5 tenses, 3 persons.
Implement render_icon(verb, tense, person, idx, jitter=True) → returns a (32, 32, 3) uint8 NumPy array. Convert it with PIL.Image.fromarray before calling save.
Loop over all (verb, tense, person, idx) → save as verb_<verb>_tense_<tense>_person_<person>_<idx>.png.
Write labels.json.
Verify: count files, visually inspect 20 randomly-sampled icons (commit to data/icons_sample.png).
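One possible shape for a labels.json record and its `class_id` is sketched below. The index layout (verb-major, then tense, then person) is an assumption, as are the helper names `class_id` and `make_record`; any bijection onto 0..299 would work as long as train.py and eval.py agree on it.

```python
import json

N_TENSES, N_PERSONS = 5, 3

def class_id(verb_id, tense_id, person_id):
    """Map a (verb, tense, person) index triple to one id in [0, 300)."""
    return (verb_id * N_TENSES + tense_id) * N_PERSONS + person_id

def make_record(verb, tense, person, idx, verb_id, tense_id, person_id):
    """Build one labels.json entry using the filename pattern from this lab."""
    return {
        "filename": f"verb_{verb}_tense_{tense}_person_{person}_{idx}.png",
        "verb_id": verb_id,
        "tense_id": tense_id,
        "person_id": person_id,
        "class_id": class_id(verb_id, tense_id, person_id),
    }

record = make_record("work", "past_simple", "3sg", 2,
                     verb_id=0, tense_id=2, person_id=2)
print(json.dumps(record, indent=2))
```

For the tense-only task the training label is simply `tense_id`; `class_id` is only needed for the 300-class stretch goal.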

Block C — implement ViT

In vit.py:

patchify(x, P) — the einsum from theory/01-vision-transformers.md.
class TinyViT — wraps patchify + linear-proj + [CLS] + pos-emb + 4 transformer blocks + cls-head.
Initialize position embedding to small random (std 0.02). Initialize [CLS] to small random.
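A sketch of `patchify` and the two initializations follows. The theory file's exact einsum is not reproduced here, so this version (a reshape plus an einsum that only permutes axes) is an assumption that may differ from it in form. It uses row-major patch order, so patch `n` sits at grid position `(n // 8, n % 8)`.

```python
import numpy as np

def patchify(x, P):
    """Cut (B, H, W, C) images into (B, n_patches, P*P*C) flattened patches."""
    B, H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    gh, gw = H // P, W // P
    x = x.reshape(B, gh, P, gw, P, C)             # split each spatial axis into (grid, within-patch)
    x = np.einsum("bhpwqc->bhwpqc", x)            # bring the two grid axes together
    return x.reshape(B, gh * gw, P * P * C)

rng = np.random.default_rng(0)
images = rng.random((2, 32, 32, 3))
patches = patchify(images, P=4)
print(patches.shape)  # (2, 64, 48)

# Small random init, std 0.02, as the TODO asks.
pos_emb = rng.normal(0.0, 0.02, size=(65, 64))
cls_token = rng.normal(0.0, 0.02, size=(1, 1, 64))
```

A quick correctness check is to compare one patch against a direct slice of the image, for example patch 9 against rows 4..7 and columns 4..7.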

Block D — train

In train.py:

Load icons + labels.
Build train/test split (80/20 stratified by class).
Train for ≤ 20 epochs, batch size 32, SGD with momentum 0.9, LR 1e-3 with cosine schedule.
Log train loss + test accuracy each epoch.
Save loss.png and accuracy.png.
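The optimizer settings above can be sketched as two small functions. The names `cosine_lr` and `sgd_momentum_step` are hypothetical, and the schedule decays once from the base LR to zero over the whole run with no warmup, which is one common reading of "cosine schedule". The toy problem uses a larger LR than the lab's 1e-3 only so that convergence is visible in a few hundred steps.

```python
import math
import numpy as np

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """In-place SGD with classical momentum over dicts of NumPy arrays."""
    for name in params:
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        params[name] += velocity[name]

# Toy problem: minimise ||w||^2, whose gradient is 2w.
params = {"w": np.array([1.0, -2.0])}
velocity = {"w": np.zeros(2)}
total_steps = 200
for step in range(total_steps):
    grads = {"w": 2.0 * params["w"]}
    lr = cosine_lr(step, total_steps, base_lr=0.1)
    sgd_momentum_step(params, grads, velocity, lr)

final_norm = float(np.linalg.norm(params["w"]))
print(final_norm)
```

In train.py, `total_steps` would be the number of epochs times the number of batches per epoch (25 batches per epoch for 800 images at batch size 32).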

Block E — evaluate

In eval.py:

Load the final model. Compute test accuracy.
Compute confusion matrix. Save as confusion_matrix.png.
Acceptance: test accuracy ≥ 80% on the tense-only task (5 classes). If you're below 80%, debug: check patchify output shape, attention output entropy, classification logits per class.
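A minimal confusion-matrix computation in NumPy, with rows as true classes and columns as predictions. The function name and the toy labels are illustrative only; plotting to confusion_matrix.png is left to eval.py.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Return an (n_classes, n_classes) count matrix: rows = true, cols = predicted."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)   # handles repeated index pairs
    return cm

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)

accuracy = np.trace(cm) / cm.sum()
# Per-class recall; classes with no true samples would give 0/0, so guard the division.
row_totals = cm.sum(axis=1)
recall = np.divide(np.diag(cm), row_totals,
                   out=np.zeros(len(cm)), where=row_totals > 0)
print(cm)
print(accuracy, recall)
```

Per-class recall is often more informative than the overall accuracy here: a model that confuses two similar tense colors shows up as two low rows rather than a slightly lower average.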

Block F — stretch goals (optional)

Repeat for the 15-class (tense × person) task. Should still hit ≥ 70%.
Repeat for the 300-class (verb × tense × person) task. Goal: ≥ 30% top-1 (chance is 0.33%; even 30% is meaningful). If you can't hit this, the task probably exceeds the model's capacity at 150k params — note this in the report.
Visualize the attention pattern of the [CLS] token over the 64 patches for one example per class. Save as cls_attention.png. Does it focus on the regions that encode the answer?
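For the attention visualization, one way to turn a block's attention weights into an image-sized heat map is sketched below. It assumes you can extract weights of shape (n_heads, 65, 65) for a single example with [CLS] at index 0 and row-major patch order; whether the Phase 17 block exposes its weights is not specified here, and `cls_attention_map` is a hypothetical helper.

```python
import numpy as np

def cls_attention_map(attn, grid=8, patch=4):
    """Average [CLS]-to-patch attention over heads and upsample to image size."""
    cls_to_patches = attn[:, 0, 1:]                 # (n_heads, 64): CLS query, patch keys only
    heat = cls_to_patches.mean(axis=0).reshape(grid, grid)
    return np.kron(heat, np.ones((patch, patch)))   # repeat each cell over its 4x4 patch

# Random stand-in for real attention weights: softmax over the key axis.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 65, 65))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

heat_map = cls_attention_map(attn)
print(heat_map.shape)  # (32, 32)
```

The resulting array can be overlaid on the 32 × 32 icon when building cls_attention.png.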

Acceptance criteria

You ship lab 00 when:

BLUEPRINT.md approved by user before code.
1000 icons generated, reproducibly (seed in manifest).
ViT implementation reuses Phase 17's transformer block — no copy-pasting of attention code.
Tense-only task achieves ≥ 80% test accuracy.
Training takes ≤ 5 minutes wall-clock on the i5-8250U CPU (4 cores). Profile and report if over.
Confusion matrix committed.
README.md written, 1–2 paragraphs.
manifest.json includes: NumPy version, Pillow version, seed, hyperparams, final test accuracy, training wall-clock.

What this lab is intentionally NOT

Not a SOTA experiment. 150k params on synthetic icons proves the mechanism, not the limits. Real ViT-Base on ImageNet is 86M params trained for days; we have a 150k-param toy.
Not a generic image classifier. The icons are deliberately tied to §A13 grammar.
Not a benchmark of ViT vs CNN. That comparison needs orders of magnitude more data than 1000 synthetic icons. We're verifying the mechanism.

Debugging hints

If accuracy is stuck at chance (20% for 5 classes):

Check patchify output. Print patches.shape. Should be (B, 64, 48). If it's (B, 48, 64) you have axes swapped.
Check [CLS] prepending. After prepending, shape is (B, 65, 64). After 4 blocks, still (B, 65, 64). Take index 0 along axis 1, not axis 0 (that would take batch).
Check position embedding. Should be shape (65, 64) and broadcast over batch. If you wrote (64, 65) accidentally, adding it to (B, 65, 64) tokens fails with a NumPy broadcast error rather than silently adding along the wrong axis.
Check classification head input. Should be the [CLS] representation, shape (B, 64). Not the mean of all tokens.
Check that the loss is decreasing. If train loss is flat, a gradient is probably zero (or wrong) somewhere. Print np.linalg.norm(grad) for each block's weights.
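Two cheap sanity checks for the hints above can be sketched on the classification head alone: the loss at initialization should sit near ln(5) ≈ 1.609 for 5 classes, and an analytic gradient should match a finite-difference estimate. The function names are hypothetical, and the same pattern extends to any other weight matrix in vit.py.

```python
import numpy as np

def softmax_xent(logits, y):
    """Mean cross-entropy of integer labels y under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def head_loss_and_grad(W, h, y):
    """Loss and analytic dLoss/dW for logits = h @ W."""
    logits = h @ W
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0                 # softmax minus one-hot
    return softmax_xent(logits, y), h.T @ p / len(y)

def numeric_grad(f, W, eps=1e-5):
    """Central finite differences, one entry of W at a time."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f(W)
        W[idx] = old - eps
        f_minus = f(W)
        W[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 64))                       # stand-in [CLS] representations
y = rng.integers(0, 5, size=8)
W = rng.normal(0.0, 0.02, size=(64, 5))

loss, grad = head_loss_and_grad(W, h, y)
num = numeric_grad(lambda W_: softmax_xent(h @ W_, y), W)
max_err = np.abs(grad - num).max()
chance_loss = softmax_xent(np.zeros((8, 5)), y)    # uniform prediction over 5 classes
print(loss, max_err, chance_loss)
```

If the training loss stays near the chance-level value for many epochs, the model is effectively predicting a uniform distribution, which points at a dead gradient path rather than a tuning problem.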

If train accuracy >> test accuracy: overfitting on 800 training images is expected for 150k params. Raise dropout from the default 0.1 toward 0.3, or reduce model size.

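If training takes > 5 min: profile. Attention is the usual suspect, but here it is small: with \(T = 65\) tokens (64 patches + [CLS]) and \(d = 64\), computing the attention scores costs \(O(T^2 d) \approx 65^2 \cdot 64 \approx 270\)k multiply-adds per layer summed over all 4 heads (about 68k per head with \(d_k = 16\)). The more likely bottleneck is the patchify reshape, if you wrote it as a Python loop rather than an einsum.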

What you'll have learned

By the end of lab 00:

A transformer block is modality-agnostic. The same architecture works for text (Phase 17) and images (this lab).
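Patchify is the main architectural difference between a vision transformer and a language transformer. The remaining changes in this lab are small: no causal mask, a [CLS] token, and a classification head. The transformer block itself is reused.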
A 150k-param ViT on synthetic data is the smallest meaningful vision-transformer experiment. It runs on a CPU in 5 minutes.
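Real ViT-Base is the same architecture scaled up ~500× in parameters and at least ~\(10^3 \times\) in data (ImageNet-1k alone has about 1.28M training images versus our 1000 icons; larger pretraining sets widen the gap further). The mechanics extend; the data needs are what's different.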

Next lab: lab/01-clip-style-grammar-image-pairing.md reuses this ViT (without the classification head) as the image encoder for a CLIP-style contrastive setup, paired with a tiny text encoder.