Lab 00: Tiny ViT on Grammar Icons
Goal: build a 4-block Vision Transformer on top of the Phase 17 transformer block, train it on a synthetic dataset of "grammar icons" (each verb form rendered as a small colored image), and achieve ≥ 80% top-1 accuracy on the held-out test set in ≤ 5 minutes of CPU training.
We apply the Phase 17 transformer block to a new modality, images, to see that a transformer is modality-agnostic: the only thing that changes is the embedding layer. We implement patchify + position embedding + transformer block + classification head.
Estimated time: 3–4 hours.
Prereqs:
- `docs/extension-track/X2-multimodal/theory/01-vision-transformers.md` read.
- Phase 17 transformer block implementation (`src/minimodel/transformer/`) complete.
- Python 3.11 + uv env with NumPy, Pillow.
What you produce
A directory experiments/X2-multimodal/lab-00-tiny-vit/ containing:
- `BLUEPRINT.md`: your design before any code. Patchify shape, ViT-tiny config, training loop sketch, test plan. Must be approved by user before code.
- `generate_icons.py`: Pillow script that produces 1000 synthetic icons.
- `data/icons/`: 1000 PNG files named `verb_<verb>_tense_<tense>_person_<person>_<idx>.png`.
- `data/labels.json`: flat list of `{filename, verb_id, tense_id, person_id, class_id}`.
- `vit.py`: your ViT implementation (importing `TransformerBlock` from `src/minimodel/transformer/`).
- `train.py`: training script.
- `eval.py`: eval script with confusion-matrix output.
- `manifest.json`: versions, seed, config (see `CLAUDE.md` §0.5).
- `README.md`: 1–2 paragraphs: what, results, surprises.
- `loss.png`, `accuracy.png`, `confusion_matrix.png`: committed artifacts.
The dataset
We render one synthetic icon per (verb, tense, person, sample-index) combination. The icon encodes:
- Color → tense.
  - Infinitive: gray. Present-simple: blue. Past-simple: red. Past-participle: green. Future: purple.
- Shape → person.
  - 1st-sg (I): circle. 2nd-sg (you): square. 3rd-sg (he/she/it): triangle.
- Pattern inside the shape → verb identity. Use one of 20 distinct internal patterns (alternating stripes, dots, crosshatch, etc.) tied to the 20 §A13 verbs.
The model's task: given a \(32 \times 32\) icon, classify which (verb, tense, person) it depicts. Multi-class with 20 × 5 × 3 = 300 classes, but we'll start with a simpler task: just classify tense (5 classes).
Why this dataset
- Synthetic. Reproducible bit-for-bit (`seed_everything`).
- Tied to §A13. Stays inside the grammar universe; this is not a "cats vs. dogs" example.
- Easy enough for a tiny ViT to learn on CPU. The color signal alone is enough for the tense task; a tiny ViT should hit 90%+. The "all 300 classes" version is a stretch goal.
- Reusable in lab 01. Lab 01 uses the same icons paired with text snippets for CLIP-style training.
Icon generation spec
For each (verb, tense, person) triple in §A13's grammar scope (20 × 5 × 3 = 300 combos):
# Pseudocode: fill in details in generate_icons.py
TENSE_COLOR = {  # RGB fill color encodes tense
    "infinitive": (128, 128, 128),
    "present_simple": (50, 100, 200),
    "past_simple": (200, 50, 50),
    "past_participle": (50, 200, 100),
    "future": (160, 80, 200),
}
PERSON_SHAPE = {  # outline shape encodes person
    "1sg": "circle",
    "2sg": "square",
    "3sg": "triangle",
}
VERB_PATTERN = {  # interior pattern encodes verb identity
    "work": "stripes_horizontal",
    "play": "stripes_vertical",
    # ... 18 more
}
# VERBS, TENSES, PERSONS come from §A13; render_icon is yours to implement.
for verb in VERBS:
    for tense in TENSES:
        for person in PERSONS:
            for idx in range(N_PER_COMBO):  # N_PER_COMBO = 1000 / 300 ≈ 3-4
                img = render_icon(verb, tense, person, idx, jitter=True)
                # If render_icon returns a NumPy array, wrap it with
                # PIL.Image.fromarray(img) before calling .save().
                img.save(...)
Apply random jitter (position, slight rotation, slight color noise) so the model sees variation per (verb, tense, person), forcing it to learn the underlying features rather than memorize.
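Total: ~1000 icons, 3–4 per combo. Splits: 80% train (800), 20% test (200), stratified so that every class of the task being trained appears in both splits. This is straightforward for the 5-class and 15-class tasks. Note that 200 test icons cannot cover all 300 combos, so in the full 300-class task some combos will have no test example unless the test share is raised.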
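One way to build such a split with only the standard library is sketched below. The function name `stratified_split` and its signature are illustrative assumptions, not part of the lab's required API; class ids are taken as a plain list, such as the `class_id` (or `tense_id`) field read from `labels.json`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=0):
    """Split sample indices so each class sends ~test_frac of its items to test.

    `labels` holds one class id per sample. Returns (train_idx, test_idx),
    both sorted lists of indices into `labels`.
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        # Aim for at least one test sample per class, but never move the
        # whole class into test (a singleton class stays in train).
        n_test = max(1, round(len(idx) * test_frac))
        n_test = min(n_test, len(idx) - 1)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Example: 1000 samples spread evenly over the 5 tense classes.
labels = [i % 5 for i in range(1000)]
train_idx, test_idx = stratified_split(labels, test_frac=0.2, seed=0)
```

Using a dedicated `random.Random(seed)` instance keeps the split independent of any other random calls in the script.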
Architecture: tiny ViT
A 4-block ViT, scaled down dramatically from ViT-Base.
Config
| Hyperparam | Value | Notes |
|---|---|---|
| image_size | 32 | \(32 \times 32\) RGB |
| patch_size | 4 | gives 64 patches |
| n_classes | 5 (tense), or 15 (tense × person), or 300 (full) | |
| d_model | 64 | tiny |
| n_heads | 4 | \(d_k = 16\) |
| n_blocks | 4 | reuse Phase 17 block × 4 |
| d_ff | 256 | 4× expansion |
| dropout | 0.1 | minimal |
Total params: ~150k. Small enough that you can sanity-check by hand-computing layer sizes.
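The hand count can be scripted as below. This is a sketch under assumptions the config table does not state: every linear layer has a bias, each block holds four attention projections (Q, K, V, output), a two-layer FFN, and two LayerNorms with scale and shift, and there is no final LayerNorm. Under those assumptions the total comes to roughly 208k for the 5-class head, so expect your own figure to depend on what the Phase 17 block actually contains.

```python
def count_params(d_model=64, d_ff=256, n_blocks=4, n_tokens=65,
                 patch_dim=48, n_classes=5):
    """Parameter count for a tiny ViT under an ASSUMED standard layout."""
    patch_proj = patch_dim * d_model + d_model        # W_E plus bias
    cls_token = d_model                               # one learnable vector
    pos_emb = n_tokens * d_model                      # (65, 64) table
    attn = 4 * (d_model * d_model + d_model)          # Q, K, V, output
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                   # scale + shift, twice
    block = attn + ffn + layer_norms
    head = d_model * n_classes + n_classes
    total = patch_proj + cls_token + pos_emb + n_blocks * block + head
    return {
        "patch_proj": patch_proj,
        "cls_token": cls_token,
        "pos_emb": pos_emb,
        "block": block,
        "head": head,
        "total": total,
    }


counts = count_params()
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Dropping biases or LayerNorm parameters changes the total only slightly; the four blocks dominate the count.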
Forward pass
input image (B, 32, 32, 3)
↓ patchify (P=4)
patches (B, 64, 48) # 4·4·3 = 48
↓ linear projection W_E (48 → 64)
patch tokens (B, 64, 64)
↓ prepend [CLS] token
tokens (B, 65, 64)
↓ + learnable position embedding (65, 64)
tokens (B, 65, 64)
↓ × 4 transformer blocks (Phase 17 block, no causal mask)
output (B, 65, 64)
↓ take [CLS] (index 0): (B, 64)
↓ classification head: Linear(64 → n_classes)
logits (B, n_classes)
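A minimal NumPy sketch of the patchify step in the diagram above, written with reshape and transpose rather than the einsum from the theory notes (which is not reproduced here). It assumes channels-last input `(B, H, W, C)` and orders patches row by row; that ordering is an assumption of this sketch.

```python
import numpy as np


def patchify(x, P):
    """Cut (B, H, W, C) images into flattened, non-overlapping P x P patches.

    Returns an array of shape (B, (H // P) * (W // P), P * P * C).
    """
    B, H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    x = x.reshape(B, H // P, P, W // P, P, C)   # split both spatial axes
    x = x.transpose(0, 1, 3, 2, 4, 5)           # (B, H/P, W/P, P, P, C)
    return x.reshape(B, (H // P) * (W // P), P * P * C)


x = np.arange(2 * 32 * 32 * 3, dtype=np.float32).reshape(2, 32, 32, 3)
patches = patchify(x, P=4)
print(patches.shape)  # (2, 64, 48)
```

Patch `k` in this layout covers grid row `k // 8` and grid column `k % 8`, which is handy when mapping attention weights back onto the image.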
What you reuse from Phase 17
from src.minimodel.transformer.block import TransformerBlock
The Phase 17 block does causal self-attention. For ViT we want bidirectional self-attention (no mask). Two options:
- Pass `mask=None` if your Phase 17 block accepts an optional mask argument.
- Add a `causal: bool = True` flag to the block and override here.
Pick whichever requires fewer changes to Phase 17. If you have to add the flag, commit the change to Phase 17 separately with a note that it's required by X2 lab 00. Do not patch Phase 17 silently.
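To illustrate what the flag changes, here is a sketch on a bare single-head attention function. It is not the Phase 17 block: `self_attention` and its signature are hypothetical, and the real block's interface may differ.

```python
import numpy as np


def self_attention(q, k, v, causal=True):
    """Scaled dot-product attention over (B, T, d) arrays.

    Returns (output, weights), where weights has shape (B, T, T).
    """
    T, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)           # (B, T, T)
    if causal:
        # True above the diagonal: positions a query must not see.
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)     # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
out_causal, w_causal = self_attention(x, x, x, causal=True)
out_bidir, w_bidir = self_attention(x, x, x, causal=False)
```

With `causal=False` every token, including [CLS], can attend to every patch, which is what the ViT needs.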
TODOs
Block A: design (write BLUEPRINT.md first)
In BLUEPRINT.md, before any code:
- Sketch the full computational graph with tensor shapes at every step.
- List the trainable params + their shape + total count. Sanity-check against ~150k total.
- Sketch the training loop: optimizer (SGD with momentum is fine — no Adam needed for 5 minutes of training), LR, batch size, n_epochs.
- Sketch the test plan: at least 5 unit tests for `vit.py` (patchify shape, [CLS] prepending, position emb shape, full forward shape, gradient sanity).
- List the alternatives you considered and why you rejected them.
Pause. Get user approval before continuing.
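For the "gradient sanity" item in the test plan, one common approach is a finite-difference check. The sketch below applies it to a bias-free classification head only; the function names are illustrative, and extending the check to the full model is left to your test suite.

```python
import numpy as np


def softmax_ce(logits, y):
    """Mean cross-entropy of integer labels y under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()


def head_loss_and_grad(W, h, y):
    """Loss and analytic dLoss/dW for logits = h @ W."""
    logits = h @ W
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = softmax_ce(logits, y)
    p[np.arange(len(y)), y] -= 1.0          # softmax minus one-hot
    return loss, h.T @ p / len(y)


def numerical_grad(f, W, eps=1e-5):
    """Central-difference gradient of scalar f with respect to W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f(W)
        W[idx] = old - eps
        f_minus = f(W)
        W[idx] = old                         # restore the entry
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g


rng = np.random.default_rng(0)
h = rng.normal(size=(8, 64))                 # stand-in for [CLS] features
y = rng.integers(0, 5, size=8)
W = rng.normal(scale=0.02, size=(64, 5))
loss, analytic = head_loss_and_grad(W, h, y)
numeric = numerical_grad(lambda W_: softmax_ce(h @ W_, y), W)
max_err = float(np.abs(analytic - numeric).max())
```

In float64 the two gradients should agree to many decimal places; a large `max_err` points at a bug in the analytic backward pass.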
Block B: generate the dataset
In generate_icons.py:
- Read `src/minimodel/grammar/verbs.py` (or wherever §A13's verb list lives) to get the canonical 20 verbs, 5 tenses, 3 persons.
- Implement `render_icon(verb, tense, person, idx, jitter=True)` → returns a `(32, 32, 3)` `uint8` NumPy array.
- Loop over all (verb, tense, person, idx) → save as `verb_<verb>_tense_<tense>_person_<person>_<idx>.png`.
- Write `labels.json`.
- Verify: count files, visually inspect 20 randomly-sampled icons (commit to `data/icons_sample.png`).
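A sketch of how a `labels.json` record could be derived from a filename. Several things here are assumptions rather than part of the spec: the `class_id` formula, the requirement that verb names contain only lowercase letters, and the `VERBS` list, which holds only the two verbs named in this lab (the remaining 18 must come from §A13).

```python
import json
import re

# ASSUMPTION: id = position in these lists; the real lists come from §A13.
VERBS = ["work", "play"]  # placeholder: only the two verbs named in the spec
TENSES = ["infinitive", "present_simple", "past_simple",
          "past_participle", "future"]
PERSONS = ["1sg", "2sg", "3sg"]

NAME_RE = re.compile(
    r"^verb_(?P<verb>[a-z]+)_tense_(?P<tense>[a-z_]+)"
    r"_person_(?P<person>[0-9a-z]+)_(?P<idx>\d+)\.png$"
)


def label_record(filename):
    """Parse an icon filename into one labels.json record."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected icon filename: {filename}")
    verb_id = VERBS.index(m["verb"])
    tense_id = TENSES.index(m["tense"])
    person_id = PERSONS.index(m["person"])
    # ASSUMPTION: flatten (verb, tense, person) in row-major order.
    class_id = (verb_id * len(TENSES) + tense_id) * len(PERSONS) + person_id
    return {"filename": filename, "verb_id": verb_id, "tense_id": tense_id,
            "person_id": person_id, "class_id": class_id}


records = [label_record("verb_play_tense_past_simple_person_3sg_2.png")]
labels_json = json.dumps(records, indent=2)
```

For the tense-only task the training script would read `tense_id` instead of `class_id`.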
Block C: implement ViT
In vit.py:
- `patchify(x, P)`: the einsum from `theory/01-vision-transformers.md`.
- `class TinyViT`: wraps patchify + linear-proj + [CLS] + pos-emb + 4 transformer blocks + cls-head.
- Initialize position embedding to small random (std 0.02). Initialize [CLS] to small random.
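The embedding stage before the transformer blocks might look like the following sketch. Parameter names (`W_E`, `b_E`, `cls_token`, `pos_emb`) and the zero-initialized bias are assumptions; the std of 0.02 for [CLS] mirrors the position-embedding value and is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, patch_dim, d_model = 2, 64, 48, 64

W_E = rng.normal(scale=0.02, size=(patch_dim, d_model))   # patch projection
b_E = np.zeros(d_model)
cls_token = rng.normal(scale=0.02, size=(1, 1, d_model))  # shared over batch
pos_emb = rng.normal(scale=0.02, size=(N + 1, d_model))   # (65, 64)


def embed(patches):
    """Map (B, 64, 48) patches to (B, 65, 64) position-aware tokens."""
    tokens = patches @ W_E + b_E                            # (B, 64, 64)
    cls = np.broadcast_to(cls_token, (patches.shape[0], 1, d_model))
    tokens = np.concatenate([cls, tokens], axis=1)          # (B, 65, 64)
    return tokens + pos_emb                                 # broadcast over B


patches = rng.normal(size=(B, N, patch_dim))
tokens = embed(patches)
cls_out = tokens[:, 0, :]   # index 0 along the token axis, shape (B, 64)
```

After the four blocks, the same `[:, 0, :]` slice picks out the [CLS] representation for the classification head.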
Block D: train
In train.py:
- Load icons + labels.
- Build train/test split (80/20 stratified by class).
- Train for ≤ 20 epochs, batch size 32, SGD with momentum 0.9, LR 1e-3 with cosine schedule.
- Log train loss + test accuracy each epoch.
- Save `loss.png` and `accuracy.png`.
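A sketch of the optimizer pieces named above: a cosine learning-rate schedule and an SGD-with-momentum update. Decaying to zero with no warmup is an assumption, as are the function names and the dict-of-arrays parameter layout; the toy quadratic loss only stands in for the real training loss.

```python
import math

import numpy as np


def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """In-place SGD with (heavy-ball) momentum over dicts of arrays."""
    for name in params:
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        params[name] += velocity[name]


total_steps = 20 * 25   # 20 epochs x 25 batches of 32 (800 training icons)
params = {"w": np.ones(4)}
velocity = {"w": np.zeros(4)}
for step in range(total_steps):
    grads = {"w": params["w"].copy()}   # gradient of 0.5 * ||w||^2
    sgd_momentum_step(params, grads, velocity, cosine_lr(step, total_steps))
final_norm = float(np.linalg.norm(params["w"]))
```

Stepping the schedule per batch rather than per epoch gives a smoother decay over such a short run; either choice is compatible with the spec.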
Block E: evaluate
In eval.py:
- Load the final model. Compute test accuracy.
- Compute confusion matrix. Save as `confusion_matrix.png`.
- Acceptance: test accuracy ≥ 80% on the tense-only task (5 classes). If you're below 80%, debug: check patchify output shape, attention output entropy, classification logits per class.
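The confusion matrix itself needs only NumPy. In this sketch rows are true classes and columns are predictions, which is a convention chosen here rather than one the lab mandates; the label arrays are made-up toy data, not results.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulates repeated index pairs
    return cm


# Toy labels for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 3])

cm = confusion_matrix(y_true, y_pred, n_classes=5)
accuracy = np.trace(cm) / cm.sum()
per_class_recall = np.diag(cm) / cm.sum(axis=1)
```

Per-class recall is worth logging next to overall accuracy: two tense colors that get confused show up as an off-diagonal pair even when the headline number passes.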
Block F: stretch goals (optional)
- Repeat for the 15-class (tense × person) task. Should still hit ≥ 70%.
- Repeat for the 300-class (verb × tense × person) task. Goal: ≥ 30% top-1 (chance is 0.33%; even 30% is meaningful). If you can't hit this, the task probably exceeds the model's capacity at 150k params — note this in the report.
- Visualize the attention pattern of the [CLS] token over the 64 patches for one example per class. Save as `cls_attention.png`. Does it focus on the regions that encode the answer?
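For the attention visualization, the sketch below turns [CLS] attention weights into an image-sized heat map. It assumes you can extract weights of shape `(B, n_heads, 65, 65)` from a block, which the Phase 17 block may not expose without changes; the random weights here are placeholders, and averaging over heads is one choice among several.

```python
import numpy as np


def cls_attention_map(attn, grid=8, patch=4):
    """Heat maps of [CLS] attention over patches, upsampled to image size.

    attn: (B, n_heads, T, T) attention weights with [CLS] at index 0.
    Returns an array of shape (B, grid * patch, grid * patch).
    """
    cls_to_patches = attn[:, :, 0, 1:]                 # drop [CLS] -> [CLS]
    heat = cls_to_patches.mean(axis=1)                 # average over heads
    heat = heat / heat.sum(axis=-1, keepdims=True)     # renormalize
    heat = heat.reshape(-1, grid, grid)                # (B, 8, 8)
    # Repeat each cell so one patch covers patch x patch pixels.
    return heat.repeat(patch, axis=1).repeat(patch, axis=2)


# Placeholder weights: random softmax rows standing in for a real model.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 65, 65))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
heat_maps = cls_attention_map(attn)
```

The reshape to an 8 × 8 grid is only valid if patches are ordered row by row, matching however your `patchify` lays them out.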
Acceptance criteria
You ship lab 00 when:
- `BLUEPRINT.md` approved by user before code.
- 1000 icons generated, reproducibly (seed in manifest).
- ViT implementation reuses Phase 17's transformer block — no copy-pasting of attention code.
- Tense-only task achieves ≥ 80% test accuracy.
- Training takes ≤ 5 minutes wall-clock on the i5-8250U CPU (4 cores). Profile and report if over.
- Confusion matrix committed.
- `README.md` written, 1–2 paragraphs.
- `manifest.json` includes: NumPy version, Pillow version, seed, hyperparams, final test accuracy, training wall-clock.
What this lab is intentionally NOT
- Not a SOTA experiment. 150k params on synthetic icons proves the mechanism, not the limits. Real ViT-Base on ImageNet is 86M params trained for days; we have a 150k-param toy.
- Not a generic image classifier. The icons are deliberately tied to §A13 grammar.
- Not a benchmark of ViT vs CNN. That comparison needs orders of magnitude more data than 1000 synthetic icons. We're verifying the mechanism.
Debugging hints
If accuracy is stuck at chance (20% for 5 classes):
- Check patchify output. Print `patches.shape`. Should be `(B, 64, 48)`. If it's `(B, 48, 64)` you have axes swapped.
- Check [CLS] prepending. After prepending, shape is `(B, 65, 64)`. After 4 blocks, still `(B, 65, 64)`. Take index 0 along axis 1, not axis 0 (that would select the first batch element).
- Check position embedding. Should be shape `(65, 64)` and broadcast over batch. If you wrote `(64, 65)` accidentally, NumPy raises a broadcasting error instead of adding it.
- Check classification head input. Should be the [CLS] representation, shape `(B, 64)`. Not the mean of all tokens.
- Check that the loss is decreasing. If train loss is flat, the gradient is probably zero somewhere. Print `np.linalg.norm(grad)` on each block's weights.
If train accuracy >> test accuracy: overfitting on 800 training images is expected for 150k params. Raise dropout above the configured 0.1 (up to about 0.3) or reduce model size.
If training takes > 5 min: profile. The bottleneck is usually attention, but here it is small: with 65 tokens (64 patches + [CLS]) and \(d = 64\), the score computation costs \(O(T^2 d) \approx 65^2 \cdot 64 \approx 270\text{k}\) multiply-adds per layer, summed over all heads. The bottleneck might instead be the patchify reshape (if you wrote it as a loop rather than an einsum).
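Two of the checks mentioned in this lab, attention entropy and per-parameter gradient norms, can be wrapped in small helpers. The names and the dict-of-gradients layout are assumptions of this sketch.

```python
import numpy as np


def attention_entropy(weights, eps=1e-12):
    """Mean Shannon entropy (in nats) of attention rows.

    weights: (..., T, T) array whose last axis sums to 1. Values near
    log(T) mean near-uniform attention; values near 0 mean each query
    attends to a single token.
    """
    row_entropy = -(weights * np.log(weights + eps)).sum(axis=-1)
    return float(row_entropy.mean())


def grad_norms(grads):
    """L2 norm of every gradient array in a {name: array} dict."""
    return {name: float(np.linalg.norm(g)) for name, g in grads.items()}


T = 65
uniform = np.full((1, T, T), 1.0 / T)     # every query spreads evenly
one_hot = np.eye(T)[None, :, :]           # every query picks one token
h_uniform = attention_entropy(uniform)
h_one_hot = attention_entropy(one_hot)
norms = grad_norms({"block0.W_q": np.array([3.0, 4.0])})
```

Entropy that stays pinned at the uniform value throughout training suggests the attention layers are not learning; a gradient norm of exactly zero on one block points at a broken connection in the backward pass.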
What you'll have learned
By the end of lab 00:
- A transformer block is modality-agnostic. The same architecture works for text (Phase 17) and images (this lab).
- Patchify is the main architectural difference between a vision transformer and a language transformer; the other changes in this lab are dropping the causal mask and adding a [CLS] token. Everything else is reused.
- A 150k-param ViT on synthetic data is the smallest meaningful vision-transformer experiment. It runs on a CPU in 5 minutes.
- Real ViT-Base is the same architecture scaled up ~500× in parameters and at least ~\(10^3 \times\) in data (ImageNet-1k alone has about 1.28M training images against our 800). The mechanics extend; the data needs are what's different.
Next lab: lab/01-clip-style-grammar-image-pairing.md reuses this ViT (without the classification head) as the image encoder for a CLIP-style contrastive setup, paired with a tiny text encoder.