Lab 02: Whisper Inference Walkthrough
Goal: load OpenAI's `whisper-tiny.en` (39M params) via HuggingFace `transformers`, feed it 24 prerecorded audio clips of English verb forms (one per (verb, tense, person) sample from §A13), and inspect (a) the decoded token sequence, (b) the per-token logits, (c) the timestamp tokens, (d) the encoder-decoder cross-attention pattern. No training. Inference + introspection only.
This lab is inspection, not training. We load pretrained Whisper-tiny, give it audio clips of verbs in different conjugations, and open the box: which tokens does it emit? Which timestamps does it mark? Which audio frames does it attend to when it generates each text token? This is the point where the theory of the log-mel frontend and of cross-attention becomes concrete.
Estimated time: 3–4 hours.
Prereqs:
- `docs/extension-track/X2-multimodal/theory/02-audio-models.md` read.
- `huggingface_hub`, `transformers`, `librosa` (or `soundfile`) installed in your `uv` venv.
- The 24 audio clips committed in `docs/extension-track/X2-multimodal/lab/data/audio/`. See "Data" below for how to obtain or generate them.
Why this lab uses transformers (a one-time exception)
CLAUDE.md §0.4 says: no transformers lib before Phase 24.
X2 is an extension track, not part of the 40-phase core ritual. We're post-Phase-24 in spirit. Plus:
- This lab is inference-only. There's no training mechanic to reveal.
- Re-implementing Whisper's log-mel frontend + conv stem + 4-layer encoder + 4-layer decoder + BPE tokenizer from scratch is a ~3-week project that adds zero conceptual insight beyond what `theory/02-audio-models.md` already derives.
- The lab's goal is introspection of a pretrained checkpoint, which requires loading that checkpoint, which requires `transformers` (or replicating its checkpoint-loading code, which is the same compromise).
The exception is local to this lab. Lab 00 and lab 01 stay NumPy-only.
Data
We need 24 prerecorded audio clips, each one a (verb, tense, person) sample, covering 9 verbs chosen from §A13 (a subset; the full 300 would be excessive).
Verb subset (24 clips across 9 verbs, 2 to 4 forms each, mixing tenses/persons)
A representative set covering all 5 tenses and all 3 persons:
| Clip # | Sentence | Verb | Tense | Person |
|---|---|---|---|---|
| 01 | "I work" | work | present_simple | 1sg |
| 02 | "you worked" | work | past_simple | 2sg |
| 03 | "he works" | work | present_simple | 3sg |
| 04 | "I will play" | play | future | 1sg |
| 05 | "you played" | play | past_simple | 2sg |
| 06 | "she has played" | play | past_participle | 3sg |
| 07 | "to walk" | walk | infinitive | - |
| 08 | "he walks" | walk | present_simple | 3sg |
| 09 | "I talked" | talk | past_simple | 1sg |
| 10 | "you will talk" | talk | future | 2sg |
| 11 | "she listens" | listen | present_simple | 3sg |
| 12 | "I have listened" | listen | past_participle | 1sg |
| 13 | "you watch" | watch | present_simple | 2sg |
| 14 | "he watched" | watch | past_simple | 3sg |
| 15 | "I am" | be | present_simple | 1sg |
| 16 | "you were" | be | past_simple | 2sg |
| 17 | "she has been" | be | past_participle | 3sg |
| 18 | "I will be" | be | future | 1sg |
| 19 | "I go" | go | present_simple | 1sg |
| 20 | "you went" | go | past_simple | 2sg |
| 21 | "he has gone" | go | past_participle | 3sg |
| 22 | "to eat" | eat | infinitive | - |
| 23 | "I ate" | eat | past_simple | 1sg |
| 24 | "she will eat" | eat | future | 3sg |
How to obtain the clips
Two options:
Option A (preferred, deterministic): generate from text-to-speech.
- Use pyttsx3 (offline, no network) or espeak-ng from CLI.
- One voice, fixed pitch/rate, 16 kHz mono WAV.
- Save as data/audio/clip_<NN>.wav where NN is the clip number.
- Commit a data/audio/generate.py script that produces all 24 clips deterministically from the table above.
Option B (more realistic): record yourself.
- Use arecord or sox or any DAW. 16 kHz, mono, ~2-3 seconds per clip.
- Commit the WAV files. Note in README that they're human-recorded and not reproducible bit-for-bit.
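Either is fine for the lab. Option A is recommended for reproducibility. Whisper's behavior on TTS audio is its own interesting question: it generally transcribes clean synthetic speech well, plausibly because web-scale training audio includes some of it, though that is an assumption rather than a documented fact.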
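As a rough sketch of what `data/audio/generate.py` could look like for Option A, the block below builds one `espeak-ng` command per row of the clip table. The voice, rate, and pitch values and the helper names (`clip_path`, `espeak_command`, `generate_all`) are illustrative assumptions, not settings the lab prescribes.

```python
"""Sketch of data/audio/generate.py: one espeak-ng command per clip."""
import shutil
import subprocess
from pathlib import Path

# (clip number, sentence) pairs copied from the clip table.
CLIPS = [
    (1, "I work"), (2, "you worked"), (3, "he works"),
    (4, "I will play"), (5, "you played"), (6, "she has played"),
    (7, "to walk"), (8, "he walks"), (9, "I talked"),
    (10, "you will talk"), (11, "she listens"), (12, "I have listened"),
    (13, "you watch"), (14, "he watched"), (15, "I am"),
    (16, "you were"), (17, "she has been"), (18, "I will be"),
    (19, "I go"), (20, "you went"), (21, "he has gone"),
    (22, "to eat"), (23, "I ate"), (24, "she will eat"),
]


def clip_path(number, out_dir="data/audio"):
    """File name convention from the lab: clip_<NN>.wav."""
    return Path(out_dir) / f"clip_{number:02d}.wav"


def espeak_command(number, sentence, out_dir="data/audio",
                   voice="en-us", rate_wpm=150, pitch=50):
    """Build the argv list; voice/rate/pitch are assumed, fixed values."""
    return ["espeak-ng", "-v", voice, "-s", str(rate_wpm), "-p", str(pitch),
            "-w", str(clip_path(number, out_dir)), sentence]


def generate_all(out_dir="data/audio"):
    """Run espeak-ng once per clip; fails loudly if the binary is missing."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for number, sentence in CLIPS:
        subprocess.run(espeak_command(number, sentence, out_dir), check=True)
```

Calling `generate_all()` writes the 24 files. The synthesizer may not emit 16 kHz audio by default, so check the sample rate of the output; `librosa.load(path, sr=16000)` resamples on load either way.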
What you produce
A directory `experiments/X2-multimodal/lab-02-whisper-inference/` containing:
- `BLUEPRINT.md`
- `transcribe.py`: load Whisper-tiny.en, transcribe all 24 clips, save outputs.
- `inspect_tokens.py`: for each clip, dump the BPE tokens + their probabilities + their timestamps.
- `inspect_cross_attention.py`: for each clip, extract the cross-attention pattern from the decoder.
- `plot_cross_attention.py`: visualize cross-attention as a heatmap (decoder steps × encoder positions).
- `transcriptions.json`: `{clip_id: {expected, predicted, wer, tokens, token_probs, timestamps}}`.
- `manifest.json`
- `README.md`
- `cross_attention_<NN>.png`: one heatmap per clip.
Architecture quick-recap (read theory/02 first)
audio waveform (T_audio at 16 kHz)
↓ log-mel (80 mel × 3000 frames; padded for < 30s)
↓ Conv1D × 2 (stride 1 then stride 2) → (1500 frames, d_model=384)
↓ + sinusoidal positional embedding
↓ encoder: 4 transformer blocks (no causal mask)
↓ encoder_output: (1500, 384)
│
↓ cross-attention key/value
↓
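[<|startoftranscript|>] (+ <|notimestamps|> when timestamps are off; the <|en|> <|transcribe|> prefix is used only by multilingual checkpoints)
↓ decoder: 4 transformer blocks (causal self-attn + cross-attn to encoder)
↓ next-token logits (vocab_size = 51864 for tiny.en; 51865 for multilingual tiny)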
↓ greedy / beam decode
↓ "I work" (with optional timestamp tokens interspersed)
You will be poking at:
- The output token sequence (decoder output).
- The probability distribution at each decoded step (introspection of decoder uncertainty).
- The cross-attention weights at each decoder layer (which audio frames the decoder looked at when generating each text token).
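The frame counts in the recap follow from simple arithmetic, sketched below. The 160-sample hop (10 ms at 16 kHz) is Whisper's log-mel hop length; the function names are made up for this example.

```python
import math

SAMPLE_RATE = 16_000   # Hz, the rate Whisper expects
HOP_LENGTH = 160       # samples between log-mel frames (10 ms)
WINDOW_SECONDS = 30    # every clip is padded or trimmed to this length
CONV_STRIDE = 2        # the second Conv1D halves the time axis


def frontend_shapes(window_seconds=WINDOW_SECONDS):
    """Samples, log-mel frames, encoder positions, seconds per position."""
    n_samples = window_seconds * SAMPLE_RATE
    n_mel_frames = n_samples // HOP_LENGTH
    n_encoder_positions = n_mel_frames // CONV_STRIDE
    seconds_per_position = window_seconds / n_encoder_positions
    return n_samples, n_mel_frames, n_encoder_positions, seconds_per_position


def speech_columns(clip_seconds):
    """Encoder positions that overlap real audio rather than padding."""
    n_positions = frontend_shapes()[2]
    covered = math.ceil(clip_seconds * SAMPLE_RATE / (HOP_LENGTH * CONV_STRIDE))
    return min(n_positions, covered)


print(frontend_shapes())    # (480000, 3000, 1500, 0.02)
print(speech_columns(2.5))  # 125
```

The practical consequence for this lab: a 2 to 3 second clip occupies only the first 100 to 150 of the 1500 encoder positions, and everything after that is padding.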
TODOs
Block A: BLUEPRINT
- Sketch how you'll load Whisper, get the encoder hidden states, get the decoder cross-attention, and plot. Reference the `transformers` API: `WhisperProcessor`, `WhisperForConditionalGeneration`, `model.generate(..., output_attentions=True, return_dict_in_generate=True)`.
- List the model's actual config to verify it (vocab_size, d_model, n_layers).
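One way to do the config check is to compare the loaded config against the values this lab expects. The sketch below assumes the attribute names of the Hugging Face `WhisperConfig`; `config_mismatches` and `load_config` are hypothetical helpers, and the demo uses a stand-in object so nothing has to be downloaded.

```python
from types import SimpleNamespace

# Values expected for openai/whisper-tiny.en (attribute names as in the
# Hugging Face WhisperConfig). Treat these as claims to verify, not as truth.
EXPECTED_CONFIG = {
    "vocab_size": 51864,
    "d_model": 384,
    "encoder_layers": 4,
    "decoder_layers": 4,
    "encoder_attention_heads": 6,
    "decoder_attention_heads": 6,
    "num_mel_bins": 80,
    "max_source_positions": 1500,
}


def config_mismatches(config, expected=EXPECTED_CONFIG):
    """Return {name: (expected, actual)} for every field that differs."""
    mismatches = {}
    for name, want in expected.items():
        got = getattr(config, name, None)
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


def load_config(checkpoint="openai/whisper-tiny.en"):
    """Fetch the real config (needs transformers and network or a warm cache)."""
    from transformers import AutoConfig
    return AutoConfig.from_pretrained(checkpoint)


# Stand-in config with one deliberately wrong field.
fake = SimpleNamespace(**{**EXPECTED_CONFIG, "vocab_size": 51865})
print(config_mismatches(fake))  # {'vocab_size': (51864, 51865)}
```

In the real script, `config_mismatches(load_config())` should return an empty dict; anything else belongs in `BLUEPRINT.md` as a correction.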
Block B: transcribe
In `transcribe.py`:
- Load `openai/whisper-tiny.en` and its `WhisperProcessor`.
- For each WAV in `data/audio/`:
  - Load via `librosa.load(path, sr=16000)`.
  - Pass through `processor` to get `input_features` (the log-mel).
  - Verify `input_features.shape == (1, 80, 3000)`.
  - Call `model.generate(input_features, return_timestamps=True, return_dict_in_generate=True, output_scores=True, output_attentions=True)`.
  - Decode generated IDs with `processor.batch_decode(..., skip_special_tokens=False)`.
  - Save expected vs predicted transcription per clip.
- Compute word error rate (WER) per clip. (For 24 simple clips, WER should be near 0% once case and punctuation are normalized, because Whisper is very good at clean speech. Any errors are themselves interesting findings.)
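The steps above can be sketched as follows. `wer` is a plain word-level edit distance, and `normalize` is an assumed, minimal normalization (lowercase, drop punctuation); `transcribe_clip` mirrors the listed calls and imports its heavy dependencies lazily, so the WER helper can be used without them. `torch` is assumed to be installed alongside `transformers`.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation so "I work." matches "I work"."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def wer(expected, predicted):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = normalize(expected), normalize(predicted)
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def transcribe_clip(path, processor, model):
    """Run one clip through Whisper and return (text, generate outputs)."""
    import librosa
    import torch

    audio, _ = librosa.load(path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    assert tuple(inputs.input_features.shape) == (1, 80, 3000)
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_features,
            return_timestamps=True,
            return_dict_in_generate=True,
            output_scores=True,
            output_attentions=True,
        )
    text = processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
    return text, outputs


print(wer("he works", "He work."))  # 0.5
```

Depending on the installed `transformers` version, attention weights may only be returned when the model is loaded with `attn_implementation="eager"`; check this against your environment before relying on `output_attentions=True`.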
Block C: token introspection
In `inspect_tokens.py`:
- For each clip, walk through the generated token sequence step by step.
- At each step, log:
  - The token ID and its decoded text.
  - The top-5 alternatives + their softmax probabilities (from `outputs.scores`).
  - Whether this token is a timestamp token (token ID in the `<|N.NN|>` range).
- Findings to look for:
  - Are the timestamp tokens roughly where the speech occurred? (For "I work" the text tokens should follow a `<|0.00|>` and be closed by something like `<|1.50|>`; timestamp tokens advance in 0.02 s steps.)
  - On past-tense vs present-tense: does Whisper's confidence drop at the verb-form token (e.g. "worked" vs "works")? Are the alternatives semantically similar?
  - On the task prefix: English-only checkpoints such as `whisper-tiny.en` are prompted with `<|startoftranscript|>` alone, while `<|en|> <|transcribe|>` belongs to the multilingual prompt format. Check which prefix tokens actually appear in `outputs.sequences`, and note that tokens supplied or forced as part of the prompt carry no meaningful confidence.
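A minimal sketch of the per-step logging helpers, written against plain Python lists so it runs without `torch`. The `TIMESTAMP_BEGIN` value is an assumption for English-only Whisper vocabularies; confirm it with `processor.tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1`.

```python
import math

# Assumed id of <|0.00|> in the English-only vocabulary; verify before use.
TIMESTAMP_BEGIN = 50363
SECONDS_PER_TIMESTAMP = 0.02


def is_timestamp(token_id, timestamp_begin=TIMESTAMP_BEGIN):
    """Timestamp tokens occupy the top of the vocabulary."""
    return token_id >= timestamp_begin


def timestamp_seconds(token_id, timestamp_begin=TIMESTAMP_BEGIN):
    """Convert a timestamp token id to seconds from the window start."""
    return round((token_id - timestamp_begin) * SECONDS_PER_TIMESTAMP, 2)


def top_k_probs(logits, k=5):
    """Softmax over one step's scores, then the k most likely (id, prob) pairs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # -inf scores become 0.0
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)[:k]
    return [(i, exps[i] / total) for i in ranked]


print(timestamp_seconds(TIMESTAMP_BEGIN + 75))  # 1.5
```

With real outputs, one step's input would be `outputs.scores[step][0].tolist()`. The scores have already passed through the generation logits processors (suppressed tokens show up as `-inf`), so these probabilities describe the constrained distribution rather than the raw model output.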
Block D: cross-attention extraction
In `inspect_cross_attention.py`:
- For each clip, extract the decoder cross-attention from `outputs.cross_attentions`.
- Shape per generation step, after stacking the per-layer tensors: `(n_layers, batch, n_heads, query_len, key_len) = (4, 1, 6, 1, 1500)`. With caching, `query_len` is 1 at every step except possibly the first, where it can equal the prompt length.
- Average over heads, or pick a single head. (Head 4 is sometimes reported as the most interpretable for Whisper-tiny; treat that as a hypothesis to check on your own clips.)
- You now have, for each decoded text token, a distribution over the 1500 encoder positions.
Block E: plot
In `plot_cross_attention.py`:
- For each clip, plot a heatmap: rows = decoded text tokens (e.g. `"I"`, `" work"`), cols = encoder positions 0..1499. Color = cross-attention weight.
- Annotate the row labels with the decoded text.
- The expected pattern (Whisper's hallmark behavior): monotonic, near-diagonal alignment. The cross-attention "scans" left-to-right across the audio as text is generated. This alignment is what word-level timestamp extraction builds on; the `<|N.NN|>` tokens themselves are predicted by the decoder like any other token.
- Save as `cross_attention_<NN>.png`.
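A sketch of the plotting step plus a simple numeric check for the monotonic pattern. `monotonic_fraction` is an ad hoc metric invented for this example, not something Whisper or `transformers` provides, and `max_cols` is an assumed convenience for cropping the padding columns of short clips.

```python
import numpy as np


def peak_positions(attn):
    """Encoder position with the largest weight for each decoded token (row)."""
    return np.asarray(attn).argmax(axis=1)


def monotonic_fraction(attn):
    """Share of consecutive token pairs whose attention peak does not move back."""
    peaks = peak_positions(attn)
    if len(peaks) < 2:
        return 1.0
    return float(np.mean(np.diff(peaks) >= 0))


def plot_heatmap(attn, labels, out_path, max_cols=None):
    """Save a tokens x encoder-positions heatmap; max_cols crops the padding."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    attn = np.asarray(attn)[:, :max_cols]
    fig, ax = plt.subplots(figsize=(10, 0.5 * len(labels) + 2))
    image = ax.imshow(attn, aspect="auto", cmap="viridis")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels([repr(label) for label in labels])  # keeps leading spaces visible
    ax.set_xlabel("encoder position (20 ms each)")
    ax.set_ylabel("decoded token")
    fig.colorbar(image, ax=ax, label="cross-attention weight")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)


diagonal = np.eye(3, 10)             # toy attention that marches left to right
print(monotonic_fraction(diagonal))  # 1.0
```

For a 2.5 s clip, `max_cols=125` keeps only the columns that overlap real audio, which makes the diagonal much easier to see than a plot across all 1500 positions.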
Block F: analysis
In `README.md`:
- Did all 24 clips transcribe correctly (WER ≈ 0)?
- Which clips had the lowest decoder confidence at the verb token? What were the top-5 alternatives at that step?
- Were the timestamps accurate? Compare against the actual clip duration.
- Does the cross-attention show monotonic alignment? Pick the cleanest example and the messiest example to discuss.
- Connect to grammar tutor: if you wanted to flag spoken grammar errors (e.g. the user says "he work" instead of "he works"), how would you use Whisper's per-token probabilities? (Hint: look at the probability of `works` vs `work` at the verb-token position when the audio is "he work".)
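The hint in the last bullet can be made concrete with a two-way comparison at the verb position. This is only a sketch: `preference` and `looks_corrected` are hypothetical helpers, the 0.5 threshold is arbitrary, and it assumes each verb form is a single token whose id you looked up with the tokenizer (for example `processor.tokenizer.encode(" works", add_special_tokens=False)`).

```python
import math


def preference(logits, id_a, id_b):
    """P(a | a or b): softmax restricted to the two candidate tokens."""
    diff = logits[id_a] - logits[id_b]
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    ratio = math.exp(diff)  # stable branch for large negative differences
    return ratio / (1.0 + ratio)


def looks_corrected(logits, spoken_id, grammatical_id, threshold=0.5):
    """True when the model favors the grammatical form over what was said."""
    return preference(logits, grammatical_id, spoken_id) > threshold


# Toy logits over a 3-token vocabulary: index 1 = " work", index 2 = " works".
toy_logits = [0.0, 1.0, 3.0]
print(round(preference(toy_logits, 2, 1), 3))  # 0.881
print(looks_corrected(toy_logits, spoken_id=1, grammatical_id=2))  # True
```

If the audio says "he work" and the model still puts most of this two-way mass on `works`, the language-model prior is overriding the acoustics, which is exactly the signal a grammar tutor cares about.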
Block G: stretch goals
- Adversarial test. Record 5 clips with deliberate grammar errors ("he work", "you was", "I has"). Does Whisper transcribe them as spoken (preserving the error), or does it "correct" them to grammatical English? This is a known property of large speech models: they often "fix" small grammar errors silently. Report what you find.
- Cross-attention head specialization. For one clip, plot each of the 6 attention heads separately. Are any heads specialized (e.g. one for boundary detection, one for vowel duration)? This is exploratory and findings vary.
- Greedy vs beam decode comparison. Generate the same clip with `num_beams=1` vs `num_beams=5`. Are the outputs identical, or does beam search find a different transcription?
Acceptance criteria
- `BLUEPRINT.md` approved.
- All 24 clips transcribed; WER reported per clip.
- Per-token probabilities dumped for at least 5 representative clips.
- Cross-attention heatmaps committed for all 24 clips.
- `README.md` discussion includes the monotonic-alignment finding and at least one surprising or unexpected observation.
- `manifest.json` includes Whisper checkpoint hash, `transformers` version, `librosa` version, seed (if relevant for any stochastic decoding), wall-clock for full inference run.
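One possible shape for the manifest builder, using only the standard library. Interpreting "checkpoint hash" as the SHA-256 of the downloaded weights file is an assumption (the Hub commit hash would be another reasonable reading), and the field names are illustrative.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(checkpoint_file, wall_clock_seconds, seed=None):
    """Collect the fields the acceptance criteria ask for."""
    return {
        "checkpoint": "openai/whisper-tiny.en",
        "checkpoint_sha256": file_sha256(checkpoint_file),
        "transformers_version": package_version("transformers"),
        "librosa_version": package_version("librosa"),
        "python_version": platform.python_version(),
        "seed": seed,
        "wall_clock_seconds": round(wall_clock_seconds, 3),
    }


def write_manifest(manifest, path="manifest.json"):
    """Write the manifest with stable key order for clean diffs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

Wrap the full inference loop in `time.perf_counter()` calls to get `wall_clock_seconds`, then pass the path of the cached weights file as `checkpoint_file`.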
What this lab is intentionally NOT
- Not Whisper training. We're not retraining. The model is used as-is.
- Not a comparison of Whisper to other ASR. No wav2vec, no Conformer-RNN-T. One model, deep inspection.
- Not streaming. Whisper processes fixed 30 s windows offline. All 24 clips are < 30 s.
- Not multilingual. `whisper-tiny.en` is English-only. The multilingual `whisper-tiny` (also 39 M, no `.en`) has a different vocabulary and behavior.
What you'll have learned
- The log-mel frontend is concrete: a clip padded to 30 s → `(80, 3000)` log-mel matrix. The `processor` does this for you, but now you know what's in `input_features`.
- Cross-attention is interpretable in Whisper. The decoder's "look at the audio for the next text token" pattern is visible and roughly monotonic. This alignment is what word-level timestamping builds on.
- A 39 M-param model is enough for clean speech in a known language. Most "transcription" tasks don't need GPT-scale; they need 39 M trained on enough audio.
- The model has language-model priors baked in. It will sometimes "fix" grammar errors silently. This is great for ASR-as-product, bad for grammar-tutor-as-product. The grammar tutor needs a model that transcribes exactly what was said, not a smoothed version.
That last point is the interview-relevant finding: when you choose an ASR for a grammar tutor, you specifically want a model with less language-model prior than Whisper — perhaps wav2vec2 fine-tuned with CTC loss, which doesn't have the autoregressive language-model bias. The right model depends on whether you want a fluent transcript or a verbatim one.
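To see why a CTC model tends to be more verbatim, here is a toy sketch of greedy CTC decoding: each frame votes independently for a label, repeats are collapsed, and blanks are dropped. The character vocabulary and frame labels are made up for illustration and are not wav2vec2's actual tokenizer or output.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse runs of identical frame labels, then drop the blank label."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank_id]


# Hypothetical character vocabulary; id 0 is the CTC blank.
VOCAB = {0: "_", 1: "h", 2: "e", 3: " ", 4: "w", 5: "o", 6: "r", 7: "k", 8: "s"}

# Per-frame argmax labels for a speaker who said "he work" (no final "s").
frames = [1, 1, 0, 2, 3, 3, 4, 0, 5, 5, 6, 7, 7, 0]
text = "".join(VOCAB[i] for i in ctc_greedy_decode(frames))
print(text)  # he work
```

No frame voted for "s", so no "s" appears: there is no decoder conditioning on previous text that could add one. Real CTC models still learn some implicit priors through the encoder, so treat "more verbatim" as a tendency to measure, not a guarantee.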
Cross-references
- `theory/02-audio-models.md`: the log-mel derivation and Whisper architecture.
- `theory/03-fusion-strategies.md` §"unified-token": note that Whisper is not unified-token; it's a dedicated audio encoder + text decoder. GPT-4o's audio capability is reportedly unified-token.
- `docs/phase-15-attention/theory/03-multi-head.md`: multi-head attention. Whisper's cross-attention is the "encoder is keys/values, decoder is queries" variant we discussed there.
- `HIRING_PATH.md`: "audio gap" line item, closed by completing this lab + the theory.