Lab 02 — Whisper Inference Walkthrough

Goal: load OpenAI's whisper-tiny.en (39M params) via HuggingFace transformers, feed it 24 prerecorded audio clips of English verb forms (one per (verb, tense, person) sample from §A13), and inspect (a) the decoded token sequence, (b) the per-token logits, (c) the timestamp tokens, (d) the encoder-decoder cross-attention pattern. No training. Inference + introspection only.

This lab is inspection, not training. We load pretrained Whisper-tiny, feed it audio clips of verbs in different conjugations, and open the box: which tokens does it emit? Which timestamps does it mark? Which audio frames does it attend to when it generates each text token? This is the point where the theory of the log-mel frontend and of cross-attention becomes concrete.

Estimated time: 3–4 hours.

Prereqs:

  • docs/extension-track/X2-multimodal/theory/02-audio-models.md read.
  • huggingface_hub, transformers, librosa (or soundfile) installed in your uv env.
  • The 24 audio clips committed in docs/extension-track/X2-multimodal/lab/data/audio/. See "Data" below for how to obtain or generate.


Why this lab uses transformers (a one-time exception)

CLAUDE.md §0.4 says: no transformers lib before Phase 24.

X2 is an extension track, not part of the 40-phase core ritual. We're post-Phase-24 in spirit. Plus:

  • This lab is inference-only. There's no training mechanic to reveal.
  • Re-implementing Whisper's log-mel frontend + conv stem + 4-layer encoder + 4-layer decoder + BPE tokenizer from scratch is a ~3-week project that adds zero conceptual insight beyond what theory/02-audio-models.md already derives.
  • The lab's goal is introspection of a pretrained checkpoint, which requires loading that checkpoint, which requires transformers (or replicating its checkpoint-loading code, which is the same compromise).

The exception is local to this lab. Lab 00 and lab 01 stay NumPy-only.


Data

We need 24 prerecorded audio clips. One per (verb, tense, person) sample for 9 verbs chosen from §A13 (a subset; full 300 would be excessive).

Verb subset (24 clips across 9 verbs, 2 to 4 forms each, mixing tenses/persons)

A representative set covering all 5 tenses and all 3 persons:

Clip #  Sentence            Verb    Tense            Person
01      "I work"            work    present_simple   1sg
02      "you worked"        work    past_simple      2sg
03      "he works"          work    present_simple   3sg
04      "I will play"       play    future           1sg
05      "you played"        play    past_simple      2sg
06      "she has played"    play    past_participle  3sg
07      "to walk"           walk    infinitive       -
08      "he walks"          walk    present_simple   3sg
09      "I talked"          talk    past_simple      1sg
10      "you will talk"     talk    future           2sg
11      "she listens"       listen  present_simple   3sg
12      "I have listened"   listen  past_participle  1sg
13      "you watch"         watch   present_simple   2sg
14      "he watched"        watch   past_simple      3sg
15      "I am"              be      present_simple   1sg
16      "you were"          be      past_simple      2sg
17      "she has been"      be      past_participle  3sg
18      "I will be"         be      future           1sg
19      "I go"              go      present_simple   1sg
20      "you went"          go      past_simple      2sg
21      "he has gone"       go      past_participle  3sg
22      "to eat"            eat     infinitive       -
23      "I ate"             eat     past_simple      1sg
24      "she will eat"      eat     future           3sg

How to obtain the clips

Two options:

Option A (preferred, deterministic): generate from text-to-speech.

  • Use pyttsx3 (offline, no network) or espeak-ng from CLI.
  • One voice, fixed pitch/rate, 16 kHz mono WAV.
  • Save as data/audio/clip_<NN>.wav where NN is the clip number.
  • Commit a data/audio/generate.py script that produces all 24 clips deterministically from the table above.

Option B (more realistic): record yourself.

  • Use arecord or sox or any DAW. 16 kHz, mono, ~2-3 seconds per clip.
  • Commit the WAV files. Note in README that they're human-recorded and not reproducible bit-for-bit.

Either is fine for the lab. Option A is recommended for reproducibility. How Whisper behaves on TTS audio is its own interesting question: it usually transcribes clean synthetic speech well, but confirm that on your own clips rather than assuming it.
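A minimal sketch of what data/audio/generate.py could look like for Option A, assuming espeak-ng is on the PATH. The voice, rate, and pitch values are illustrative assumptions, not requirements of the lab, and the helper names (clip_path, build_command, generate_all) are hypothetical. The sample rate espeak-ng writes may not be 16 kHz, so check the files and resample if needed (librosa.load(path, sr=16000) resamples on load).

```python
"""Sketch of data/audio/generate.py: one espeak-ng call per clip."""
import shutil
import subprocess
from pathlib import Path

# Sentences copied from the clip table, in clip order (01..24).
SENTENCES = [
    "I work", "you worked", "he works",
    "I will play", "you played", "she has played",
    "to walk", "he walks",
    "I talked", "you will talk",
    "she listens", "I have listened",
    "you watch", "he watched",
    "I am", "you were", "she has been", "I will be",
    "I go", "you went", "he has gone",
    "to eat", "I ate", "she will eat",
]

VOICE = "en-us"  # assumption: any single fixed voice works
RATE_WPM = 140   # assumption: fixed speaking rate in words per minute
PITCH = 50       # assumption: fixed pitch on espeak-ng's 0..99 scale


def clip_path(out_dir: Path, clip_number: int) -> Path:
    """data/audio/clip_<NN>.wav, with NN zero-padded to two digits."""
    return out_dir / f"clip_{clip_number:02d}.wav"


def build_command(sentence: str, wav_path: Path) -> list:
    """Command line that writes one sentence to one WAV file."""
    return [
        "espeak-ng", "-v", VOICE, "-s", str(RATE_WPM), "-p", str(PITCH),
        "-w", str(wav_path), sentence,
    ]


def generate_all(out_dir: Path) -> None:
    """Render all 24 clips; fails loudly if espeak-ng is missing."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, sentence in enumerate(SENTENCES, start=1):
        subprocess.run(build_command(sentence, clip_path(out_dir, number)), check=True)
```

Calling generate_all(Path("data/audio")) would write clip_01.wav through clip_24.wav. Keeping the sentences in one list next to the command builder makes it easy to diff the script against the table.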


What you produce

A directory experiments/X2-multimodal/lab-02-whisper-inference/ containing:

  • BLUEPRINT.md
  • transcribe.py — load Whisper-tiny.en, transcribe all 24 clips, save outputs.
  • inspect_tokens.py — for each clip, dump the BPE tokens + their probabilities + their timestamps.
  • inspect_cross_attention.py — for each clip, extract the cross-attention pattern from the decoder.
  • plot_cross_attention.py — visualize cross-attention as a heatmap (decoder steps × encoder positions).
  • transcriptions.json: {clip_id: {expected, predicted, wer, tokens, token_probs, timestamps}}.
  • manifest.json.
  • README.md.
  • cross_attention_<NN>.png — one heatmap per clip.

Architecture quick-recap (read theory/02 first)

audio waveform (T_audio at 16 kHz)
  ↓ log-mel (80 mel × 3000 frames; padded for < 30s)
  ↓ Conv1D × 2 (stride 1 then stride 2) → (1500 frames, d_model=384)
  ↓ + sinusoidal positional embedding
  ↓ encoder: 4 transformer blocks (no causal mask)
  ↓ encoder_output: (1500, 384)
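  ↓ cross-attention key/value
[<|startoftranscript|>] (+ <|notimestamps|> or timestamp tokens; the <|en|> <|transcribe|> prefix applies to multilingual checkpoints only)
  ↓ decoder: 4 transformer blocks (causal self-attn + cross-attn to encoder)
  ↓ next-token logits (vocab_size = 51864 for the .en checkpoint; 51865 for multilingual)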
  ↓ greedy / beam decode
  ↓ "I work" (with optional timestamp tokens interspersed)

You will be poking at:

  • The output token sequence (decoder output).
  • The probability distribution at each decoded step (introspection of decoder uncertainty).
  • The cross-attention weights at each decoder layer (which audio frames did the decoder look at when generating each text token).
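The frame counts in the diagram follow from simple arithmetic, sketched below. The 160-sample hop (10 ms) is Whisper's published frontend setting rather than something stated in this lab, and positions_covered is a hypothetical helper name.

```python
SAMPLE_RATE = 16_000   # Hz, as in the diagram
WINDOW_SECONDS = 30    # every input is padded or trimmed to 30 s
HOP_LENGTH = 160       # samples between log-mel frames (10 ms); Whisper's frontend setting
CONV_STRIDE = 2        # the second Conv1D halves the time axis

n_samples = SAMPLE_RATE * WINDOW_SECONDS            # 480000 samples
n_mel_frames = n_samples // HOP_LENGTH              # 3000 log-mel frames
n_encoder_positions = n_mel_frames // CONV_STRIDE   # 1500 encoder positions
seconds_per_position = WINDOW_SECONDS / n_encoder_positions  # 0.02 s each


def positions_covered(clip_seconds: float) -> int:
    """How many encoder positions hold real audio for a clip of this length."""
    covered = round(clip_seconds * n_encoder_positions / WINDOW_SECONDS)
    return min(n_encoder_positions, covered)
```

A 2 s clip therefore covers about 100 of the 1500 encoder positions; the rest is padding. That is worth remembering when reading the cross-attention plots.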


TODOs

Block A — BLUEPRINT

  • Sketch how you'll load Whisper, get the encoder hidden states, get the decoder cross-attention, plot. Reference the transformers API: WhisperProcessor, WhisperForConditionalGeneration, model.generate(..., output_attentions=True, return_dict_in_generate=True).
  • List the model's actual config to verify (vocab_size, d_model, n_layers).
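For the BLUEPRINT, the loading and config check could be sketched as below. load_whisper and summarize_config are hypothetical helper names; the attribute names are the ones WhisperConfig is expected to expose, and any attribute that is missing in your transformers version shows up as None instead of raising.

```python
MODEL_NAME = "openai/whisper-tiny.en"

# Config attributes worth verifying against the architecture recap.
CONFIG_FIELDS = [
    "vocab_size", "d_model",
    "encoder_layers", "decoder_layers",
    "encoder_attention_heads", "decoder_attention_heads",
    "num_mel_bins", "max_source_positions",
]


def load_whisper(model_name: str = MODEL_NAME):
    """Load the processor and model; downloads the checkpoint on first use."""
    # Imported lazily so summarize_config stays usable without transformers.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_name)
    model = WhisperForConditionalGeneration.from_pretrained(model_name)
    model.eval()  # inference only
    return processor, model


def summarize_config(config) -> dict:
    """Collect the fields to compare against the recap (None if absent)."""
    return {name: getattr(config, name, None) for name in CONFIG_FIELDS}
```

Typical use would be processor, model = load_whisper() followed by print(summarize_config(model.config)), then comparing the printed values against the recap before writing any inspection code.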

Block B — transcribe

In transcribe.py:

  • Load openai/whisper-tiny.en and its WhisperProcessor.
  • For each WAV in data/audio/:
      • Load via librosa.load(path, sr=16000).
      • Pass through processor to get input_features (the log-mel).
      • Verify input_features.shape == (1, 80, 3000).
      • Call model.generate(input_features, return_timestamps=True, return_dict_in_generate=True, output_scores=True, output_attentions=True).
      • Decode generated IDs with processor.batch_decode(..., skip_special_tokens=False).
      • Save expected vs predicted transcription per clip.
  • Compute word error rate (WER) per clip. (For 24 simple clips, WER should be near 0% — Whisper is very good at clean speech. Any errors are themselves interesting findings.)
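The steps above can be sketched as follows. This is a sketch under assumptions, not a reference implementation: transcribe_clip mirrors the calls listed in this block (exact generate behavior varies across transformers versions), and the normalization in normalize is a choice made here so that casing, punctuation, and special tokens do not count as word errors.

```python
import re


def transcribe_clip(path, processor, model):
    """Run one clip through Whisper and return (raw generate output, decoded text)."""
    import librosa  # lazy import: only needed when audio is actually loaded

    audio, _ = librosa.load(path, sr=16000)
    features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    assert tuple(features.shape) == (1, 80, 3000)
    outputs = model.generate(
        features,
        return_timestamps=True,
        return_dict_in_generate=True,
        output_scores=True,
        output_attentions=True,
    )
    text = processor.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
    return outputs, text


def normalize(text: str) -> list:
    """Lowercase, drop <|...|> special tokens and punctuation, split into words."""
    text = re.sub(r"<\|[^|]*\|>", " ", text.lower())
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def word_error_rate(expected: str, predicted: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(expected), normalize(predicted)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

With this normalization, a prediction of " I work." against the expected "I work" scores 0.0, while "he work" against "he works" scores 0.5. With clips this short, one wrong word moves WER a lot, so report the raw strings alongside the number.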

Block C — token introspection

In inspect_tokens.py:

  • For each clip, walk through the generated token sequence step by step.
  • At each step, log:
      • The token ID and its decoded text.
      • The top-5 alternatives + their softmax probabilities (from outputs.scores).
      • Whether this token is a timestamp token (token ID in the <|N.NN|> range).
  • Findings to look for:
      • Are the timestamp tokens roughly where the speech occurred? (For "I work" the text tokens should follow a <|0.00|> and be closed by a <|1.50|> or so.)
      • On past-tense vs present-tense: does Whisper's confidence drop at the verb-form token (e.g. "worked" vs "works")? Are alternatives semantically similar?
      • On the task prefix: the <|en|> <|transcribe|> tokens belong to multilingual checkpoints; whisper-tiny.en is normally prompted with <|startoftranscript|> alone. Check which prefix tokens actually appear in outputs.sequences. Any prefix token that is forced rather than predicted carries no information about model confidence.
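The per-step logging can be sketched in plain Python as below. Two assumptions are baked in and should be checked against your checkpoint: timestamp tokens occupy the IDs immediately after <|notimestamps|>, and consecutive timestamp tokens are 0.02 s apart. The function names are hypothetical.

```python
import math


def top_k_probs(logits, k=5):
    """Return [(token_id, probability)] for the k most likely tokens at one step."""
    peak = max(logits)
    # Subtract the peak for numerical stability; exp(-inf) is 0.0, so
    # tokens suppressed during generation get probability zero.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)[:k]
    return [(i, exps[i] / total) for i in ranked]


def timestamp_seconds(token_id, timestamp_begin, step=0.02):
    """Seconds encoded by a timestamp token, or None for an ordinary token.

    timestamp_begin is assumed to be the ID of <|notimestamps|> plus one.
    """
    if token_id < timestamp_begin:
        return None
    return round((token_id - timestamp_begin) * step, 2)
```

In practice the logits for one step would come from something like outputs.scores[step][0].tolist(), and timestamp_begin from processor.tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1. Note that outputs.scores holds the scores after generation-time processing (suppressed tokens show up as -inf), so the probabilities describe the constrained distribution rather than the raw model output.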

Block D — cross-attention extraction

In inspect_cross_attention.py:

  • For each clip, extract the decoder cross-attention from outputs.cross_attentions.
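      • outputs.cross_attentions is nested: one entry per generation step, each holding one tensor per decoder layer. Stacked over layers, the shape per generation step is (n_layers, batch, n_heads, query_len, key_len) = (4, 1, 6, 1, 1500). The first step's query_len can exceed 1 if it covers the prompt tokens.
  • Average over heads, or pick a single head. Head 4 is suggested as an interpretable starting point for Whisper-tiny; treat that as a hypothesis to check against the other heads rather than an established result.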
  • You now have, for each decoded text token, a distribution over the 1500 encoder positions.
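A minimal NumPy sketch of that reduction, assuming the nested structure described above: a sequence over generation steps, each a sequence over layers of arrays shaped (batch, n_heads, query_len, key_len). The function name and the choice of the last decoder layer as default are assumptions made for illustration.

```python
import numpy as np


def stack_cross_attention(cross_attentions, layer=-1, head=None):
    """Build a (n_steps, key_len) matrix: one attention row per generated token.

    layer selects the decoder layer; head=None averages over heads,
    otherwise a single head index is used.
    """
    rows = []
    for step in cross_attentions:            # one entry per generated token
        attn = np.asarray(step[layer])       # (batch, heads, query_len, key_len)
        last_query = attn[0, :, -1, :]       # newest token only: (heads, key_len)
        row = last_query.mean(axis=0) if head is None else last_query[head]
        rows.append(row)
    return np.stack(rows)
```

Torch tensors from generate would first be converted, for example with [[layer.cpu().numpy() for layer in step] for step in outputs.cross_attentions]. Some transformers versions only return attention weights when the model is loaded with the eager attention implementation, so an empty or missing cross_attentions field is worth checking for before debugging anything else.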

Block E — plot

In plot_cross_attention.py:

  • For each clip, plot a heatmap: rows = decoded text tokens (e.g. "I", " work"), cols = encoder positions 0..1499. Color = cross-attention weight.
  • Annotate the row labels with the decoded text.
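  • The expected pattern (Whisper's hallmark behavior): monotonic, near-diagonal alignment. The cross-attention "scans" left-to-right across the audio as text is generated. Word-level timestamp alignment builds on this pattern; the <|N.NN|> timestamp tokens themselves are predicted by the decoder like any other token. For 2 to 3 s clips the speech occupies only about the first 100 to 150 of the 1500 positions, so expect the alignment to sit at the left edge of the plot.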
  • Save as cross_attention_<NN>.png.
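One way to draw the heatmap with matplotlib is sketched below. It assumes matrix is a NumPy array of shape (n_tokens, 1500) and that each encoder position covers 0.02 s (50 positions per second). Cropping the x-axis to the clip duration plus a margin is an optional choice made here for readability; plotting all 1500 columns as the TODO describes is equally valid. Function names are hypothetical.

```python
import math

POSITIONS_PER_SECOND = 50  # 1500 encoder positions spread over 30 s


def positions_to_show(clip_seconds, margin_seconds=0.5, n_positions=1500):
    """Number of leading encoder positions to plot for a clip of this length."""
    wanted = math.ceil((clip_seconds + margin_seconds) * POSITIONS_PER_SECOND)
    return min(n_positions, wanted)


def plot_cross_attention(matrix, token_labels, clip_seconds, out_path):
    """Save a tokens-by-time heatmap of cross-attention weights."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    n_cols = positions_to_show(clip_seconds, n_positions=matrix.shape[1])
    n_rows = len(token_labels)
    fig, ax = plt.subplots(figsize=(8, 0.4 * n_rows + 1.5))
    image = ax.imshow(
        matrix[:, :n_cols],
        aspect="auto",
        origin="upper",
        # x-axis in seconds, row i spans y in [i, i + 1] with row 0 on top
        extent=(0, n_cols / POSITIONS_PER_SECOND, n_rows, 0),
    )
    ax.set_yticks([i + 0.5 for i in range(n_rows)])
    ax.set_yticklabels(token_labels)
    ax.set_xlabel("audio time (s)")
    ax.set_ylabel("decoded token")
    fig.colorbar(image, ax=ax, label="cross-attention weight")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

A call such as plot_cross_attention(matrix, ["I", " work"], 2.0, "cross_attention_01.png") would plot the first 125 positions (2.5 s) with one labeled row per decoded token.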

Block F — analysis

In README.md:

  • Did all 24 clips transcribe correctly (WER ≈ 0)?
  • Which clips had the lowest decoder confidence at the verb token? What were the top-5 alternatives at that step?
  • Were the timestamps accurate? Compare against the actual clip duration.
  • Does the cross-attention show monotonic alignment? Pick the cleanest example and the messiest example to discuss.
  • Connect to grammar tutor: if you wanted to flag grammar errors in speech (e.g. user says "he work" instead of "he works"), how would you use Whisper's per-token probabilities? (Hint: look at the probability of works vs work at the verb-token position when the audio is "he work".)
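The hint in the last question could be explored with a comparison like the one below: a sketch that takes one step's logits and two candidate token IDs and reports both probabilities plus their log-odds. compare_forms is a hypothetical helper, and it assumes each verb form is a single token, which should be verified with the tokenizer since a form may split into several tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one decoding step."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def compare_forms(logits, said_id, corrected_id):
    """Probabilities of the spoken form vs the grammatical form at the verb step.

    A strongly negative log_odds means the model prefers the corrected
    form even though the speaker said the other one.
    """
    logp = log_softmax(logits)
    return {
        "p_said": math.exp(logp[said_id]),
        "p_corrected": math.exp(logp[corrected_id]),
        "log_odds": logp[said_id] - logp[corrected_id],
    }
```

Whether a given log-odds value should count as a flagged error is an empirical question for the README; the adversarial clips from the stretch goals are the natural calibration set.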

Block G — stretch goals

  • Adversarial test. Record 5 clips with deliberate grammar errors ("he work", "you was", "I has"). Does Whisper transcribe them as spoken (preserving the error), or does it "correct" them to grammatical English? This is a known property of large speech models: they often "fix" small grammar errors silently. Report what you find.
  • Cross-attention head specialization. For one clip, plot each of the 6 attention heads separately. Are any heads specialized (e.g. one for boundary detection, one for vowel duration)? This is exploratory and findings vary.
  • Greedy vs beam decode comparison. Generate the same clip with num_beams=1 vs num_beams=5. Are the outputs identical, or does beam search find a different transcription?

Acceptance criteria

  1. BLUEPRINT.md approved.
  2. All 24 clips transcribed; WER reported per clip.
  3. Per-token probabilities dumped for at least 5 representative clips.
  4. Cross-attention heatmaps committed for all 24 clips.
  5. README.md discussion includes the monotonic-alignment finding and at least one surprising or unexpected observation.
  6. manifest.json includes Whisper checkpoint hash, transformers version, librosa version, seed (if relevant for any stochastic decoding), wall-clock for full inference run.
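A possible shape for the manifest writer, using only the standard library. The field names are illustrative assumptions, and the checkpoint hash is passed in rather than computed, since how you obtain it (for example, the commit hash reported by the Hub for the downloaded revision) is left to your BLUEPRINT.

```python
import json
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(checkpoint, checkpoint_hash, wall_clock_seconds, seed=None):
    """Collect the fields required by acceptance criterion 6."""
    return {
        "checkpoint": checkpoint,
        "checkpoint_hash": checkpoint_hash,
        "transformers_version": package_version("transformers"),
        "librosa_version": package_version("librosa"),
        "seed": seed,  # only relevant if any decoding step is stochastic
        "wall_clock_seconds": round(wall_clock_seconds, 3),
    }


def write_manifest(manifest, path="manifest.json"):
    """Write the manifest as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2)
```

Wall-clock time can be measured by taking time.perf_counter() before and after the full 24-clip run and passing the difference in.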

What this lab is intentionally NOT

  • Not Whisper training. We're not retraining. The model is used as-is.
  • Not a comparison of Whisper to other ASR. No wav2vec, no Conformer-RNN-T. One model, deep inspection.
  • Not streaming. Whisper is batch. All 24 clips are < 30 s.
  • Not multilingual. whisper-tiny.en is English-only. The multilingual whisper-tiny (no .en suffix) is about the same size (~39 M) but has a different vocabulary and behavior.

What you'll have learned

  • The log-mel frontend is concrete: a 30 s clip → (80, 3000) log-mel matrix. The processor does this for you, but now you know what's in input_features.
  • Cross-attention is interpretable in Whisper. The decoder's "look at the audio for the next text token" pattern is visible and roughly monotonic. Word-level timestamp alignment builds on this pattern; the <|N.NN|> timestamp tokens are predicted by the decoder like any other token.
  • A 39 M-param model is enough for clean speech in a known language. Most "transcription" tasks don't need GPT-scale; they need 39 M trained on enough audio.
  • The model has language-model priors baked in. It will sometimes "fix" grammar errors silently. This is great for ASR-as-product, bad for grammar-tutor-as-product. The grammar tutor needs a model that transcribes exactly what was said, not a smoothed version.

That last point is the interview-relevant finding: when you choose an ASR for a grammar tutor, you specifically want a model with less language-model prior than Whisper — perhaps wav2vec2 fine-tuned with CTC loss, which doesn't have the autoregressive language-model bias. The right model depends on whether you want a fluent transcript or a verbatim one.


Cross-references

  • theory/02-audio-models.md — the log-mel derivation and Whisper architecture.
  • theory/03-fusion-strategies.md §"unified-token" — note that Whisper is not unified-token; it's a dedicated audio encoder + text decoder. GPT-4o's audio capability is unified-token.
  • docs/phase-15-attention/theory/03-multi-head.md — multi-head attention. Whisper's cross-attention is the "encoder is keys/values, decoder is queries" variant we discussed there.
  • HIRING_PATH.md — "audio gap" line item — closed by completing this lab + the theory.