02 — Audio Models: log-mel, Whisper, HuBERT, wav2vec 2.0

Audio is neither text nor image. It is a continuous 1-D signal that must first be converted into an image-like 2-D representation (the log-mel spectrogram) before a transformer can operate on it. This section derives the chain: 16 kHz → 25 ms window / 10 ms hop → 80 mel × 3000 frames → conv downsampling → 1500 tokens. Then: Whisper, HuBERT, wav2vec 2.0.

This file answers: how does an audio waveform become a sequence of transformer-ready tokens, and what are the three major training paradigms for audio models?

References:

  • Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (Whisper), 2022. (arXiv:2212.04356)
  • Hsu et al., HuBERT, IEEE/ACM TASLP 2021. (arXiv:2106.07447)
  • Baevski et al., wav2vec 2.0, NeurIPS 2020. (arXiv:2006.11477)


Why audio needs a frontend

A raw audio signal at the standard speech-recognition sample rate of 16 kHz is a 1-D tensor of float32 amplitude values. Thirty seconds of audio is \(16{,}000 \cdot 30 = 480{,}000\) samples.

You cannot drop 480,000 samples directly into a transformer. Reasons:

  1. Sequence length. Attention is \(O(T^2)\). \(T = 480{,}000\) gives \(\sim 2 \cdot 10^{11}\) query-key scores per head per layer. FlashAttention removes the quadratic memory cost but not the quadratic compute, so this stays intractable.
  2. Information density. Most samples are highly correlated with their neighbors (the signal is band-limited). Adjacent samples carry almost the same information. A model with token granularity at the sample level wastes capacity learning trivial smoothness.
  3. Phase invariance. Two recordings of the same word offset by 1 ms produce very different sample-level waveforms (a 500 Hz component, whose period is 2 ms, flips sign: cosine ↔ −cosine), yet humans hear them as identical. A sample-level model has to learn this invariance from data, which wastes capacity.
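The first and third reasons are easy to check numerically. The sketch below counts query-key pairs for sample-level versus frame-level tokens, and shows that a 1 ms delay negates a 500 Hz cosine. The helper name `attention_pairs` is made up for this illustration.

```python
import math

SAMPLE_RATE = 16_000  # Hz, as in the text


def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs one attention head scores per layer."""
    return seq_len * seq_len


raw = attention_pairs(30 * SAMPLE_RATE)  # one token per sample
frames = attention_pairs(3000)           # one token per 10 ms frame
print(f"raw samples: {raw:.2e} pairs, frames: {frames:.2e} pairs")

# A 1 ms delay is 16 samples. For a 500 Hz cosine (period 2 ms) that is
# half a period, so every sample flips sign.
shift = SAMPLE_RATE // 1000
tone = [math.cos(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(64)]
delayed = [math.cos(2 * math.pi * 500 * (n - shift) / SAMPLE_RATE) for n in range(64)]
max_residual = max(abs(a + b) for a, b in zip(tone, delayed))
print(f"max |tone + delayed| = {max_residual:.1e}")
```

Moving from samples to 10 ms frames shrinks the attention cost by a factor of \(160^2\), which is why the frontend matters more than any attention-kernel optimization.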

The solution: convert the waveform into a log-mel spectrogram, a 2-D image-like representation where one axis is time (at much coarser resolution) and the other is frequency (perceptually warped). Whisper and many other speech models use this frontend; wav2vec 2.0 and HuBERT, covered later, instead learn a convolutional frontend directly on the raw waveform.


Log-mel spectrogram, derived

The chain has four steps. Commit each one to memory.

Step 1 — frame the waveform

Pick a window length and hop length. Whisper uses:

  • Window: 25 ms = \(0.025 \cdot 16{,}000 = 400\) samples.
  • Hop: 10 ms = \(0.010 \cdot 16{,}000 = 160\) samples.

For 30 s of audio (= \(480{,}000\) samples):

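\[ T_{\text{frames}} = \left\lfloor \frac{480{,}000 - 400}{160} \right\rfloor + 1 = 2997 + 1 = 2998 \approx 3000. \]

(Whisper's reference implementation gets exactly 3000: it uses a centered STFT, which pads the signal by half a window at each end and yields \(480{,}000 / 160 + 1 = 3001\) frames, then drops the last one. Clips shorter than 30 s are zero-padded to 480,000 samples first, so the buffer size is fixed.)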

So we go from 480 k samples to 3 k frames — a 160× compression on the time axis, set by the hop length.
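The frame count depends on how the edges are handled. This is a minimal sketch of the two common conventions; the function names are illustrative. The centered variant assumes the signal is padded by half a window on each side, which is how Whisper's reference code appears to arrive at 3000 frames (3001, minus the last one).

```python
def frames_unpadded(n_samples: int, window: int, hop: int) -> int:
    """Frames when every window must fit entirely inside the signal."""
    return (n_samples - window) // hop + 1


def frames_centered(n_samples: int, hop: int) -> int:
    """Frames when the signal is padded by window // 2 on both sides,
    so that frame i is centered on sample i * hop."""
    return n_samples // hop + 1


n = 30 * 16_000
print(frames_unpadded(n, window=400, hop=160))  # 2998
print(frames_centered(n, hop=160))              # 3001
```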

Step 2 — windowed FFT

For each frame, apply a Hann window (smooths discontinuities at frame boundaries; reduces spectral leakage), then take the discrete Fourier transform. Whisper sets the FFT size equal to the window length (\(n_{\text{fft}} = 400\), no zero-padding to a power of two), so the one-sided spectrum has \(400/2 + 1 = 201\) complex values, one per 40 Hz from 0 to 8 kHz.

The magnitude squared \(|X[f]|^2\) is the power spectrum of that frame — energy at each frequency bin.

After step 2: a 2-D tensor of shape \((3000, 201)\) — power at each (time, frequency) cell.
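To make step 2 concrete, here is a sketch that windows a single 400-sample frame and computes its one-sided power spectrum with a naive DFT. It uses only the standard library for clarity (a real pipeline would use an FFT), and it assumes a periodic Hann window and an FFT size equal to the window length.

```python
import cmath
import math

SAMPLE_RATE = 16_000
N_FFT = 400  # window length and FFT size


def hann(n: int) -> list[float]:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame: list[float]) -> list[float]:
    """Hann-window one frame and return |X[k]|^2 for k = 0 .. n/2."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    power = []
    for k in range(n // 2 + 1):
        x_k = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed))
        power.append(abs(x_k) ** 2)
    return power


# One 25 ms frame of a 1 kHz tone.
frame = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(N_FFT)]
spectrum = power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(len(spectrum), peak_bin, peak_bin * SAMPLE_RATE / N_FFT)
```

The tone lands in bin 25 because bins are spaced \(16{,}000 / 400 = 40\) Hz apart. The Hann window spreads some energy into bins 24 and 26, but far less than a rectangular window would leak for off-bin frequencies.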

Step 3 — mel-scale projection

The 201 linear-frequency bins are perceptually redundant in the high-frequency region (humans don't distinguish 8 kHz from 8.1 kHz; we do distinguish 100 Hz from 200 Hz). We project to a mel scale: 80 perceptually-spaced bands, dense at low frequency, sparse at high frequency.

Concretely: a fixed (not learned) projection matrix \(\mathbf{M} \in \mathbb{R}^{201 \times 80}\) with overlapping triangular filter weights. The output is:

\[ \mathbf{S}_{\text{mel}} = \mathbf{S}_{\text{power}} \cdot \mathbf{M} \in \mathbb{R}^{3000 \times 80}. \]
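A sketch of how such a matrix \(\mathbf{M}\) can be built. It assumes the HTK mel formula \(m = 2595 \log_{10}(1 + f/700)\) and unnormalized triangles. Whisper ships precomputed librosa filters, which use a slightly different mel scale and an area normalization, so the exact weights differ. All function names are illustrative.

```python
import math


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels: int = 80, n_fft: int = 400, sample_rate: int = 16_000) -> list[list[float]]:
    """Return M with shape (n_fft // 2 + 1, n_mels): one triangular filter per column."""
    n_bins = n_fft // 2 + 1
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    # n_mels + 2 edge frequencies, evenly spaced on the mel axis.
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    matrix = [[0.0] * n_mels for _ in range(n_bins)]
    for j in range(n_mels):
        lo, center, hi = edges[j], edges[j + 1], edges[j + 2]
        for k, f in enumerate(bin_hz):
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            matrix[k][j] = max(0.0, min(rising, falling))
    return matrix


def project(power: list[list[float]], matrix: list[list[float]]) -> list[list[float]]:
    """S_mel = S_power @ M for a (frames, bins) power spectrogram."""
    n_mels = len(matrix[0])
    return [[sum(p * row[j] for p, row in zip(frame, matrix)) for j in range(n_mels)]
            for frame in power]


M = mel_filterbank()
width = [sum(1 for row in M if row[j] > 0.0) for j in range(80)]
print(width[0], width[-1])  # bins covered by the lowest and highest filter
```

The lowest filter covers a single 40 Hz bin while the highest covers more than ten, which is the "dense at low frequency, sparse at high frequency" behaviour described above.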

Step 4 — log compression

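Audio energy spans many orders of magnitude (quiet pause vs. loud vowel). Take \(\log(\mathbf{S}_{\text{mel}} + \epsilon)\), then clip extreme values: Whisper's reference preprocessing uses \(\log_{10}\), floors everything more than 8 below the maximum (an 80 dB dynamic range), and rescales to roughly \([-1, 1]\).

Final output: an \(80 \times 3000\) log-mel spectrogram (the transpose of the \(3000 \times 80\) matrix above). Conceptually it is a single-channel "image", although Whisper's Conv1D stem reads it as a 3000-step sequence with 80 channels. This is what Whisper's encoder ingests.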
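The log step can be sketched in a few lines. The constants below (a floor of \(10^{-10}\), a dynamic range of 8 in \(\log_{10}\) units, and the \((x + 4)/4\) rescale) are assumed from Whisper's reference preprocessing; the function name is made up for this example.

```python
import math


def whisper_style_log(mel: list[list[float]]) -> list[list[float]]:
    """log10, floor at 8 below the maximum, then rescale."""
    logged = [[math.log10(max(v, 1e-10)) for v in row] for row in mel]
    floor = max(max(row) for row in logged) - 8.0
    return [[(max(v, floor) + 4.0) / 4.0 for v in row] for row in logged]


# Toy 2 x 2 "spectrogram" spanning 14 orders of magnitude.
toy = [[1e-12, 1.0], [100.0, 1e-3]]
compressed = whisper_style_log(toy)
print(compressed)
```

After compression the values span at most 2.0 units, whatever the input range was, so a near-silent cell cannot dominate the input statistics.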

Why these specific numbers

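  • 16 kHz sample rate captures up to 8 kHz (Nyquist). Most speech energy lies in ~80–8000 Hz, so 16 kHz is the standard rate for ASR. (Music is usually sampled at 44.1 kHz; telephone speech at 8 kHz is still intelligible but loses high-frequency consonant detail.)
  • 25 ms window spans two to three pitch periods of a low male voice (fundamental ~100 Hz, period 10 ms) and about five periods at ~200 Hz, typical of female voices, while staying short enough that the vocal tract barely changes. That gives a stable spectrum.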
  • 10 ms hop is shorter than typical phoneme duration (~50 ms), so each phoneme spans ~5 frames — adequate temporal resolution.
  • 80 mel bands is a long-standing default in speech recognition (80 filter banks were already used in DeepSpeech, 2014). More bands (128, 256) help marginally, and Whisper large-v3 itself moved to 128; fewer (40) hurt.

Whisper architecture

Whisper (OpenAI, 2022) is an encoder-decoder transformer on log-mel input, trained end-to-end for multi-task speech recognition + translation + language ID + timestamping. The architecture:

log-mel (80, 3000) input
2 × Conv1D (stride 1 then stride 2) → downsamples to (d_model, 1500)
+ sinusoidal position embedding (fixed, not learned — for the audio side)
N × Transformer encoder block (no causal mask)
audio embeddings (1500 tokens × d_model)
  └──── cross-attention key/value for decoder ──┐
  text tokens (BPE, 51865 vocab) ── M × Transformer decoder block ── next-token logits
                                  learned position embedding (text)

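Sizes (from the paper, Table 1):

Model      Params   d_model   Enc layers   Dec layers   Heads
tiny       39M      384       4            4            6
base       74M      512       6            6            8
small      244M     768       12           12           12
medium     769M     1024      24           24           16
large-v3   1550M    1280      32           32           20

(The paper's table lists the last row as large; large-v3 was released later and keeps the same dimensions.)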

Lab 02 uses tiny.en (English-only variant, 39 M params). On a CPU like the i5-8250U, transcription of 30 s of audio is ~5–10 s — usable for offline inspection.

Why conv-downsample to 1500?

The full mel input is 3000 frames. After two convolutions (kernel size 3; stride 1, then stride 2), the time axis halves to 1500. The motivation:

  • Compute. Attention is \(O(T^2)\). Halving \(T\) quarters the attention compute. For 30 s of audio, \(T = 3000\) is large; \(T = 1500\) is more tractable while still preserving phoneme-level resolution (one token per ~20 ms).
  • Inductive bias. Convolutions are the right tool for local time-frequency feature extraction at the input layer. They learn things like "spectral energy increase across 30 ms" that would take a transformer many layers to discover.
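The 3000 → 1500 step follows from the standard output-length formula for a 1-D convolution. The sketch below assumes kernel size 3 and padding 1 for both layers, as in Whisper's reference implementation; those two values are not stated in the text above.

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


frames = 3000
after_conv1 = conv1d_out_len(frames, kernel=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel=3, stride=2, padding=1)
ms_per_token = 30_000 / after_conv2
attention_saving = (frames / after_conv2) ** 2
print(after_conv1, after_conv2, ms_per_token, attention_saving)
```

Only the second layer changes the sequence length; the first one just lifts the 80 mel channels to `d_model` while keeping all 3000 steps.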

Multi-task tokens

Whisper does multiple tasks in a single decoder. The trick: special tokens prepended to the decoder input that select the task.

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> [text tokens] <|endoftext|>

vs. translation into English (the language token names the spoken language, here Spanish):

<|startoftranscript|> <|es|> <|translate|> <|notimestamps|> [text tokens] <|endoftext|>

vs. with timestamps:

<|startoftranscript|> <|en|> <|transcribe|> <|0.00|> the quick brown fox <|2.12|> ... <|endoftext|>

The timestamp tokens are 1501 special tokens encoding 0.00, 0.02, 0.04, ..., 30.00 seconds: one token per 20 ms, matching the resolution of the 1500-token audio sequence. The model is trained to emit these inline with text. Timestamps make it straightforward to stitch consecutive 30-s windows into a long-form transcript, but they do not make Whisper a streaming model; each window is still processed as a batch.
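The token layout can be sketched with plain strings. This is an illustration only: real decoding works on integer token IDs from Whisper's tokenizer, and `decoder_prefix` and `timestamp_token` are hypothetical helper names. The sketch assumes the reference tokenizer's 1501 timestamp tokens, covering 0.00 through 30.00 s inclusive.

```python
TIME_STEP = 0.02  # seconds per timestamp token

# 0.00, 0.02, ..., 30.00: 1501 tokens, assumed from the reference tokenizer.
TIMESTAMP_TOKENS = [f"<|{i * TIME_STEP:.2f}|>" for i in range(1501)]


def decoder_prefix(language: str, task: str, timestamps: bool) -> list[str]:
    """Special tokens that precede the text in the decoder input."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


def timestamp_token(seconds: float) -> str:
    """Nearest timestamp token for a time inside the 30 s window."""
    index = round(seconds / TIME_STEP)
    return TIMESTAMP_TOKENS[min(max(index, 0), len(TIMESTAMP_TOKENS) - 1)]


print(" ".join(decoder_prefix("en", "transcribe", timestamps=False)))
print(timestamp_token(2.12))
```

Because task selection is just a change in the prefix, one set of decoder weights serves transcription, translation, and timestamped output.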

Lab 02 inspects exactly this. For a 5-second clip of someone saying a verb conjugation, we look at:

  1. Which timestamp tokens the model emits and at what positions.
  2. The cross-attention pattern from the text-token positions to the 1500 audio tokens, to verify that the cross-attention concentrates on the audio region matching the spoken word.


HuBERT and wav2vec 2.0: self-supervised audio pretraining

Whisper trains supervised: 680k hours of audio with text transcripts (web-scraped, weakly supervised). For tasks where transcribed audio is scarce (low-resource languages, domain-specific audio), self-supervised pretraining is the answer.

wav2vec 2.0 (Facebook, 2020)

Architecture:

raw waveform (1-D, 16 kHz)
CNN feature encoder (7 convolutional blocks) → (T', 512)
quantization codebook → discrete tokens (used as targets only)
Transformer encoder (with span-masking on inputs) → (T', d)
Contrastive loss: predict the quantized token at masked positions

The loss is contrastive: at a masked time-step, the model must select the correct quantized representation from a pool of "distractors". This is directly analogous to BERT's masked-language-modeling, but at the audio-frame level with a continuous-to-discrete codebook bridge.
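A minimal sketch of that contrastive objective for one masked time-step, assuming cosine similarity and a temperature of 0.1 as in the paper. It leaves out the codebook diversity term and the sampling of distractors from other masked positions; the function names are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def contrastive_loss(context, target, distractors, temperature=0.1):
    """-log softmax score of the true quantized target among all candidates."""
    logits = [cosine(context, q) / temperature for q in [target, *distractors]]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[0]


easy = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
hard = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(f"aligned with target: {easy:.5f}, aligned with a distractor: {hard:.2f}")
```

The loss is near zero when the transformer output points at the true quantized vector and large when it points at a distractor, so minimizing it forces the context network to infer what was masked.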

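After pretraining: fine-tune with CTC loss on transcribed audio for ASR. The paper reports that fine-tuning on just one hour of labeled data outperforms the previous state of the art trained on the 100-hour LibriSpeech subset, and that ten minutes of labels (with 53k hours of unlabeled pretraining) still reaches 4.8/8.2 WER.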

HuBERT (Facebook, 2021)

A refinement of wav2vec 2.0. Instead of a learned quantizer, HuBERT uses k-means clustering on MFCC features (and then iteratively re-clusters on the model's own hidden representations) to produce discrete pseudo-labels. The pretraining task becomes: predict the cluster ID of each masked frame.

Why this is better: the quantizer in wav2vec 2.0 is trained jointly with the model, which is a moving target. HuBERT's k-means targets are stable (offline), which makes training more reliable.
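The two pieces of the HuBERT recipe, offline pseudo-labels and a loss restricted to masked frames, can be sketched on toy data. The centroids here are hand-picked stand-ins for a fitted k-means model, plain logits stand in for the model's output layer, and all names are illustrative.

```python
import math


def nearest_centroid(frame: list[float], centroids: list[list[float]]) -> int:
    """Pseudo-label for one frame: index of the closest k-means centroid."""
    distances = [sum((x - c) ** 2 for x, c in zip(frame, centroid)) for centroid in centroids]
    return distances.index(min(distances))


def masked_cross_entropy(logits, labels, mask) -> float:
    """Mean cross-entropy over masked frames only."""
    losses = []
    for frame_logits, label, is_masked in zip(logits, labels, mask):
        if not is_masked:
            continue
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in frame_logits))
        losses.append(log_norm - frame_logits[label])
    return sum(losses) / len(losses)


# Toy data: 4 frames of 2-D features, 2 centroids fitted offline.
features = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
labels = [nearest_centroid(f, centroids) for f in features]
mask = [False, True, True, False]
uniform_logits = [[0.0, 0.0] for _ in features]  # an untrained model
print(labels, masked_cross_entropy(uniform_logits, labels, mask))
```

The labels are computed once before training and do not move while the model learns, which is the stability argument made above.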

Pretrained encoders in the HuBERT and wav2vec 2.0 style are a common basis for downstream speech models (Meta's MMS, for example, is built on wav2vec 2.0).


Streaming caveats

Whisper, HuBERT, and wav2vec 2.0 are all batch models: they expect the full audio (up to 30 s for Whisper) before producing output. Streaming ASR (output partial transcripts as audio comes in) requires a different architecture:

  • Causal encoder. Replace bidirectional encoder attention with causal — encoder block at time \(t\) only sees audio frames \(\le t\). Costs a few % WER (word error rate).
  • Chunked attention. Encoder block sees a fixed-size sliding window of frames. Trades latency for accuracy.
  • Conformer-Transducer (Google). Use a CNN-augmented transformer encoder (Conformer) + an RNN-T (Recurrent Neural Network Transducer) decoder. Standard for production streaming ASR.
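The first two options differ only in the attention mask. Below is a sketch of both as boolean matrices. The sliding-window variant with a `left_context` parameter is one simple reading of "chunked"; production systems often use block-wise chunks with a little lookahead instead.

```python
def attention_mask(n_frames: int, left_context=None) -> list[list[bool]]:
    """mask[t][s] is True when frame t may attend to frame s.

    left_context=None gives a plain causal mask (all frames s <= t);
    an integer w keeps only the w most recent frames, s in (t - w, t].
    """
    mask = []
    for t in range(n_frames):
        earliest = 0 if left_context is None else max(0, t - left_context + 1)
        mask.append([earliest <= s <= t for s in range(n_frames)])
    return mask


for row in attention_mask(5, left_context=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With a bounded window the cost per new frame is constant, whereas a plain causal mask still grows with the length of the utterance.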

Whisper itself is not streaming. People build streaming wrappers around it (chunk audio into 5-s blocks, transcribe each, stitch with overlap) but the underlying model is batch.
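The chunking half of such a wrapper is simple to sketch. The 1-second overlap below is an assumption chosen for illustration, and `chunk_spans` is a hypothetical helper; merging the overlapping transcripts is the harder part and is not shown.

```python
def chunk_spans(n_samples: int, chunk: int, overlap: int) -> list[tuple[int, int]]:
    """(start, end) sample ranges of fixed-size chunks that overlap their neighbours."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    spans, start = [], 0
    while True:
        end = min(start + chunk, n_samples)
        spans.append((start, end))
        if end == n_samples:
            return spans
        start += chunk - overlap


RATE = 16_000
spans = chunk_spans(12 * RATE, chunk=5 * RATE, overlap=1 * RATE)
print([(s / RATE, e / RATE) for s, e in spans])
```

Each chunk is still transcribed as a batch, so the latency of this approach is at least one chunk length plus inference time.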

Why this matters for the grammar tutor (Phase 32): if the user speaks a full sentence and waits for feedback, batch Whisper is fine. If you want live correction while the user is speaking, you need a streaming model. The X2 labs don't address streaming.


Summary of the audio side

  • Raw audio (480 k samples at 16 kHz for 30 s) cannot go into a transformer directly.
  • Log-mel frontend compresses to 3000 frames × 80 mel-bands. This is the standard input representation for Whisper-style models; wav2vec 2.0 and HuBERT learn their own frontend from the raw waveform.
  • Whisper is an encoder-decoder transformer that ingests log-mel, conv-downsamples to 1500 audio tokens, and decodes text + timestamps. 39 M to 1.5 B params.
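  • wav2vec 2.0 / HuBERT are self-supervised: both predict discrete targets at masked frames (wav2vec 2.0 with a contrastive loss over a jointly learned quantizer, HuBERT by classifying offline k-means cluster IDs). Best when transcribed audio is scarce.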
  • Streaming requires explicit architectural changes (causal / chunked / RNN-T); the X2 models are all batch.

Next: theory/03-fusion-strategies.md shows how vision and audio (and text) get combined into a single multi-modal model.