From Sound Waves to Mel Spectrograms: Whisper's Audio Frontend
Prerequisites
- Article 1 (architecture overview)
- Basic understanding of frequency domain / Fourier transforms
- PyTorch tensor operations
As we mapped out in Part 1, Whisper's pipeline begins with raw audio and ends with a sequence of feature vectors the decoder can attend to. This article traces that journey in exact detail — from the FFmpeg subprocess that ingests any audio format, through the STFT and mel filterbank computation, to the convolutional stem of the AudioEncoder that halves the temporal resolution. We'll track tensor shapes at every stage, because in signal processing, dimension bugs are the most common and the most painful.
The entire audio frontend lives in two files: whisper/audio.py (157 lines) handles everything up to the mel spectrogram, and the AudioEncoder class in whisper/model.py takes it from there.
Audio Loading via FFmpeg
Whisper doesn't use librosa, soundfile, or any Python audio library for loading. Instead, load_audio() shells out to FFmpeg as a subprocess:
cmd = [
    "ffmpeg",
    "-nostdin",
    "-threads", "0",
    "-i", file,
    "-f", "s16le",
    "-ac", "1",
    "-acodec", "pcm_s16le",
    "-ar", str(sr),
    "-"
]
out = run(cmd, capture_output=True, check=True).stdout
return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
This is a pragmatic decision with several advantages. FFmpeg can decode virtually any audio format (MP3, FLAC, OGG, AAC, WAV, etc.) without Python bindings. The output is raw PCM written to stdout — 16-bit signed integers at 16kHz mono — which is trivially parsed with np.frombuffer. The final division by 32768.0 normalizes to the [-1.0, 1.0] float32 range.
The tradeoff is that FFmpeg must be installed on the system. But for a tool targeting ML practitioners who likely already have it, this is reasonable.
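The byte-parsing step is easy to sanity-check in isolation. Here is a minimal sketch, where `raw_pcm` is a hypothetical stand-in for FFmpeg's stdout:

```python
import numpy as np

# Hypothetical stand-in for FFmpeg's stdout: raw little-endian 16-bit PCM bytes.
raw_pcm = np.array([0, 16384, -16384, 32767, -32768], dtype=np.int16).tobytes()

# Same parsing as load_audio(): reinterpret the bytes, then scale into float range.
audio = np.frombuffer(raw_pcm, np.int16).flatten().astype(np.float32) / 32768.0

assert audio.dtype == np.float32
assert audio[1] == 0.5       # 16384 / 32768
assert audio[4] == -1.0      # -32768 / 32768: the negative end maps exactly to -1.0
```

Note the asymmetry of the int16 range: -32768 maps exactly to -1.0, while the positive extreme 32767 maps to just under 1.0.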
The 30-Second Chunk Design
Six constants at the top of audio.py define the entire temporal structure of the system:
SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160
CHUNK_LENGTH = 30
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE # 480000
N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH) # 3000
And three derived constants relate audio frames to model tokens:
N_SAMPLES_PER_TOKEN = HOP_LENGTH * 2 # 320 (stride-2 conv)
FRAMES_PER_SECOND = exact_div(SAMPLE_RATE, HOP_LENGTH) # 100
TOKENS_PER_SECOND = exact_div(SAMPLE_RATE, N_SAMPLES_PER_TOKEN) # 50
The 30-second chunk is the fundamental unit of processing. The model was trained on 30-second audio segments, and the positional embeddings in the encoder are fixed at 1500 positions (3000 mel frames ÷ 2 from the stride-2 convolution). Every piece of audio, regardless of length, is ultimately processed in 30-second windows.
The pad_or_trim() function enforces this constraint. It handles both numpy arrays and PyTorch tensors, zero-padding short clips or truncating long ones to exactly N_SAMPLES (480,000) samples. Notice it works along an arbitrary axis, which is important when operating on mel spectrograms (2D) rather than raw waveforms (1D).
Tip: The exact_div() utility from utils.py is used instead of regular division to catch configuration errors at import time. If the numbers don't divide evenly, something is wrong with the audio constants.
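For reference, the utility amounts to integer division guarded by a divisibility assertion:

```python
def exact_div(x: int, y: int) -> int:
    # Fail loudly if y does not divide x evenly, rather than silently truncating.
    assert x % y == 0
    return x // y

assert exact_div(480000, 160) == 3000   # N_SAMPLES / HOP_LENGTH
# exact_div(480000, 7) would raise AssertionError — at import time, for module constants.
```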
STFT and Mel Spectrogram Computation
The log_mel_spectrogram() function is the core of the audio frontend. It accepts a file path, numpy array, or torch tensor, and returns a log-mel spectrogram ready for the encoder.
The computation proceeds in four stages:
flowchart LR
A["Waveform\n[480000]"] -->|"torch.stft\nN_FFT=400\nHOP=160"| B["Complex STFT\n[201 × 3001]"]
B -->|"abs()² → magnitudes\n(drop last frame)"| C["Power Spectrum\n[201 × 3000]"]
C -->|"mel_filters @\n(matrix multiply)"| D["Mel Spectrum\n[80 × 3000]"]
D -->|"clamp → log10\nmax-normalize → scale"| E["Log-Mel\n[80 × 3000]"]
Let's look at each step in the code:
window = torch.hann_window(N_FFT).to(audio.device)
stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
magnitudes = stft[..., :-1].abs() ** 2
The STFT uses a 400-sample (25ms) Hann window with a 160-sample (10ms) hop, producing (N_FFT/2 + 1) = 201 frequency bins. The [..., :-1] drops the last time frame to get exactly 3000 frames from 480000 samples.
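The frame counts follow from the STFT parameters alone. A quick arithmetic check (plain Python, no torch needed), using the fact that torch.stft defaults to center=True, which reflection-pads half a window on each side and yields one frame per hop position:

```python
N_SAMPLES, N_FFT, HOP_LENGTH = 480000, 400, 160

n_freq_bins = N_FFT // 2 + 1                  # 201: one-sided spectrum of a real signal
n_stft_frames = N_SAMPLES // HOP_LENGTH + 1   # 3001: centered STFT, one frame per hop
n_mel_frames = n_stft_frames - 1              # 3000: after [..., :-1] drops the last frame

assert (n_freq_bins, n_mel_frames) == (201, 3000)
```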
The mel filterbank is loaded from a pre-computed .npz file by mel_filters():
filters = mel_filters(audio.device, n_mels) # [80, 201] or [128, 201]
mel_spec = filters @ magnitudes # [80, 3000]
The @lru_cache on mel_filters() ensures the filterbank is loaded from disk only once. Storing it as a .npz asset instead of computing it at runtime avoids a dependency on librosa — the docstring even shows the exact librosa call used to generate the file.
Finally, the log-mel normalization:
log_spec = torch.clamp(mel_spec, min=1e-10).log10()
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
This three-step normalization (1) prevents log-of-zero, (2) clamps the dynamic range to 80 dB below the peak, and (3) shifts and scales to roughly center the values around zero. The constants 8.0 and 4.0 are training-time choices baked into the model.
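A pure-Python walkthrough of one value through these three steps — here `peak_log` is a made-up stand-in for `log_spec.max()` over the whole spectrogram:

```python
import math

mel_value = 0.0     # a silent mel bin: power of exactly zero
peak_log = 2.0      # pretend log10 value of the loudest mel bin

x = math.log10(max(mel_value, 1e-10))   # clamp prevents log10(0) = -inf  -> -10.0
x = max(x, peak_log - 8.0)              # floor at 80 dB below the peak   -> -6.0
x = (x + 4.0) / 4.0                     # shift and scale toward zero     -> -0.5

assert x == -0.5
```

The 8.0 floor is 80 dB because these are power values: 10 · log10(ratio) dB means 8 log10 units spans 80 dB of dynamic range.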
Tensor Shape Transformations
Understanding dimension flow is critical for debugging and extending Whisper. Here's the complete pipeline from waveform to encoder output:
flowchart TD
A["Raw Waveform\n<b>[480000]</b>"] --> B["STFT Output\n<b>[201 × 3001]</b>"]
B --> C["Power Magnitudes\n<b>[201 × 3000]</b>"]
C --> D["Mel Spectrogram\n<b>[80 × 3000]</b>\n(or 128 × 3000)"]
D --> E["After conv1\n<b>[d_model × 3000]</b>\nkernel=3, pad=1"]
E --> F["After conv2\n<b>[d_model × 1500]</b>\nkernel=3, stride=2, pad=1"]
F --> G["Permuted + Positional\n<b>[1500 × d_model]</b>"]
G --> H["Transformer Output\n<b>[1500 × d_model]</b>"]
Key dimensions for each model size:
| Model | d_model (n_audio_state) | n_mels | Encoder Layers | Encoder Heads |
|---|---|---|---|---|
| tiny | 384 | 80 | 4 | 6 |
| base | 512 | 80 | 6 | 8 |
| small | 768 | 80 | 12 | 12 |
| medium | 1024 | 80 | 24 | 16 |
| large-v3 | 1280 | 128 | 32 | 20 |
| turbo | 1280 | 128 | 32 | 20 |
Note that large-v3 and turbo use 128 mel bins instead of 80. The mel filterbank asset contains both variants (mel_80 and mel_128), and the n_mels field in ModelDimensions selects the right one.
The AudioEncoder: Conv Stem and Transformer
The AudioEncoder is surprisingly compact:
class AudioEncoder(nn.Module):
    def __init__(self, n_mels, n_ctx, n_state, n_head, n_layer):
        super().__init__()
        self.conv1 = Conv1d(n_mels, n_state, kernel_size=3, padding=1)
        self.conv2 = Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1)
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))
        self.blocks = nn.ModuleList(
            [ResidualAttentionBlock(n_state, n_head) for _ in range(n_layer)]
        )
        self.ln_post = LayerNorm(n_state)
The convolutional stem is two layers of 1D convolutions, both with kernel size 3 and GELU activation. The first (conv1) projects from n_mels channels to n_state (the model's hidden dimension) without changing the temporal resolution. The second (conv2) uses stride=2 to halve the sequence from 3000 to 1500 frames. This is the only temporal downsampling in the entire encoder — the Transformer blocks that follow preserve the 1500-frame resolution.
flowchart LR
A["Mel\n[80 × 3000]"] -->|"Conv1d k=3 p=1\nGELU"| B["[d × 3000]"]
B -->|"Conv1d k=3 s=2 p=1\nGELU"| C["[d × 1500]"]
C -->|"permute(0,2,1)"| D["[1500 × d]"]
D -->|"+ sinusoidal pos emb"| E["[1500 × d]"]
E -->|"N × ResidualAttentionBlock\n(self-attention only)"| F["[1500 × d]"]
F -->|"LayerNorm"| G["[1500 × d]"]
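The 3000 → 1500 halving falls out of the standard Conv1d output-length formula; a quick check:

```python
def conv1d_out_len(l_in: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    # Standard formula: L_out = floor((L_in + 2*pad - kernel) / stride) + 1
    return (l_in + 2 * pad - kernel) // stride + 1

after_conv1 = conv1d_out_len(3000, kernel=3, stride=1, pad=1)          # length preserved
after_conv2 = conv1d_out_len(after_conv1, kernel=3, stride=2, pad=1)   # length halved

assert (after_conv1, after_conv2) == (3000, 1500)
```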
A notable design choice: the encoder uses sinusoidal positional embeddings (generated by sinusoids()), while the decoder uses learned positional embeddings (an nn.Parameter). Sinusoidal embeddings are registered as a buffer (not trained), which makes sense for the encoder since the audio context length is always exactly 1500 — there's nothing to learn. The decoder, however, processes variable-length token sequences up to n_text_ctx=448, where learned positions might capture more nuanced patterns.
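For illustration, here is a numpy transcription of what sinusoids() computes (the original uses torch tensors; the shapes and the sin-half/cos-half layout shown here match its docstringed behavior):

```python
import numpy as np

def sinusoids_np(length: int, channels: int, max_timescale: float = 10000) -> np.ndarray:
    """Numpy sketch of whisper.model.sinusoids(): geometric range of timescales."""
    assert channels % 2 == 0
    log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = np.exp(-log_timescale_increment * np.arange(channels // 2))
    scaled_time = np.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
    # First half of the channels gets sin, second half gets cos.
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

pos = sinusoids_np(1500, 384)  # tiny model: n_ctx=1500, n_state=384
assert pos.shape == (1500, 384)
assert pos[0, 0] == 0.0 and pos[0, 384 // 2] == 1.0  # position 0: sin(0)=0, cos(0)=1
```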
The ResidualAttentionBlock used in the encoder contains only self-attention (no cross-attention), a feed-forward MLP with 4× expansion, and pre-norm LayerNorm. The encoder blocks don't need cross-attention because there's nothing to cross-attend to — the encoder operates solely on the audio features.
Tip: Whisper wraps nn.LayerNorm, nn.Linear, and nn.Conv1d with custom subclasses that cast to float32 for normalization and cast weights to the input's dtype for linear/conv operations. This ensures numerical stability with FP16 inference without requiring explicit mixed-precision management.
From Spectrogram to Encoder: The Full Picture
In transcribe(), the mel spectrogram is computed for the entire audio file upfront (with 30 seconds of padding), stored as a single tensor, and then sliced as the sliding window advances:
mel = log_mel_spectrogram(audio, model.dims.n_mels, padding=N_SAMPLES)
content_frames = mel.shape[-1] - N_FRAMES
This means the STFT is computed only once, not per-window. Each iteration of the transcription loop extracts a 30-second slice, pads it to 3000 frames, and sends it through the encoder. The padding ensures the final window has enough frames even if the audio doesn't end on a 30-second boundary.
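A simplified sketch of that sliding-window indexing, using a hypothetical 45-second file. (The real loop in transcribe() advances seek by however much audio the decoder actually consumed, which can be less than a full window when the segment ends at a timestamp token.)

```python
HOP_LENGTH, N_FRAMES = 160, 3000

total_mel_frames = 4500 + N_FRAMES            # 45s of audio + 30s of padding
content_frames = total_mel_frames - N_FRAMES  # frames that carry real audio: 4500

seek, windows = 0, []
while seek < content_frames:
    # Slice bounds for this iteration; the slice is then padded to N_FRAMES if short.
    windows.append((seek, min(seek + N_FRAMES, total_mel_frames)))
    seek += N_FRAMES                          # best case: consume the whole window

assert windows == [(0, 3000), (3000, 6000)]   # two 30-second windows cover 45s
```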
The encoder output — a [batch, 1500, d_model] tensor — becomes the key-value context for the decoder's cross-attention layers. This is computed once per 30-second window and reused for every autoregressive decoding step within that window (via the KV-cache, which we'll explore in Article 4).
What's Next
We've traced the full journey from sound wave to encoder feature vector. In Article 3, we'll shift to the text side and examine Whisper's tokenization system — how tiktoken encodes text, how 1501 timestamp tokens encode segment boundaries at 20ms resolution, and how the special token sequence tells the model what language and task to perform.