# Whisper's Architecture at a Glance: Navigating OpenAI's Speech Recognition Codebase
## Prerequisites
- Basic Python proficiency
- General awareness of neural network concepts
- Familiarity with PyTorch `nn.Module` basics
OpenAI's Whisper is one of those rare open-source projects that punches well above its weight in terms of code-to-capability ratio. The entire system — multilingual speech recognition, translation, language detection, and word-level timestamps — fits in roughly 2,500 lines of Python spread across just nine modules. No sprawling abstractions, no plugin architecture, no configuration DSL. Just a tight encoder-decoder Transformer with a well-designed pipeline around it.
This article is the first in a six-part series dissecting every meaningful line in the Whisper codebase. By the end of this piece, you'll have a mental map of the entire project and know exactly which file to open for any question about the system.
## Project Structure and Module Map
The Whisper package is flat — there's no nested subpackage hierarchy, no core/ vs utils/ vs services/ sprawl. Every module lives at the top level of the whisper/ directory, and each has a single, clear responsibility.
| Module | Lines (approx.) | Responsibility |
|---|---|---|
| `__init__.py` | 162 | Model registry, download, `load_model()` API |
| `model.py` | 345 | `ModelDimensions`, `Whisper`, encoder/decoder architecture |
| `audio.py` | 157 | Audio loading, mel spectrogram computation |
| `tokenizer.py` | 395 | Tiktoken wrapper, special tokens, language support |
| `decoding.py` | 826 | Autoregressive decoding, beam search, logit filters |
| `transcribe.py` | 623 | Sliding-window transcription loop, CLI |
| `timing.py` | 389 | Word timestamps via cross-attention + DTW |
| `utils.py` | 318 | Output writers, formatting, compression ratio |
| `triton_ops.py` | 118 | GPU-accelerated DTW and median filter kernels |
There's also an assets/ directory holding pre-computed mel filterbanks (mel_filters.npz) and BPE vocabularies (gpt2.tiktoken, multilingual.tiktoken), a normalizers/ subpackage for text normalization, and a version.py for version tracking.
```mermaid
flowchart TD
    subgraph "whisper/ package"
        INIT["__init__.py\n(model registry + load)"]
        MODEL["model.py\n(Whisper nn.Module)"]
        AUDIO["audio.py\n(mel spectrograms)"]
        TOKEN["tokenizer.py\n(tiktoken wrapper)"]
        DECODE["decoding.py\n(beam search + filters)"]
        TRANS["transcribe.py\n(sliding window + CLI)"]
        TIMING["timing.py\n(word timestamps)"]
        UTILS["utils.py\n(output writers)"]
        TRITON["triton_ops.py\n(GPU kernels)"]
    end
    INIT --> MODEL
    INIT --> AUDIO
    INIT --> DECODE
    INIT --> TRANS
    TRANS --> AUDIO
    TRANS --> DECODE
    TRANS --> TIMING
    TRANS --> TOKEN
    DECODE --> TOKEN
    TIMING --> TRITON
    TIMING --> MODEL
    MODEL --> DECODE
    MODEL --> TRANS
```
> Tip: The dependency arrows between `model.py`, `decoding.py`, and `transcribe.py` are circular by design. Whisper resolves this through a clever method-binding pattern we'll examine shortly — it's one of the most interesting structural decisions in the codebase.
## Entry Points: CLI and Python API
Whisper offers two convergent entry points, and both ultimately call the same transcribe() function.
The CLI entry point is declared in `pyproject.toml#L35`:

```toml
[project.scripts]
whisper = "whisper.transcribe:cli"
```
This means running whisper from the command line invokes cli() in transcribe.py. You can also run python -m whisper, which hits whisper/__main__.py — a two-line file that simply imports and calls cli().
The Python API entry point is whisper.load_model(), exported from whisper/__init__.py. After loading a model, you call model.transcribe(audio), which invokes the same underlying function.
```mermaid
flowchart LR
    A["$ whisper audio.mp3"] --> B["transcribe.py:cli()"]
    C["$ python -m whisper"] --> D["__main__.py"] --> B
    E["model.transcribe(audio)"] --> F["transcribe.py:transcribe()"]
    B --> G["load_model()"] --> F
```
Both paths converge on transcribe(), which is the heart of the system. The CLI adds argparse scaffolding, model loading, and output writing around it; the Python API exposes it directly as a method on the Whisper class.
## Model Registry and Download System
Whisper ships 12 model variants, from the 39M-parameter tiny to the 1.55B-parameter large-v3. The model registry is a plain dictionary mapping model names to Azure CDN URLs in whisper/__init__.py#L17-L32:
```python
_MODELS = {
    "tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
    # ...
    "turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
}
```
Notice how the SHA256 hash is embedded in the URL path itself — the penultimate path segment is the expected checksum. The _download() function at line 54 extracts it with url.split("/")[-2] and verifies the downloaded bytes against it. This is an elegant trick: no separate checksums file, no manifest — the URL is the integrity record.
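The trick is easy to replicate. Here's a hedged sketch of the same idea — the function names are mine, not Whisper's, and the real `_download()` also handles caching and re-download on mismatch:

```python
import hashlib


def expected_sha256(url: str) -> str:
    # The penultimate path segment of the model URL is the checksum
    return url.split("/")[-2]


def verify_download(url: str, model_bytes: bytes) -> bool:
    # Hash the downloaded payload and compare it against the URL-embedded digest
    return hashlib.sha256(model_bytes).hexdigest() == expected_sha256(url)
```

Because the digest travels with the URL, any registry entry is self-verifying: there is nothing else to keep in sync.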
Alongside the model URLs, there's _ALIGNMENT_HEADS at lines 36-51 — base85-encoded, gzip-compressed boolean arrays identifying which cross-attention heads correlate best with word-level timing. These are decoded in model.py's set_alignment_heads() method and used during word timestamp extraction (covered in Article 6).
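The exact array shapes are model-specific, but the encode/decode round trip itself is ordinary stdlib plus NumPy. A toy sketch of the scheme — the shapes and the chosen head here are illustrative, not taken from a real checkpoint:

```python
import base64
import gzip

import numpy as np

# Toy (layers x heads) boolean mask; real shapes depend on the model variant
mask = np.zeros((4, 6), dtype=bool)
mask[2, 3] = True  # pretend head 3 of layer 2 tracks word timing

# Encode: raw bool bytes -> gzip -> base85 (compact enough to inline in source)
dump = base64.b85encode(gzip.compress(mask.tobytes()))

# Decode: the inverse pipeline, run once when the model is loaded
restored = np.frombuffer(
    gzip.decompress(base64.b85decode(dump)), dtype=bool
).reshape(mask.shape)
```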
The load_model() function at lines 103-161 handles three cases: loading a named model from the registry, loading from a local file path, or failing with a helpful error. It deserializes the checkpoint, constructs ModelDimensions from the saved dims dict, instantiates the Whisper class, loads weights, and configures alignment heads. The model is cached to ~/.cache/whisper by default (respecting XDG_CACHE_HOME).
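The dispatch between those three cases is straightforward. A hedged sketch of just that branch logic, with a stand-in registry — the real function additionally handles downloading, device placement, and the checkpoint deserialization described above:

```python
import os

# Abridged stand-in for the real _MODELS registry
_MODELS = {"tiny": "https://example.com/<sha256>/tiny.pt"}


def resolve_checkpoint(name: str) -> str:
    """Return a download URL for a registered name, or a local .pt path."""
    if name in _MODELS:
        return _MODELS[name]  # case 1: named model from the registry
    if os.path.isfile(name):
        return name           # case 2: path to a local checkpoint
    # case 3: fail with a helpful error listing what IS available
    raise RuntimeError(
        f"Model {name} not found; available models = {list(_MODELS)}"
    )
```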
## The Whisper Class and ModelDimensions
The entire architecture is parameterized by a single 10-field dataclass: ModelDimensions.
```python
@dataclass
class ModelDimensions:
    n_mels: int         # mel frequency bins (80 or 128)
    n_audio_ctx: int    # audio context length (1500)
    n_audio_state: int  # encoder hidden size
    n_audio_head: int   # encoder attention heads
    n_audio_layer: int  # encoder layers
    n_vocab: int        # vocabulary size
    n_text_ctx: int     # text context length (448)
    n_text_state: int   # decoder hidden size
    n_text_head: int    # decoder attention heads
    n_text_layer: int   # decoder layers
```
These 10 numbers fully specify every dimension in the model. The Whisper class at lines 252-345 composes an AudioEncoder and TextDecoder from these dimensions:
```mermaid
classDiagram
    class ModelDimensions {
        +int n_mels
        +int n_audio_ctx
        +int n_audio_state
        +int n_audio_head
        +int n_audio_layer
        +int n_vocab
        +int n_text_ctx
        +int n_text_state
        +int n_text_head
        +int n_text_layer
    }
    class Whisper {
        +ModelDimensions dims
        +AudioEncoder encoder
        +TextDecoder decoder
        +embed_audio(mel)
        +logits(tokens, audio_features)
        +detect_language()
        +transcribe()
        +decode()
    }
    class AudioEncoder {
        +Conv1d conv1
        +Conv1d conv2
        +Tensor positional_embedding
        +ModuleList blocks
    }
    class TextDecoder {
        +Embedding token_embedding
        +Parameter positional_embedding
        +ModuleList blocks
    }
    Whisper --> ModelDimensions
    Whisper --> AudioEncoder
    Whisper --> TextDecoder
```
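To make the ten-field parameterization concrete, here's what instantiating the dataclass looks like. The values below are what I believe the `tiny` checkpoint uses; treat them as illustrative rather than authoritative:

```python
from dataclasses import dataclass


@dataclass
class ModelDimensions:
    n_mels: int
    n_audio_ctx: int
    n_audio_state: int
    n_audio_head: int
    n_audio_layer: int
    n_vocab: int
    n_text_ctx: int
    n_text_state: int
    n_text_head: int
    n_text_layer: int


# Believed to match the multilingual "tiny" checkpoint (illustrative values)
tiny = ModelDimensions(
    n_mels=80, n_audio_ctx=1500, n_audio_state=384,
    n_audio_head=6, n_audio_layer=4, n_vocab=51865,
    n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4,
)
```

Swapping in a different set of ten numbers is all it takes to go from `tiny` to `large-v3`; no code path changes.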
The most architecturally interesting detail is at the very bottom of model.py — lines 343-345:
```python
detect_language = detect_language_function
transcribe = transcribe_function
decode = decode_function
```
These lines bind module-level functions from decoding.py and transcribe.py as methods on the Whisper class. This is how Whisper avoids circular imports: model.py imports decode and transcribe as functions (imported at the top of the file), then assigns them as class attributes. The decoding.py and transcribe.py modules reference the Whisper type only under TYPE_CHECKING, so there's no runtime circular dependency.
When you call model.transcribe(audio), Python passes the model instance as self — which becomes the model parameter in the transcribe() function. It's simple, it works, and it avoids the complexity of a plugin system.
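The pattern is plain Python, nothing framework-specific. A minimal self-contained sketch with toy names (not Whisper's actual code):

```python
# transcribe_function lives at module level (in Whisper, in another module);
# its first parameter plays the role of `self` once bound to the class.
def transcribe_function(model, audio):
    return f"[{model.name}] transcribed {audio!r}"


class Model:
    name = "demo"

    # Assigning a plain function as a class attribute makes it a method
    transcribe = transcribe_function


print(Model().transcribe("clip.wav"))  # → [demo] transcribed 'clip.wav'
```

Calling `Model().transcribe("clip.wav")` passes the instance as the first argument, exactly as if `transcribe` had been defined inside the class body.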
## End-to-End Data Flow Overview
Before diving deep in the coming articles, let's trace the full path from audio file to text output:
```mermaid
flowchart TD
    A["Audio File\n(any format)"] -->|"ffmpeg subprocess"| B["Raw Waveform\n16kHz mono float32"]
    B -->|"torch.stft + mel filterbank"| C["Log-Mel Spectrogram\n[n_mels × frames]"]
    C -->|"pad_or_trim to 3000 frames"| D["30-Second Chunk\n[n_mels × 3000]"]
    D -->|"AudioEncoder\n(conv stem + transformer)"| E["Audio Features\n[1500 × d_model]"]
    E -->|"TextDecoder\n(autoregressive)"| F["Token Sequence\nwith timestamps"]
    F -->|"Tokenizer.decode()"| G["Text Segments\nwith timing"]
    G -->|"Output Writers"| H["TXT / SRT / VTT / JSON / TSV"]
    style A fill:#f9f,stroke:#333
    style H fill:#9f9,stroke:#333
```
Here's how the key modules map to each stage:
- Audio loading (`audio.py`): FFmpeg decodes any audio format to raw PCM. The entire audio is converted to a mel spectrogram up front.
- Sliding window (`transcribe.py`): The spectrogram is processed in 30-second chunks. Each chunk is padded/trimmed to exactly 3000 frames and fed to the encoder.
- Encoding (`model.py` → `AudioEncoder`): Two 1D convolutions (the second with stride 2) halve the temporal dimension to 1500, then a Transformer stack processes the features.
- Decoding (`decoding.py`): Autoregressive token generation with logit filtering, temperature fallback, and optional beam search. Timestamp tokens mark segment boundaries.
- Word timestamps (`timing.py`): Cross-attention weights are extracted, processed with a median filter, and aligned to tokens via Dynamic Time Warping.
- Output (`utils.py`): A writer hierarchy formats the results as plain text, subtitles (SRT/VTT), TSV, or JSON.
> Tip: The 30-second chunk size isn't arbitrary — it's the maximum duration the model was trained on. The constants in `audio.py` (`SAMPLE_RATE = 16000`, `CHUNK_LENGTH = 30`, `N_SAMPLES = 480000`, `N_FRAMES = 3000`) are all derived from this fundamental design choice.
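Those relationships can be checked in a few lines. `HOP_LENGTH` is the STFT hop size (10 ms at 16 kHz, also defined in `audio.py`); the final line about the encoder's 1500 positions is my derivation from the stride-2 convolution, not a named constant:

```python
SAMPLE_RATE = 16_000                     # everything is resampled to 16 kHz mono
CHUNK_LENGTH = 30                        # seconds per training/inference window
HOP_LENGTH = 160                         # STFT hop: 10 ms at 16 kHz

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # 480_000 raw samples per chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3_000 mel frames per chunk
AUDIO_CTX = N_FRAMES // 2                # 1_500 encoder positions after stride-2 conv
```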
## What's Next
With the map in hand, we're ready to go deep. In Article 2, we'll trace the audio preprocessing pipeline in detail — from FFmpeg subprocess invocation through STFT computation to the convolutional encoder stem, tracking tensor dimensions at every stage. You'll see exactly how a sound wave becomes the 1500-frame feature sequence that the decoder attends to.