Whisper's Architecture at a Glance: Navigating OpenAI's Speech Recognition Codebase

Intermediate

Prerequisites

  • Basic Python proficiency
  • General awareness of neural network concepts
  • Familiarity with PyTorch nn.Module basics

OpenAI's Whisper is one of those rare open-source projects that punches well above its weight in terms of code-to-capability ratio. The entire system — multilingual speech recognition, translation, language detection, and word-level timestamps — fits in roughly 3,300 lines of Python spread across just nine modules. No sprawling abstractions, no plugin architecture, no configuration DSL. Just a tight encoder-decoder Transformer with a well-designed pipeline around it.

This article is the first in a six-part series dissecting every meaningful line in the Whisper codebase. By the end of this piece, you'll have a mental map of the entire project and know exactly which file to open for any question about the system.

Project Structure and Module Map

The Whisper package is flat — there's no nested subpackage hierarchy, no core/ vs utils/ vs services/ sprawl. Every module lives at the top level of the whisper/ directory, and each has a single, clear responsibility.

Module         Lines (approx.)   Responsibility
__init__.py    162               Model registry, download, load_model() API
model.py       345               ModelDimensions, Whisper, encoder/decoder architecture
audio.py       157               Audio loading, mel spectrogram computation
tokenizer.py   395               Tiktoken wrapper, special tokens, language support
decoding.py    826               Autoregressive decoding, beam search, logit filters
transcribe.py  623               Sliding-window transcription loop, CLI
timing.py      389               Word timestamps via cross-attention + DTW
utils.py       318               Output writers, formatting, compression ratio
triton_ops.py  118               GPU-accelerated DTW and median filter kernels

There's also an assets/ directory holding pre-computed mel filterbanks (mel_filters.npz) and BPE vocabularies (gpt2.tiktoken, multilingual.tiktoken), a normalizers/ subpackage for text normalization, and a version.py for version tracking.

flowchart TD
    subgraph "whisper/ package"
        INIT["__init__.py\n(model registry + load)"]
        MODEL["model.py\n(Whisper nn.Module)"]
        AUDIO["audio.py\n(mel spectrograms)"]
        TOKEN["tokenizer.py\n(tiktoken wrapper)"]
        DECODE["decoding.py\n(beam search + filters)"]
        TRANS["transcribe.py\n(sliding window + CLI)"]
        TIMING["timing.py\n(word timestamps)"]
        UTILS["utils.py\n(output writers)"]
        TRITON["triton_ops.py\n(GPU kernels)"]
    end

    INIT --> MODEL
    INIT --> AUDIO
    INIT --> DECODE
    INIT --> TRANS
    TRANS --> AUDIO
    TRANS --> DECODE
    TRANS --> TIMING
    TRANS --> TOKEN
    DECODE --> TOKEN
    TIMING --> TRITON
    TIMING --> MODEL
    MODEL --> DECODE
    MODEL --> TRANS

Tip: The dependency arrows between model.py, decoding.py, and transcribe.py are circular by design. Whisper resolves this through a clever method-binding pattern we'll examine shortly — it's one of the most interesting structural decisions in the codebase.

Entry Points: CLI and Python API

Whisper offers two convergent entry points, and both ultimately call the same transcribe() function.

The CLI entry point is declared in pyproject.toml#L35:

[project.scripts]
whisper = "whisper.transcribe:cli"

This means running whisper from the command line invokes cli() in transcribe.py. You can also run python -m whisper, which hits whisper/__main__.py — a two-line file that simply imports and calls cli().

The Python API entry point is whisper.load_model(), exported from whisper/__init__.py. After loading a model, you call model.transcribe(audio), which invokes the same underlying function.

flowchart LR
    A["$ whisper audio.mp3"] --> B["transcribe.py:cli()"]
    C["$ python -m whisper"] --> D["__main__.py"] --> B
    E["model.transcribe(audio)"] --> F["transcribe.py:transcribe()"]
    B --> G["load_model()"] --> F

Both paths converge on transcribe(), which is the heart of the system. The CLI adds argparse scaffolding, model loading, and output writing around it; the Python API exposes it directly as a method on the Whisper class.

Model Registry and Download System

Whisper ships 12 model variants, from the 39M-parameter tiny to the 1.55B-parameter large-v3. The model registry is a plain dictionary mapping model names to Azure CDN URLs in whisper/__init__.py#L17-L32:

_MODELS = {
    "tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
    # ...
    "turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
}

Notice how the SHA256 hash is embedded in the URL path itself — the penultimate path segment is the expected checksum. The _download() function at line 54 extracts it with url.split("/")[-2] and verifies the downloaded bytes against it. This is an elegant trick: no separate checksums file, no manifest — the URL is the integrity record.
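
That integrity check is easy to reproduce. The sketch below is illustrative, not Whisper's actual code — the URL is hypothetical and the helper names are mine; the only assumption carried over from the source is that the digest is always the penultimate path segment:

```python
import hashlib

def expected_sha256(url: str) -> str:
    # The penultimate path segment of the download URL is the SHA256 hex digest.
    return url.split("/")[-2]

def verify(url: str, payload: bytes) -> bool:
    # Compare the downloaded bytes' digest against the one embedded in the URL.
    return hashlib.sha256(payload).hexdigest() == expected_sha256(url)

# Hypothetical URL whose embedded digest is the SHA256 of b"hello"
url = ("https://example.com/models/"
       "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824/tiny.pt")
assert verify(url, b"hello")
assert not verify(url, b"tampered")
```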

Alongside the model URLs, there's _ALIGNMENT_HEADS at lines 36-51 — base85-encoded, gzip-compressed boolean arrays identifying which cross-attention heads correlate best with word-level timing. These are decoded in model.py's set_alignment_heads() method and used during word timestamp extraction (covered in Article 6).
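
The encode/decode round trip needs only the standard library. This sketch uses illustrative layer/head counts, not Whisper's real masks, and skips the NumPy/PyTorch reshape that set_alignment_heads() performs at the end:

```python
import base64
import gzip

n_text_layer, n_text_head = 4, 6   # illustrative dimensions, not a real model's
alignment = {(2, 3), (3, 1)}       # (layer, head) pairs that track word timing

# Flatten to a boolean mask, then compress + base85-encode (the stored form).
mask = [(l, h) in alignment for l in range(n_text_layer) for h in range(n_text_head)]
dump = base64.b85encode(gzip.compress(bytes(mask)))

# Decoding reverses both steps; Whisper then reshapes the flat array
# to [n_text_layer, n_text_head].
decoded = [b == 1 for b in gzip.decompress(base64.b85decode(dump))]
assert decoded == mask
```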

The load_model() function at lines 103-161 handles three cases: loading a named model from the registry, loading from a local file path, or failing with a helpful error. It deserializes the checkpoint, constructs ModelDimensions from the saved dims dict, instantiates the Whisper class, loads weights, and configures alignment heads. The model is cached to ~/.cache/whisper by default (respecting XDG_CACHE_HOME).
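
The three-way dispatch can be paraphrased as follows — a sketch, not the real function: the registry contents, return values, and function name are stand-ins, with download and checkpoint IO stubbed out:

```python
import os

_MODELS = {"tiny": "https://example.com/models/<sha256>/tiny.pt"}  # stand-in registry

def resolve_checkpoint(name: str) -> str:
    """Sketch of load_model()'s dispatch: registry name, local path, or error."""
    if name in _MODELS:
        return f"download:{_MODELS[name]}"   # named model -> _download(...)
    if os.path.isfile(name):
        return f"local:{name}"               # existing file -> load directly
    raise RuntimeError(
        f"Model {name} not found; available models = {list(_MODELS)}"
    )
```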

The Whisper Class and ModelDimensions

The entire architecture is parameterized by a single 10-field dataclass: ModelDimensions.

@dataclass
class ModelDimensions:
    n_mels: int          # mel frequency bins (80 or 128)
    n_audio_ctx: int     # audio context length (1500)
    n_audio_state: int   # encoder hidden size
    n_audio_head: int    # encoder attention heads
    n_audio_layer: int   # encoder layers
    n_vocab: int         # vocabulary size
    n_text_ctx: int      # text context length (448)
    n_text_state: int    # decoder hidden size
    n_text_head: int     # decoder attention heads
    n_text_layer: int    # decoder layers

These 10 numbers fully specify every dimension in the model. The Whisper class at lines 252-345 composes an AudioEncoder and TextDecoder from these dimensions:

classDiagram
    class ModelDimensions {
        +int n_mels
        +int n_audio_ctx
        +int n_audio_state
        +int n_audio_head
        +int n_audio_layer
        +int n_vocab
        +int n_text_ctx
        +int n_text_state
        +int n_text_head
        +int n_text_layer
    }

    class Whisper {
        +ModelDimensions dims
        +AudioEncoder encoder
        +TextDecoder decoder
        +embed_audio(mel)
        +logits(tokens, audio_features)
        +detect_language()
        +transcribe()
        +decode()
    }

    class AudioEncoder {
        +Conv1d conv1
        +Conv1d conv2
        +Tensor positional_embedding
        +ModuleList blocks
    }

    class TextDecoder {
        +Embedding token_embedding
        +Parameter positional_embedding
        +ModuleList blocks
    }

    Whisper --> ModelDimensions
    Whisper --> AudioEncoder
    Whisper --> TextDecoder

The most architecturally interesting detail is at the very bottom of model.py, lines 343-345:

detect_language = detect_language_function
transcribe = transcribe_function
decode = decode_function

These lines bind module-level functions from decoding.py and transcribe.py as methods on the Whisper class. This is how Whisper sidesteps circular imports: model.py imports detect_language, decode, and transcribe as plain functions at the top of the file and assigns them as class attributes, while decoding.py and transcribe.py reference the Whisper type only under TYPE_CHECKING — so there is no runtime circular dependency.

When you call model.transcribe(audio), Python passes the model instance as self — which becomes the model parameter in the transcribe() function. It's simple, it works, and it avoids the complexity of a plugin system.
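
The mechanism is ordinary Python attribute binding, easy to reproduce in isolation — toy names and a toy class here, not Whisper's code:

```python
# A module-level function assigned as a class attribute becomes a bound
# method: calling it on an instance passes that instance as the first
# argument ("model" here, rather than the conventional "self").
def transcribe_function(model, audio):
    return f"[{model.name}] transcribing {audio}"

class Whisper:
    name = "demo"
    transcribe = transcribe_function  # same trick as the last lines of model.py

result = Whisper().transcribe("audio.mp3")
assert result == "[demo] transcribing audio.mp3"
```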

End-to-End Data Flow Overview

Before diving deep in the coming articles, let's trace the full path from audio file to text output:

flowchart TD
    A["Audio File\n(any format)"] -->|"ffmpeg subprocess"| B["Raw Waveform\n16kHz mono float32"]
    B -->|"torch.stft + mel filterbank"| C["Log-Mel Spectrogram\n[n_mels × frames]"]
    C -->|"pad_or_trim to 3000 frames"| D["30-Second Chunk\n[n_mels × 3000]"]
    D -->|"AudioEncoder\n(conv stem + transformer)"| E["Audio Features\n[1500 × d_model]"]
    E -->|"TextDecoder\n(autoregressive)"| F["Token Sequence\nwith timestamps"]
    F -->|"Tokenizer.decode()"| G["Text Segments\nwith timing"]
    G -->|"Output Writers"| H["TXT / SRT / VTT / JSON / TSV"]

    style A fill:#f9f,stroke:#333
    style H fill:#9f9,stroke:#333

Here's how the key modules map to each stage:

  1. Audio loading (audio.py): FFmpeg decodes any audio format to raw PCM. The entire audio is converted to a mel spectrogram up front.

  2. Sliding window (transcribe.py): The spectrogram is processed in 30-second chunks. Each chunk is padded/trimmed to exactly 3000 frames and fed to the encoder.

  3. Encoding (model.py, AudioEncoder): Two 1D convolutions (the second with stride 2) halve the temporal dimension to 1500, then a Transformer stack processes the features.

  4. Decoding (decoding.py): Autoregressive token generation with logit filtering, temperature fallback, and optional beam search. Timestamp tokens mark segment boundaries.

  5. Word timestamps (timing.py): Cross-attention weights are extracted, processed with a median filter, and aligned to tokens via Dynamic Time Warping.

  6. Output (utils.py): A writer hierarchy formats the results as plain text, subtitles (SRT/VTT), TSV, or JSON.
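
The pad-or-trim behavior in step 2 can be sketched on a plain list — the real pad_or_trim in audio.py operates on NumPy arrays or PyTorch tensors along the last axis, but the logic is the same:

```python
N_FRAMES = 3000  # spectrogram frames per 30-second chunk

def pad_or_trim(frames: list, length: int = N_FRAMES) -> list:
    """List-based sketch of audio.py's pad_or_trim along the time axis."""
    if len(frames) > length:
        return frames[:length]                        # trim long inputs
    return frames + [0.0] * (length - len(frames))    # zero-pad short ones

assert len(pad_or_trim([0.1] * 5000)) == 3000
assert len(pad_or_trim([0.1] * 100)) == 3000
```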

Tip: The 30-second chunk size isn't arbitrary — it's the maximum duration the model was trained on. The constants in audio.py (SAMPLE_RATE=16000, CHUNK_LENGTH=30, N_SAMPLES=480000, N_FRAMES=3000) are all derived from this fundamental design choice.
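
The derivation is simple arithmetic; the one extra ingredient is the STFT hop size, HOP_LENGTH = 160 samples (10 ms at 16 kHz), also defined in audio.py:

```python
SAMPLE_RATE = 16_000   # Hz
CHUNK_LENGTH = 30      # seconds, the maximum duration the model was trained on
HOP_LENGTH = 160       # STFT hop size in samples (10 ms)

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # 480_000 samples per chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3_000 spectrogram frames per chunk

assert (N_SAMPLES, N_FRAMES) == (480_000, 3_000)
```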

What's Next

With the map in hand, we're ready to go deep. In Article 2, we'll trace the audio preprocessing pipeline in detail — from FFmpeg subprocess invocation through STFT computation to the convolutional encoder stem, tracking tensor dimensions at every stage. You'll see exactly how a sound wave becomes the 1500-frame feature sequence that the decoder attends to.