openai/whisper
6 articles
Prerequisites
- Basic Python proficiency
- General awareness of neural network concepts
- Familiarity with PyTorch nn.Module basics
01
Whisper's Architecture at a Glance: Navigating OpenAI's Speech Recognition Codebase
A guided tour of the entire Whisper codebase, mapping the 9-module Python package from entry points to model architecture.
02
From Sound Waves to Mel Spectrograms: Whisper's Audio Frontend
Traces the complete audio preprocessing pipeline from raw audio file to the tensor consumed by the encoder.
03
Whisper's Token Language: How Tiktoken Encodes Text, Time, and Task
Explores the tiktoken-based tokenization system that bridges audio and text in Whisper.
04
Inside Whisper's Decoder: Beam Search, Logit Filters, and the KV-Cache
Deep dive into the autoregressive decoding system — the most architecturally rich part of the codebase.
05
The 30-Second Window: Whisper's Transcription Loop and Failure Recovery
Explains the main transcription loop that processes full audio files by sliding a 30-second window with robust failure recovery.
06
Word-Level Timestamps: Cross-Attention Alignment, DTW, and Output Formatting
Covers the word timestamp system built on cross-attention weight extraction and Dynamic Time Warping, plus the output writer hierarchy.