openai/whisper
6 articles
Prerequisites
- Basic Python proficiency
- General awareness of neural network concepts
- Familiarity with PyTorch nn.Module basics
01
Whisper's Architecture at a Glance: Navigating OpenAI's Speech Recognition Codebase
A guided tour of the entire Whisper codebase, mapping the 9-module Python package from entry points to model architecture.
02
From Sound Waves to Mel Spectrograms: Whisper's Audio Frontend
Traces the complete audio preprocessing pipeline from raw audio file to the tensor consumed by the encoder.
03
Whisper's Token Language: How Tiktoken Encodes Text, Time, and Task
Explores the tiktoken-based tokenization system that bridges audio and text in Whisper.
04
Inside Whisper's Decoder: Beam Search, Logit Filters, and the KV-Cache
Deep dive into the autoregressive decoding system — the most architecturally rich part of the codebase.
05
The 30-Second Window: Whisper's Transcription Loop and Failure Recovery
Explains the main transcription loop that processes full audio files by sliding a 30-second window with robust failure recovery.
06
Word-Level Timestamps: Cross-Attention Alignment, DTW, and Output Formatting
Covers the word timestamp system built on cross-attention weight extraction and Dynamic Time Warping, plus the output writer hierarchy.