Read OSS

openai/whisper

6 articles

01

Whisper's Architecture at a Glance: Navigating OpenAI's Speech Recognition Codebase

A guided tour of the entire Whisper codebase, mapping the 9-module Python package from entry points to model architecture.

02

From Sound Waves to Mel Spectrograms: Whisper's Audio Frontend

Traces the complete audio preprocessing pipeline from raw audio file to the tensor consumed by the encoder.

03

Whisper's Token Language: How Tiktoken Encodes Text, Time, and Task

Explores the tiktoken-based tokenization system that bridges audio and text in Whisper.

04

Inside Whisper's Decoder: Beam Search, Logit Filters, and the KV-Cache

A deep dive into the autoregressive decoding system, the most architecturally rich part of the codebase.

05

The 30-Second Window: Whisper's Transcription Loop and Failure Recovery

Explains the main transcription loop, which processes full audio files by sliding a 30-second window across them and recovers from decoding failures along the way.

06

Word-Level Timestamps: Cross-Attention Alignment, DTW, and Output Formatting

Covers the word timestamp system built on cross-attention weight extraction and Dynamic Time Warping, plus the output writer hierarchy.