
The 30-Second Window: Whisper's Transcription Loop and Failure Recovery

Advanced

Prerequisites

  • Articles 1-4
  • Understanding of the decoding system


The decoder we examined in Article 4 operates on exactly one 30-second mel spectrogram segment. Real audio, of course, can be any length — a 3-second voice memo or a 2-hour podcast. The transcribe() function bridges this gap with a sliding-window loop that stitches decoded segments together, handles decoding failures gracefully, and detects hallucinations. At ~475 lines, it's the second-largest function in the codebase and arguably the most important for real-world usage.

Transcription Function Setup

The function signature at lines 38-56 reveals the full set of controls available to users:

def transcribe(
    model: "Whisper",
    audio: Union[str, np.ndarray, torch.Tensor],
    *,
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    carry_initial_prompt: bool = False,
    word_timestamps: bool = False,
    # ...
    **decode_options,
):

The initialization at lines 127-178 does several things:

  1. Computes the mel spectrogram for the entire audio with 30 seconds of padding
  2. Detects the language from the first 30 seconds (if not specified)
  3. Creates the tokenizer
  4. Parses clip_timestamps into seek points

flowchart TD
    A["transcribe() called"] --> B["Compute mel for full audio\n+ 30s padding"]
    B --> C{"Language\nspecified?"}
    C -->|No| D["Detect from first 30s\nusing model.detect_language()"]
    C -->|Yes| E["Use specified language"]
    D --> F["Create tokenizer"]
    E --> F
    F --> G["Parse clip_timestamps\ninto seek_clips"]
    G --> H["Enter sliding window loop"]

The clip_timestamps feature allows processing only specific portions of the audio — useful for pre-segmented content or skipping known non-speech regions. The timestamps are converted to frame-level seek points, paired as (start, end) tuples.
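That parsing step is compact enough to sketch. This is a simplified rendering of the logic, with `parse_clip_timestamps` as a hypothetical helper name (the real code is inlined in transcribe()); the constant reflects Whisper's mel frame rate of SAMPLE_RATE // HOP_LENGTH = 16000 // 160 = 100 frames per second.

```python
FRAMES_PER_SECOND = 100  # mel frames per second: 16000 // 160

def parse_clip_timestamps(clip_timestamps, content_frames):
    """Turn "start,end,start,end,..." (seconds) into (start, end) frame pairs."""
    if isinstance(clip_timestamps, str):
        seek_points = [
            round(float(ts) * FRAMES_PER_SECOND)
            for ts in (clip_timestamps.split(",") if clip_timestamps else [])
        ]
    else:
        seek_points = [round(ts * FRAMES_PER_SECOND) for ts in clip_timestamps]
    if len(seek_points) == 0:
        seek_points = [0]                   # default: start at the beginning
    if len(seek_points) % 2 == 1:
        seek_points.append(content_frames)  # odd count: final clip is open-ended
    return list(zip(seek_points[::2], seek_points[1::2]))

# "0,10,25" on a 60-second file (6000 frames) yields two clips:
# [(0, 1000), (2500, 6000)] - the trailing "25" is open-ended.
```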

Temperature Fallback Strategy

The decode_with_fallback() inner function implements Whisper's primary failure recovery mechanism. The default temperature tuple (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) means: try greedy decoding first, and if the output looks degenerate, retry with progressively more randomness.

flowchart TD
    A["Try T=0.0\n(greedy)"] --> B{"Check quality"}
    B -->|"compression_ratio > 2.4\n(repetitive)"| C["Try T=0.2"]
    B -->|"avg_logprob < -1.0\n(low confidence)"| C
    B -->|"no_speech + low logprob\n(silence)"| D["Accept as silence"]
    B -->|"Passes all checks"| E["Accept result"]
    C --> F{"Check quality"}
    F -->|"Still failing"| G["Try T=0.4, 0.6, ..."]
    F -->|"Passes"| E
    G --> H["Accept last result\n(even if poor)"]

Two quality metrics drive the fallback:

  • Compression ratio (via compression_ratio() in utils.py): len(text_bytes) / len(zlib.compress(text_bytes)). A ratio above 2.4 indicates repetitive text — the model is stuck in a loop generating the same phrase over and over. This is the single most common failure mode of autoregressive models.

  • Average log probability: If below -1.0, the model is uncertain about its output. Low confidence combined with high no-speech probability indicates silence rather than a decoding failure.
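The compression-ratio metric is easy to reproduce. This small demonstration mirrors the idea behind compression_ratio() in utils.py; the example strings are illustrative, not from the codebase.

```python
import zlib

def compression_ratio(text: str) -> float:
    # Ratio of raw UTF-8 bytes to zlib-compressed bytes.
    # Repetitive text compresses extremely well, driving the ratio up.
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))

normal = "The quick brown fox jumps over the lazy dog near the river bank."
looped = "I'm sorry. " * 30  # the classic repetition-loop failure output

print(compression_ratio(normal))  # well under the 2.4 threshold
print(compression_ratio(looped))  # far above 2.4 -> would trigger fallback
```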

The interplay between these thresholds is subtle. If no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold, the segment is treated as silence and the fallback is not triggered. This prevents the system from wasting compute retrying on genuinely silent sections.

Tip: When temperature > 0, beam search is automatically disabled (beam_size and patience are popped from kwargs). Similarly, best_of sampling is disabled at temperature == 0. This ensures compatible decoding strategies.
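The fallback loop itself can be paraphrased in a few lines. This is a simplified sketch of the shape of decode_with_fallback(), not the real implementation; `decode(t)` stands in for the model.decode() call at temperature t and is assumed to return an object with .compression_ratio, .avg_logprob, and .no_speech_prob attributes.

```python
from types import SimpleNamespace

def decode_with_fallback_sketch(decode,
                                temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                                compression_ratio_threshold=2.4,
                                logprob_threshold=-1.0,
                                no_speech_threshold=0.6):
    result = None
    for t in temperatures:
        result = decode(t)
        needs_fallback = False
        if (compression_ratio_threshold is not None
                and result.compression_ratio > compression_ratio_threshold):
            needs_fallback = True   # repetitive output
        if (logprob_threshold is not None
                and result.avg_logprob < logprob_threshold):
            needs_fallback = True   # low-confidence output
        if (no_speech_threshold is not None
                and result.no_speech_prob > no_speech_threshold):
            needs_fallback = False  # probably silence: don't retry
        if not needs_fallback:
            break
    return result  # the last attempt is kept even if every check failed

# Toy run: greedy decoding loops, the first retry passes the checks.
attempts = {
    0.0: SimpleNamespace(compression_ratio=3.1, avg_logprob=-0.4, no_speech_prob=0.1),
    0.2: SimpleNamespace(compression_ratio=1.3, avg_logprob=-0.4, no_speech_prob=0.1),
}
tried = []
result = decode_with_fallback_sketch(lambda t: tried.append(t) or attempts[t])
# tried == [0.0, 0.2]; result is the T=0.2 attempt
```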

The Sliding Window and Seek Management

The main loop at lines 272-399 is the heart of transcribe(). The variable seek tracks the current position in mel frames, and advances by different amounts depending on the decoded output.

The code comments acknowledge this directly: "This loop is obscurely flattened to make the diff readable." It's a while loop over clip_idx and seek that logically represents nested iteration over clips and windows.

Each iteration:

  1. Extracts a mel segment at the current seek position
  2. Pads it to exactly N_FRAMES (3000) frames
  3. Sets up the prompt from previous context
  4. Calls decode_with_fallback()
  5. Parses timestamp tokens to determine segment boundaries
  6. Advances seek

The seek advancement logic has two cases based on whether the decoder produced consecutive timestamp tokens:

Case 1: Consecutive timestamps found (the common case). The output is sliced at each pair of consecutive timestamps. Each slice becomes a segment with precise start/end times. If the sequence ends with a single timestamp (not paired), it means "no speech after this point" and seek advances by the full segment_size. Otherwise, seek advances to the position of the last timestamp.

Case 2: No consecutive timestamps. The entire output becomes a single segment. If there's any timestamp token at all, its position determines the duration; otherwise, the full segment_duration is used. Seek always advances by segment_size.

This distinction matters because timestamp tokens provide sub-window precision. Instead of always jumping 30 seconds forward, the loop can advance by the exact amount of decoded audio, avoiding gaps or overlaps.
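The two cases can be condensed into a sketch. `advance_seek` is a hypothetical helper (the real logic is inlined in the loop); `timestamp_begin` is the tokenizer's first timestamp token id, each timestamp step is worth 0.02 s, and `input_stride` is 2 mel frames per step (N_FRAMES // n_audio_ctx = 3000 // 1500).

```python
def advance_seek(seek, segment_size, tokens, timestamp_begin, input_stride=2):
    """Sketch of seek advancement; ids >= timestamp_begin are timestamp tokens."""
    timestamp_tokens = [t >= timestamp_begin for t in tokens]
    consecutive = [
        i + 1
        for i in range(len(tokens) - 1)
        if timestamp_tokens[i] and timestamp_tokens[i + 1]
    ]
    if len(consecutive) > 0:
        # Case 1: slice at paired timestamps. A lone trailing timestamp means
        # "no speech after this point": jump the full window. Otherwise,
        # advance exactly to the last timestamp's position.
        ends_with_single = timestamp_tokens[-1] and not timestamp_tokens[-2]
        if ends_with_single:
            return seek + segment_size
        last_timestamp_pos = tokens[consecutive[-1] - 1] - timestamp_begin
        return seek + last_timestamp_pos * input_stride
    # Case 2: no consecutive timestamps; always jump the full window.
    return seek + segment_size

# With a hypothetical timestamp_begin of 1000: a trailing timestamp pair
# at offset 100 steps advances seek by 100 * 2 = 200 frames (2 seconds).
```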

No-Speech Detection and Prompt Conditioning

The no-speech check at lines 298-310 is a fast-path skip:

if no_speech_threshold is not None:
    should_skip = result.no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and result.avg_logprob > logprob_threshold:
        should_skip = False  # high confidence overrides no-speech
    if should_skip:
        seek += segment_size
        continue

The logic: if the model thinks there's no speech (no_speech_prob > 0.6) and the average log probability is low (confirming uncertainty), skip the segment. But if the logprob is high despite a high no-speech probability, trust the transcription — the model might be picking up quiet but clear speech.

Prompt conditioning across windows is managed at lines 503-505:

if not condition_on_previous_text or result.temperature > 0.5:
    prompt_reset_since = len(all_tokens)

When condition_on_previous_text is enabled (the default), all previously decoded tokens are passed as a prompt to the next window. This helps maintain consistency — proper nouns, stylistic choices, and context carry forward. But when a high temperature was needed (above 0.5), the prompt is reset, because high-temperature output is likely unreliable context.

The carry_initial_prompt option at lines 288-293 provides persistent context that survives prompt resets. When enabled, the initial_prompt tokens are always prepended, ensuring domain-specific vocabulary or formatting instructions persist across the entire transcription.
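The interaction between the reset point and carry_initial_prompt reduces to a small amount of list slicing. A sketch, with `build_prompt` as a hypothetical helper name for logic that is inlined in the loop:

```python
def build_prompt(all_tokens, prompt_reset_since, initial_prompt_tokens,
                 carry_initial_prompt):
    """Sketch of per-window prompt construction in transcribe()."""
    if carry_initial_prompt:
        # Always re-prepend the initial prompt; skip tokens that are either
        # part of it already or that precede the last reset point.
        nignored = max(len(initial_prompt_tokens), prompt_reset_since)
        return initial_prompt_tokens + all_tokens[nignored:]
    # Default: everything decoded since the last prompt reset.
    return all_tokens[prompt_reset_since:]

# After a reset at index 2, carrying token [9] as the initial prompt:
# build_prompt([1, 2, 3, 4, 5], 2, [9], True)  -> [9, 3, 4, 5]
# build_prompt([1, 2, 3, 4, 5], 2, [9], False) -> [3, 4, 5]
```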

Hallucination Detection and Recovery

When word_timestamps is enabled, transcribe() activates hallucination detection at lines 419-472. The hallucination_silence_threshold parameter controls this behavior.

The system uses two local functions to score segments:

def word_anomaly_score(word: dict) -> float:
    probability = word.get("probability", 0.0)
    duration = word["end"] - word["start"]
    score = 0.0
    if probability < 0.15:
        score += 1.0
    if duration < 0.133:
        score += (0.133 - duration) * 15
    if duration > 2.0:
        score += duration - 2.0
    return score

A word is "anomalous" if it has low probability (< 0.15), is suspiciously short (< 133ms), or suspiciously long (> 2s). The is_segment_anomaly() function sums these scores across the first 8 non-punctuation words and flags the segment if the total reaches 3 or comes within 0.01 of the number of scored words.
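Putting the two checks together, here is a sketch of the segment-level test; the punctuation set is simplified from the real one, and the word dicts follow the same shape as word_anomaly_score() expects.

```python
def word_anomaly_score(word: dict) -> float:
    # As above: penalize low probability and implausible durations.
    probability = word.get("probability", 0.0)
    duration = word["end"] - word["start"]
    score = 0.0
    if probability < 0.15:
        score += 1.0
    if duration < 0.133:
        score += (0.133 - duration) * 15
    if duration > 2.0:
        score += duration - 2.0
    return score

def is_segment_anomaly(segment) -> bool:
    """Sketch of the segment check; the real punctuation set is longer."""
    if segment is None or not segment.get("words"):
        return False
    words = [w for w in segment["words"] if w["word"].strip() not in ".,!?\"'"]
    words = words[:8]  # only the first 8 non-punctuation words are scored
    score = sum(word_anomaly_score(w) for w in words)
    return score >= 3 or score + 0.01 >= len(words)
```

A segment of confident, normally paced words scores 0 and passes; a run of low-probability 50 ms words racks up both penalties and gets flagged.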

flowchart TD
    A["Segment decoded\nwith word timestamps"] --> B{"is_segment_anomaly?"}
    B -->|No| C["Keep segment"]
    B -->|Yes| D{"Surrounded by\nsilence?"}
    D -->|Yes| E["Skip hallucinated\nsegment, re-seek"]
    D -->|No| C
    
    F["First segment\nanomaly?"] --> G{"Gap before\n> threshold?"}
    G -->|Yes| H["Skip leading silence\nre-seek to gap"]
    G -->|No| C

The recovery strategy is position-aware. If the first segment in a window is anomalous and there's a gap before it exceeding the threshold, the loop re-seeks to just after the gap — skipping the silence that likely triggered the hallucination. For mid-window hallucinations, the system checks for surrounding silence and truncates the segment list, re-seeking to the anomaly's start position.

This is a heuristic system — it's not perfect, but it catches the most common hallucination pattern: the model generating plausible-sounding text during silent or noisy sections.

The CLI Interface

The cli() function wraps transcribe() with argument parsing and output writing. The argparse setup at lines 528-567 maps every transcribe() parameter to a command-line flag.

Notable CLI behaviors:

  • Temperature increment: Rather than specifying the full tuple, users set --temperature (default 0) and --temperature_increment_on_fallback (default 0.2), and the CLI generates the tuple with np.arange(temperature, 1.0 + 1e-6, increment).
  • Multi-file processing: The audio argument accepts multiple files. Each is transcribed independently with error handling that skips failures rather than aborting.
  • Output format: The --output_format flag defaults to "all", which invokes every writer (TXT, VTT, SRT, TSV, JSON) via the composite pattern in get_writer().
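The temperature-tuple expansion is worth seeing concretely. A sketch of the CLI's logic, with `build_temperatures` as a hypothetical helper name:

```python
import numpy as np

def build_temperatures(temperature=0.0, increment=0.2):
    # Expands --temperature and --temperature_increment_on_fallback into the
    # tuple transcribe() expects; the 1e-6 slack keeps 1.0 inside the range.
    if increment is not None:
        return tuple(np.arange(temperature, 1.0 + 1e-6, increment))
    return (temperature,)

# build_temperatures() yields the six-step default schedule
# 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; passing increment=None disables fallback.
```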

Tip: English-only models (.en suffix) automatically force language="en" even if a different language is specified, with a warning. This prevents confusing behavior when using the wrong model for a language.

What's Next

With the transcription loop understood, one major subsystem remains: word-level timestamps. In Article 6, we'll explore how Whisper extracts cross-attention weights, aligns them to tokens using Dynamic Time Warping, and formats the results into word-highlighted subtitles — including the Triton kernel that generates its own source code at runtime.