Read OSS

Whisper's Token Language: How Tiktoken Encodes Text, Time, and Task

Intermediate

Prerequisites

  • Articles 1-2
  • Familiarity with BPE tokenization concepts

Whisper's tokenizer does more than encode text — it's a complete protocol for communicating task instructions and temporal boundaries to the model. Through a carefully designed set of special tokens, a single decoder can transcribe, translate, detect languages, and mark precise segment timestamps. All of this is orchestrated by whisper/tokenizer.py, a 395-line module built on OpenAI's tiktoken library.

As we saw in the architecture overview, the tokenizer sits between the decoding system and the model, defining the vocabulary that the TextDecoder's embedding layer maps to hidden states. Understanding the token structure is essential before we can make sense of the decoding logic in Article 4.

Dual Vocabularies and Encoding Setup

Whisper ships two BPE vocabularies as tiktoken files in whisper/assets/:

  • gpt2.tiktoken — The GPT-2 vocabulary used by English-only models (.en suffix). Contains 50,256 base tokens.
  • multilingual.tiktoken — An expanded vocabulary for multilingual models. Contains 50,257 base tokens.

The get_encoding() function constructs a tiktoken Encoding object from these files. It's decorated with @lru_cache(maxsize=None), ensuring the vocabulary is loaded and parsed only once per process:

@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
    vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
    ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in open(vocab_path) if line)
    }
    n_vocab = len(ranks)
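The rank-parsing comprehension can be exercised on a toy input. The sample lines below are illustrative, not from the real vocabulary files; each .tiktoken line is a base64-encoded token followed by its integer rank:

```python
import base64

# Three hypothetical "<base64 token> <rank>" lines in the .tiktoken format
sample = b"IQ== 0\nIg== 1\nSGVsbG8= 2\n"

# Same comprehension shape as get_encoding(), applied to the sample bytes
ranks = {
    base64.b64decode(token): int(rank)
    for token, rank in (line.split() for line in sample.splitlines() if line)
}

assert ranks[b"Hello"] == 2
assert len(ranks) == 3
```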

After loading the base vocabulary, the function programmatically constructs the special token set — and this is where things get interesting.

Special Token Architecture

The special tokens are built dynamically in get_encoding(). Their IDs are assigned sequentially starting after the base vocabulary:

specials = [
    "<|endoftext|>",
    "<|startoftranscript|>",
    *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
    "<|translate|>",
    "<|transcribe|>",
    "<|startoflm|>",
    "<|startofprev|>",
    "<|nospeech|>",
    "<|notimestamps|>",
    *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
]

This single list defines the complete special token vocabulary. The structure is:

flowchart TD
    A["Base BPE Vocabulary\n~50k tokens"] --> B["<|endoftext|>\n(EOT)"]
    B --> C["<|startoftranscript|>\n(SOT)"]
    C --> D["Language Tokens\n<|en|>, <|zh|>, ... <|yue|>\n(up to 99 languages)"]
    D --> E["Task Tokens\n<|translate|>, <|transcribe|>"]
    E --> F["Control Tokens\n<|startoflm|>, <|startofprev|>\n<|nospeech|>, <|notimestamps|>"]
    F --> G["Timestamp Tokens\n<|0.00|> through <|30.00|>\n(1501 tokens, 0.02s resolution)"]

The LANGUAGES dictionary maps 100 ISO language codes to their English names — from "en" (English) to "yue" (Cantonese). The num_languages parameter controls how many are included. This matters because different model versions support different language counts, and the vocabulary size determines whether a model is_multilingual (recall from Article 1: self.dims.n_vocab >= 51865).
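The vocabulary arithmetic behind that threshold can be checked directly. This is a sketch that assumes a 50,257-token multilingual base vocabulary and counts the specials list shown above:

```python
# Count the special tokens defined in get_encoding()'s specials list
num_languages = 99
n_specials = (
    1       # <|endoftext|>
    + 1     # <|startoftranscript|>
    + num_languages          # <|en|> ... <|yue|>
    + 2     # <|translate|>, <|transcribe|>
    + 4     # <|startoflm|>, <|startofprev|>, <|nospeech|>, <|notimestamps|>
    + 1501  # <|0.00|> through <|30.00|>
)
assert n_specials == 1608

# Base multilingual vocabulary plus specials reaches the is_multilingual threshold
assert 50257 + n_specials == 51865
```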

Tip: The TO_LANGUAGE_CODE dictionary at lines 114-128 includes aliases like "burmese" → "my", "mandarin" → "zh", and "castilian" → "es". This lets users pass language names in natural forms rather than memorizing ISO codes.

Timestamp Tokens and Segment Boundaries

The 1501 timestamp tokens are perhaps Whisper's most distinctive tokenization feature. They encode time positions within the 30-second chunk at 20ms resolution:

  • <|0.00|> = beginning of chunk
  • <|0.02|> = 20ms
  • <|0.04|> = 40ms
  • ...
  • <|30.00|> = end of chunk

These tokens appear inline with the text tokens during decoding. The decoder generates them as part of its output sequence, using consecutive timestamp pairs to mark segment boundaries:

sequenceDiagram
    participant D as Decoder Output
    Note over D: <|0.00|>
    Note over D: "Hello world."
    Note over D: <|2.56|>
    Note over D: <|2.56|>
    Note over D: "How are you?"
    Note over D: <|5.12|>
    Note over D: <|endoftext|>

The pattern is: <|start_time|> text tokens <|end_time|> <|start_time|> text tokens <|end_time|> .... When two consecutive tokens are both timestamps, they mark a segment boundary — the end of one segment and the start of the next. The ApplyTimestampRules logit filter (covered in Article 4) enforces this grammar: timestamps must appear in pairs, they must be monotonically increasing, and at the very first decoding step only timestamp tokens are allowed.
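This segment grammar is easy to parse once the tokens are in hand. The sketch below works on token strings rather than ids for readability; `split_segments` is a hypothetical helper, not part of whisper:

```python
import re

TIMESTAMP = re.compile(r"<\|(\d+\.\d{2})\|>")

def split_segments(tokens):
    """Group a decoded sequence into (start, end, text) segments,
    splitting at consecutive timestamp pairs."""
    segments, start, text = [], None, []
    for tok in tokens:
        m = TIMESTAMP.fullmatch(tok)
        if m:
            t = float(m.group(1))
            if start is None:
                start = t  # opening timestamp of a new segment
            else:
                segments.append((start, t, " ".join(text)))
                start, text = None, []  # next timestamp opens a new segment
        elif tok != "<|endoftext|>":
            text.append(tok)
    return segments

seq = ["<|0.00|>", "Hello", "world.", "<|2.56|>",
       "<|2.56|>", "How", "are", "you?", "<|5.12|>", "<|endoftext|>"]
assert split_segments(seq) == [
    (0.0, 2.56, "Hello world."),
    (2.56, 5.12, "How are you?"),
]
```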

The 20ms resolution comes directly from the model's temporal structure: each audio token represents 320 audio samples at 16kHz = 20ms (as derived by N_SAMPLES_PER_TOKEN = HOP_LENGTH * 2 in audio.py). So each timestamp token maps exactly to one position in the encoder's output sequence.
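The derivation reduces to a few constants from audio.py, reproduced here as a sanity check:

```python
SAMPLE_RATE = 16_000            # Hz (from audio.py)
HOP_LENGTH = 160                # samples per mel frame (from audio.py)
N_SAMPLES_PER_TOKEN = HOP_LENGTH * 2  # encoder downsamples frames by 2

seconds_per_token = N_SAMPLES_PER_TOKEN / SAMPLE_RATE
assert seconds_per_token == 0.02  # 320 samples at 16 kHz = 20 ms

# A 30-second chunk spans 1500 positions; including <|0.00|> gives 1501 tokens
assert int(30 / seconds_per_token) + 1 == 1501
```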

The SOT Sequence and Token Ordering

The Tokenizer dataclass at lines 131-166 wraps the tiktoken encoding and provides convenient access to special tokens. Its __post_init__ method constructs the sot_sequence — the start-of-transcript token sequence that tells the model what to do:

def __post_init__(self):
    # ... populate self.special_tokens dict ...
    # sot, translate, transcribe hold the ids of the corresponding special
    # tokens looked up from that dict; langs is the tuple of language codes
    sot_sequence = [sot]
    if self.language is not None:
        sot_sequence.append(sot + 1 + langs.index(self.language))
    if self.task is not None:
        task_token = transcribe if self.task == "transcribe" else translate
        sot_sequence.append(task_token)
    self.sot_sequence = tuple(sot_sequence)

For a multilingual English transcription, this produces:

<|startoftranscript|>  <|en|>  <|transcribe|>

For X→English translation from Spanish:

<|startoftranscript|>  <|es|>  <|translate|>
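The construction can be sketched end to end with concrete ids, assuming the sequential id layout described earlier (50,257 base tokens, 99 languages; the truncated `langs` tuple is illustrative):

```python
eot = 50257                        # <|endoftext|>
sot = 50258                        # <|startoftranscript|>
langs = ("en", "zh", "de", "es")   # first entries of LANGUAGES (truncated)
translate = sot + 1 + 99           # <|translate|> follows the 99 language tokens
transcribe = translate + 1         # <|transcribe|>

def sot_sequence(language=None, task=None):
    """Mirror of __post_init__: SOT, then optional language and task tokens."""
    seq = [sot]
    if language is not None:
        seq.append(sot + 1 + langs.index(language))
    if task is not None:
        seq.append(transcribe if task == "transcribe" else translate)
    return tuple(seq)

assert sot_sequence("en", "transcribe") == (50258, 50259, 50359)
assert sot_sequence("es", "translate") == (50258, 50262, 50358)
assert sot_sequence() == (50258,)  # English-only models omit language and task
```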

The full token sequence fed to the decoder looks like this:

sequenceDiagram
    participant Prompt as Previous Context
    participant SOT as SOT Sequence
    participant Content as Generated Content

    Note over Prompt: <|startofprev|>
    Note over Prompt: [previous window tokens]
    Note over SOT: <|startoftranscript|>
    Note over SOT: <|language|>
    Note over SOT: <|task|>
    Note over Content: <|timestamp|>
    Note over Content: [text tokens]
    Note over Content: <|timestamp|>
    Note over Content: <|endoftext|>

The <|startofprev|> token and preceding prompt tokens provide context from the previous window (when condition_on_previous_text is enabled, as we'll see in Article 5). The <|notimestamps|> token, available via sot_sequence_including_notimestamps, can be appended to the SOT sequence to disable timestamp generation entirely.

Language-Aware Word Splitting

The split_to_word_tokens() method dispatches to different word segmentation strategies based on the language:

def split_to_word_tokens(self, tokens: List[int]):
    if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
        return self.split_tokens_on_unicode(tokens)
    return self.split_tokens_on_spaces(tokens)

For Chinese, Japanese, Thai, Lao, Burmese, and Cantonese — languages that don't use spaces between words — the method splits at every valid unicode code point boundary via split_tokens_on_unicode(). For European and other space-delimited languages, split_tokens_on_spaces() groups BPE subword tokens that don't start with a space, handling punctuation as separate words.

This distinction matters for word-level timestamps (Article 6), where each "word" needs a start and end time. For Japanese, every character is essentially its own word; for English, multiple BPE tokens like ["un", "believ", "able"] merge into a single word "unbelievable".
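The space-delimited grouping can be sketched as follows, assuming GPT-2-style BPE pieces that carry their leading space (a simplification: the real split_tokens_on_spaces also separates punctuation into its own words):

```python
def group_words(pieces):
    """Merge BPE pieces into words: a piece starting with a space
    begins a new word; anything else continues the previous one."""
    words = []
    for piece in pieces:
        if piece.startswith(" ") or not words:
            words.append(piece)
        else:
            words[-1] += piece  # subword continuation
    return [w.strip() for w in words]

assert group_words([" un", "believ", "able", " news"]) == ["unbelievable", "news"]
```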

The non_speech_tokens property returns token IDs that should be suppressed during generation — things like musical note symbols (♪♪♪), speaker tags ([DAVID]), and bracketed annotations. The implementation carefully handles multi-byte unicode symbols that might encode to multiple BPE tokens, suppressing only the first token in such cases to prevent those sequences from starting.

The get_tokenizer Factory

The top-level get_tokenizer() function is the public API for creating configured tokenizers. It's also @lru_cache'd, so repeated calls with the same arguments return the same instance:

@lru_cache(maxsize=None)
def get_tokenizer(
    multilingual: bool,
    *,
    num_languages: int = 99,
    language: Optional[str] = None,
    task: Optional[str] = None,
) -> Tokenizer:

It resolves language aliases, selects the right encoding ("multilingual" vs "gpt2"), and for English-only models, explicitly sets language=None and task=None to omit those tokens from the SOT sequence.

The caching here is important for performance: the Tokenizer.__post_init__ builds the special_tokens dict by iterating over all special token strings, and get_encoding() parses the vocabulary file. Both are non-trivial operations that should happen only once.
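The caching guarantee is plain lru_cache identity semantics, which a minimal stand-in (not whisper's actual factory) demonstrates:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def make_config(name: str):
    """Hypothetical stand-in for get_tokenizer: builds a fresh dict per
    distinct argument, then returns the cached instance on repeat calls."""
    return {"name": name}

# Same arguments -> same object; different arguments -> different objects
assert make_config("multilingual") is make_config("multilingual")
assert make_config("gpt2") is not make_config("multilingual")
```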

What's Next

With the tokenizer's vocabulary and special token structure understood, we're ready to examine how the decoder uses these tokens. In Article 4, we'll dive into the decoding system — the most architecturally rich part of the codebase — covering the Strategy pattern for token selection, the chain-of-responsibility LogitFilter pipeline, and the KV-cache implementation that makes autoregressive generation efficient.