Read OSS

From Source Code to ZIR: The Zig Compiler Frontend

Intermediate

Prerequisites

  • Article 1: Architecture of the Zig Compiler
  • Familiarity with tokenization and recursive-descent parsing concepts

As we saw in Article 1, the Zig compiler's frontend lives in lib/std/zig/ rather than src/. This placement is deliberate: it lets zig fmt, the Zig Language Server (ZLS), and other tools share the exact same tokenizer, parser, and AstGen code without pulling in Sema, codegen, or any linker. In this article, we'll trace how raw source bytes become ZIR (Zig Intermediate Representation), the flat instruction stream that feeds into semantic analysis.

Why the Frontend Lives in lib/std/zig/

The compiler frontend produces three artifacts in sequence: a token stream, an AST, and ZIR. All three representations are defined and produced within the standard library:

flowchart LR
    subgraph "lib/std/zig/ (shared)"
        TOK["tokenizer.zig"] --> PARSE["Parse.zig"]
        PARSE --> AST["Ast.zig"]
        AST --> AG["AstGen.zig"]
        AG --> ZIR["Zir.zig"]
    end

    subgraph "Tools"
        FMT["zig fmt"]
        ZLS["ZLS"]
        AC["zig ast-check"]
    end

    subgraph "src/ (compiler only)"
        SEMA["Sema.zig"]
    end

    TOK -.->|"reuses"| FMT
    AST -.->|"reuses"| FMT
    AST -.->|"reuses"| ZLS
    ZIR -->|"feeds into"| SEMA

This architecture means that zig fmt can parse and re-render source code using the same parser the compiler uses, guaranteeing format consistency. ZLS can build ASTs for IDE features without any compiler overhead. And zig ast-check can run AstGen to catch errors without invoking Sema.

The key insight: everything up through ZIR is untyped. No semantic analysis, no type resolution, no comptime evaluation. That's what makes it safe to share — these phases are pure syntax-level transformations.
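
To make the "no Sema required" point concrete, here is a minimal sketch of an ast-check-style pass built only on the shared frontend. It assumes a recent Zig release in which std.zig.Ast.parse, std.zig.AstGen.generate, and Zir.hasCompileErrors have the shapes used here; frontendOk is a hypothetical helper name, not part of the standard library.

```zig
const std = @import("std");

/// Hypothetical helper: runs parse + AstGen on `source` and reports whether
/// the frontend accepted it. No semantic analysis is involved.
fn frontendOk(gpa: std.mem.Allocator, source: [:0]const u8) !bool {
    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);
    // Syntax errors are recorded on the tree rather than returned as errors.
    if (tree.errors.len != 0) return false;

    var zir = try std.zig.AstGen.generate(gpa, tree);
    defer zir.deinit(gpa);
    // AstGen-level errors (such as an unused local) are recorded in the ZIR.
    return !zir.hasCompileErrors();
}

pub fn main() !void {
    const gpa = std.heap.page_allocator;
    std.debug.print("valid source:  {}\n", .{try frontendOk(gpa, "const x = 5;")});
    std.debug.print("syntax error:  {}\n", .{try frontendOk(gpa, "const x = ;")});
    std.debug.print("unused local:  {}\n", .{try frontendOk(gpa, "fn f() void { const y = 1; }")});
}
```

The third input parses cleanly but is rejected during AstGen, which is the kind of error a syntax-only tool would miss and a Sema-free tool can still report.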

Tokenization: Source Bytes to Token Stream

The tokenizer at lib/std/zig/tokenizer.zig converts raw UTF-8 source bytes into a stream of Token values. Each token is remarkably compact:

pub const Token = struct {
    tag: Tag,
    loc: Loc,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };
};

That's it — a tag identifying the token type and a byte range into the original source. The Tag enum at the top of the file includes all Zig keywords (mapped via a StaticStringMap at line 12), operators, literals, and punctuation.

flowchart TD
    SRC["Source: 'const x = 5;'"] --> T1["const → keyword_const"]
    SRC --> T2["x → identifier"]
    SRC --> T3["= → equal"]
    SRC --> T4["5 → number_literal"]
    SRC --> T5["; → semicolon"]

The tokenizer itself is lazy: it allocates no memory and builds no list. Each call to next() scans forward and returns a single token. The full compilation path, however, drives next() to completion before parsing begins and stores the results in a MultiArrayList, because the parser and later phases (AstGen, error reporting) need to jump to arbitrary token positions.
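
The next() loop is easy to try directly. The sketch below runs std.zig.Tokenizer over the same snippet as the diagram and prints each tag with the source bytes it covers; countTokens is a hypothetical helper added for illustration.

```zig
const std = @import("std");

/// Hypothetical helper: counts tokens in `source`, excluding the final eof.
fn countTokens(source: [:0]const u8) usize {
    var tokenizer = std.zig.Tokenizer.init(source);
    var count: usize = 0;
    while (true) {
        const token = tokenizer.next();
        if (token.tag == .eof) break;
        count += 1;
    }
    return count;
}

pub fn main() void {
    const source = "const x = 5;";
    var tokenizer = std.zig.Tokenizer.init(source);
    while (true) {
        // Each call scans forward from the previous position; nothing is allocated.
        const token = tokenizer.next();
        if (token.tag == .eof) break;
        const bytes = source[token.loc.start..token.loc.end];
        std.debug.print("{s}: '{s}'\n", .{ @tagName(token.tag), bytes });
    }
    std.debug.print("total: {d}\n", .{countTokens(source)});
}
```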

Tip: The keyword lookup uses std.StaticStringMap, a lookup table built entirely at compile time. Keys are grouped by length, so a lookup compares the candidate only against keywords of the same length, with no runtime hash table to build or allocate.
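
As a rough illustration of that lookup, the sketch below builds a three-entry map with std.StaticStringMap. The Keyword enum and classify function are hypothetical stand-ins; the real table covers every Zig keyword.

```zig
const std = @import("std");

// Hypothetical three-keyword subset of the real table.
const Keyword = enum { keyword_const, keyword_fn, keyword_return };

// The table is constructed at compile time; no allocator is involved.
const keywords = std.StaticStringMap(Keyword).initComptime(.{
    .{ "const", .keyword_const },
    .{ "fn", .keyword_fn },
    .{ "return", .keyword_return },
});

/// Returns the keyword for `bytes`, or null if it is an ordinary identifier.
fn classify(bytes: []const u8) ?Keyword {
    return keywords.get(bytes);
}

pub fn main() void {
    for ([_][]const u8{ "fn", "foo", "return" }) |word| {
        if (classify(word)) |kw| {
            std.debug.print("{s}: {s}\n", .{ word, @tagName(kw) });
        } else {
            std.debug.print("{s}: identifier\n", .{word});
        }
    }
}
```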

Parsing: Tokens to AST

The parser at lib/std/zig/Parse.zig is a classic recursive-descent parser. It consumes the token stream and produces an AST defined in lib/std/zig/Ast.zig.

The AST representation is unconventional. Rather than heap-allocated tree nodes with pointers, Zig uses a flat multi-array-list:

pub const NodeList = std.MultiArrayList(Node);

// Each node has:
//   tag: Node.Tag    — what kind of syntax node
//   main_token: u32  — index into the token list
//   data: Data       — two u32 fields (lhs/rhs or extra indices)

This structure stores all nodes in a contiguous array, with tag, main_token, and data as parallel arrays (the MultiArrayList layout). Node children are referenced by integer index rather than pointer. For nodes with more than two children, the data field stores an index into a separate extra_data: []u32 array.

flowchart TD
    subgraph "Flat AST Layout"
        TAGS["tags:    [fn_decl, block, return, ...]"]
        TOKENS["tokens:  [0, 5, 8, ...]"]
        DATA["data:    [{lhs:1, rhs:2}, ...]"]
        EXTRA["extra_data: [3, 7, 9, ...]"]
    end
    DATA -->|"overflow"| EXTRA
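
The layout above can be sketched with std.MultiArrayList directly. The Node type here is a hypothetical, heavily simplified version of the real one, and the convention that a call node's lhs..rhs names a range in extra_data is an assumption made for this example.

```zig
const std = @import("std");

// Hypothetical, simplified node type; the real Ast.Node has many more tags.
const Node = struct {
    tag: Tag,
    main_token: u32,
    data: Data,

    const Tag = enum { number, add, call };
    const Data = struct { lhs: u32, rhs: u32 };
};

/// For a `call` node in this sketch, lhs..rhs is a range in extra_data
/// holding the node indices of the arguments.
fn callArgs(data: Node.Data, extra_data: []const u32) []const u32 {
    return extra_data[data.lhs..data.rhs];
}

pub fn main() !void {
    const gpa = std.heap.page_allocator;
    var nodes: std.MultiArrayList(Node) = .{};
    defer nodes.deinit(gpa);

    // Three leaves, then a call whose three arguments overflow into extra_data.
    try nodes.append(gpa, .{ .tag = .number, .main_token = 2, .data = .{ .lhs = 0, .rhs = 0 } });
    try nodes.append(gpa, .{ .tag = .number, .main_token = 4, .data = .{ .lhs = 0, .rhs = 0 } });
    try nodes.append(gpa, .{ .tag = .number, .main_token = 6, .data = .{ .lhs = 0, .rhs = 0 } });
    const extra_data = [_]u32{ 0, 1, 2 };
    try nodes.append(gpa, .{ .tag = .call, .main_token = 0, .data = .{ .lhs = 0, .rhs = 3 } });

    // Each field is stored as its own contiguous array.
    const tags = nodes.items(.tag);
    const data = nodes.items(.data);
    std.debug.print("node 3 is {s} with {d} args\n", .{
        @tagName(tags[3]),
        callArgs(data[3], &extra_data).len,
    });
}
```

Scanning only the tags array touches one byte per node here, which is the main payoff of splitting fields into parallel arrays.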

The parser itself tracks state with just a few fields — a gpa allocator, the source bytes, a tokens slice, and the current token index tok_i. Functions like parseContainerDeclaration, parseExpr, and parseStatement follow the grammar rules directly.

The parser is error-recovering. Rather than aborting on the first syntax error, it records errors in an errors: std.ArrayList(AstError) and attempts to continue parsing. This is critical for IDE support where partial, malformed files are the norm.
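
Error recovery is visible from the public API. In the sketch below, the first declaration is malformed and the second is fine; std.zig.Ast.parse still returns a tree, with the problems listed in its errors slice. The helper name parseErrorCount is hypothetical, and the exact number and kind of errors reported may vary between Zig versions.

```zig
const std = @import("std");

/// Hypothetical helper: parses `source` and returns how many errors were recorded.
fn parseErrorCount(gpa: std.mem.Allocator, source: [:0]const u8) !usize {
    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);
    return tree.errors.len;
}

pub fn main() !void {
    const gpa = std.heap.page_allocator;
    // The first declaration is missing its initializer; the second is valid.
    const source = "const a = ;\nconst b = 2;\n";

    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);

    // A tree is produced even though the input is malformed.
    std.debug.print("errors: {d}, nodes: {d}\n", .{ tree.errors.len, tree.nodes.len });
    for (tree.errors) |err| {
        std.debug.print("  {s} at token {d}\n", .{ @tagName(err.tag), err.token });
    }
}
```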

AstGen: AST to ZIR

AstGen is where the tree-shaped AST gets flattened into ZIR — a linear instruction stream. This phase lives at lib/std/zig/AstGen.zig and is the most complex part of the frontend.

The AstGen struct carries substantial state during lowering:

gpa: Allocator,
tree: *const Ast,
nodes_need_rl: *const AstRlAnnotate.RlNeededSet,
instructions: std.MultiArrayList(Zir.Inst) = .{},
extra: ArrayList(u32) = .empty,
string_bytes: ArrayList(u8) = .empty,
source_offset: u32 = 0,
source_line: u32 = 0,
source_column: u32 = 0,

The source_offset, source_line, and source_column fields form a cursor that tracks position through the source file. This cursor is maintained throughout the entire lowering process to avoid O(N²) line scanning — a clever optimization for large files.
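
The idea behind the cursor can be sketched in a few lines. The Cursor type and advanceTo method below are hypothetical; they reuse the three field names from the struct above and assume positions are requested in non-decreasing order, so each byte of the source is scanned at most once.

```zig
const std = @import("std");

// Hypothetical cursor mirroring the three fields shown above.
const Cursor = struct {
    source_offset: u32 = 0,
    source_line: u32 = 0,
    source_column: u32 = 0,

    /// Advances to `target`, scanning only the bytes since the previous call.
    fn advanceTo(self: *Cursor, source: []const u8, target: u32) void {
        std.debug.assert(target >= self.source_offset);
        for (source[self.source_offset..target]) |byte| {
            if (byte == '\n') {
                self.source_line += 1;
                self.source_column = 0;
            } else {
                self.source_column += 1;
            }
        }
        self.source_offset = target;
    }
};

pub fn main() void {
    const source = "ab\ncd\nef";
    var cursor = Cursor{};
    for ([_]u32{ 4, 7 }) |target| {
        cursor.advanceTo(source, target);
        std.debug.print("offset {d} -> line {d}, column {d}\n", .{
            cursor.source_offset, cursor.source_line, cursor.source_column,
        });
    }
}
```

Without the saved position, each lookup would rescan from byte 0, which is where the quadratic cost would come from.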

AstGen introduces several key concepts that don't exist at the AST level:

  1. Result locations: Zig's result location semantics (where an expression "knows" where to write its result) are encoded into ZIR through the nodes_need_rl annotation set.

  2. Scope tracking: AstGen manages lexical scopes, tracking which names are in scope and how break/continue/return targets resolve.

  3. Comptime annotations: comptime blocks and expressions get marked in ZIR so Sema knows to evaluate them at compile time.

  4. Source hash computation: AstGen computes incremental hashes of declaration bodies, stored in ZIR for later use by the incremental compilation system.

flowchart TD
    AST["AST Node: fn_decl"] --> SCOPE["Push function scope"]
    SCOPE --> PARAMS["Generate ZIR for parameters"]
    PARAMS --> BODY["Generate ZIR for body"]
    BODY --> RET["Handle return type / result location"]
    RET --> POP["Pop function scope"]
    POP --> ZIR["ZIR instructions appended"]
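
Scope tracking (item 2 above) can be pictured as a chain of scopes linked to their parents, where resolving a name means walking outward. The sketch below is a deliberately small model under that assumption; the Scope type and its fields are hypothetical and much simpler than what AstGen actually maintains.

```zig
const std = @import("std");

// Hypothetical scope node: one local binding plus a link to the enclosing scope.
const Scope = struct {
    parent: ?*const Scope,
    name: []const u8,
    inst: u32, // index of the instruction that produced the bound value

    /// Walks outward from the innermost scope until the name is found.
    fn lookup(self: *const Scope, name: []const u8) ?u32 {
        var it: ?*const Scope = self;
        while (it) |scope| : (it = scope.parent) {
            if (std.mem.eql(u8, scope.name, name)) return scope.inst;
        }
        return null;
    }
};

pub fn main() void {
    const outer = Scope{ .parent = null, .name = "x", .inst = 10 };
    const inner = Scope{ .parent = &outer, .name = "y", .inst = 11 };
    std.debug.print("x -> {d}, z found: {}\n", .{
        inner.lookup("x").?,
        inner.lookup("z") != null,
    });
}
```

Pushing a scope is just creating a new node that points at the current one, and popping is discarding it, which matches the push/pop steps in the diagram.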

ZIR Data Structure: Instructions, Extra, and String Bytes

ZIR's memory layout is the culmination of the frontend pipeline. Defined in lib/std/zig/Zir.zig, it consists of three parallel data stores:

instructions: std.MultiArrayList(Inst).Slice,
string_bytes: []u8,
extra: []u32,

instructions is the core array. Each Inst has a tag (an enum(u8) identifying the instruction kind) and a fixed-size, 8-byte data payload, an untagged union whose active field is determined by the tag. Depending on the tag, the payload holds direct operands, an index into extra, or a reference to another instruction.

extra is a variable-length sidecar. Instructions whose operands do not fit in the fixed-size data payload store an index into extra, which holds the remaining fields. A function call, for example, takes a variable number of arguments, so its argument information lives in extra and the instruction's data field points to it.

string_bytes pools all string data — identifiers, string literals, error messages. Instructions reference strings by index into this array.

flowchart LR
    subgraph "ZIR Memory Layout"
        direction TB
        I["instructions\ntag: [alloc, load, call, ...]\ndata: [0, 3, 42, ...]"]
        E["extra\n[arg0_ref, arg1_ref, ret_type, ...]"]
        S["string_bytes\n['m','a','i','n',0,'x',0,...]"]
    end
    I -->|"data index"| E
    I -->|"string index"| S
    E -->|"string index"| S
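
To show how the three stores reference one another, here is a miniature model. MiniZir is hypothetical: it collapses each instruction's payload to a single u32 for brevity and invents its own layout for a call's entry in extra (a length followed by that many argument references).

```zig
const std = @import("std");

// Hypothetical miniature of the three ZIR stores.
const MiniZir = struct {
    tags: []const Tag,
    data: []const u32,
    extra: []const u32,
    string_bytes: []const u8,

    const Tag = enum(u8) { str, call };

    /// Reads the null-terminated string that starts at `index` in string_bytes.
    fn string(self: MiniZir, index: u32) []const u8 {
        return std.mem.sliceTo(self.string_bytes[index..], 0);
    }

    /// For a `call`, data is an index into extra: [args_len, arg0, arg1, ...].
    fn callArgs(self: MiniZir, inst: u32) []const u32 {
        const payload = self.data[inst];
        const args_len = self.extra[payload];
        return self.extra[payload + 1 ..][0..args_len];
    }
};

pub fn main() void {
    const zir = MiniZir{
        .tags = &.{ .str, .str, .call },
        .data = &.{ 0, 5, 0 },
        .extra = &.{ 2, 0, 1 },
        .string_bytes = "main\x00x\x00",
    };
    std.debug.print("inst 2 is {s}\n", .{@tagName(zir.tags[2])});
    for (zir.callArgs(2)) |arg| {
        // Each argument is itself an instruction index; no pointers anywhere.
        std.debug.print("  arg inst {d}: \"{s}\"\n", .{ arg, zir.string(zir.data[arg]) });
    }
}
```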

When ZIR is cached to disk, it's prefixed with a Header that records the lengths of all three arrays plus file metadata (inode, size, mtime) for cache invalidation.

This flat representation has important performance properties. Unlike a tree, ZIR can be processed sequentially by Sema with excellent cache locality. There are no pointer chases — every reference is a u32 index. And the entire ZIR for a file can be serialized to and from disk as a contiguous blob.

Tip: ZIR is self-contained. Once generated, it holds everything Sema needs — no references back to the AST, token list, or source bytes. This is stated explicitly in the file header comment: "machine code, without any memory access into the AST tree token list, node list, or source bytes."

Triggering AstGen: From File Update to ZIR

The connection between the frontend and the compiler happens in src/Zcu/PerThread.zig. The updateFile() function coordinates the entire frontend pipeline for a single source file:

sequenceDiagram
    participant Comp as Compilation
    participant PT as PerThread
    participant FS as FileSystem
    participant FE as Frontend (lib/std/zig/)

    Comp->>PT: updateFile(file_index, file)
    PT->>FS: stat + open source file
    PT->>FE: tokenize (source → tokens)
    FE-->>PT: Token list
    PT->>FE: parse (tokens → AST)
    FE-->>PT: Ast
    PT->>FE: AstGen (AST → ZIR)
    FE-->>PT: Zir
    PT->>PT: Cache ZIR to disk

The updateFile function first checks the file's stat to see if the source has changed since the last compilation. If the file hasn't changed and cached ZIR exists, it can skip the entire frontend pipeline. This is the first layer of incremental compilation — before Sema even gets involved, unchanged files reuse their cached ZIR.

When AstGen does run, the generated ZIR is written to one of two cache directories on the Zcu, global_zir_cache or local_zir_cache, depending on the file. On subsequent compilations, the cache check happens before tokenization, so unchanged files incur zero frontend cost.
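
The freshness check amounts to comparing the metadata stored in the cached header against a fresh stat of the source file. The sketch below models only that comparison; Stat, CachedHeader, and cacheIsFresh are hypothetical names, and the field set is an assumption based on the metadata listed earlier (inode, size, mtime).

```zig
const std = @import("std");

// Hypothetical stand-ins for a stat result and the cached ZIR header.
const Stat = struct { inode: u64, size: u64, mtime: i128 };

const CachedHeader = struct {
    instructions_len: u32,
    string_bytes_len: u32,
    extra_len: u32,
    stat_inode: u64,
    stat_size: u64,
    stat_mtime: i128,
};

/// True when the cached ZIR was produced from the file as it exists now.
fn cacheIsFresh(header: CachedHeader, stat: Stat) bool {
    return header.stat_inode == stat.inode and
        header.stat_size == stat.size and
        header.stat_mtime == stat.mtime;
}

pub fn main() void {
    const header = CachedHeader{
        .instructions_len = 120,
        .string_bytes_len = 64,
        .extra_len = 300,
        .stat_inode = 42,
        .stat_size = 1024,
        .stat_mtime = 1_700_000_000,
    };
    const unchanged = Stat{ .inode = 42, .size = 1024, .mtime = 1_700_000_000 };
    const edited = Stat{ .inode = 42, .size = 1050, .mtime = 1_700_000_500 };
    std.debug.print("unchanged file reuses cache: {}\n", .{cacheIsFresh(header, unchanged)});
    std.debug.print("edited file reuses cache:    {}\n", .{cacheIsFresh(header, edited)});
}
```

Because the check needs only a stat call and a header read, a cache hit never touches the source bytes at all.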

What's Next

We've now traced source bytes through tokenization, parsing, and AstGen into ZIR. In Article 3, we'll cross the boundary from lib/std/zig/ into src/ and explore Sema — the 37K-line heart of the compiler that transforms untyped ZIR into fully-typed AIR. We'll also dive deep into the InternPool, the universal store where every type and value in the entire compilation lives as a u32 index.