From Source Code to ZIR: The Zig Compiler Frontend
Prerequisites
- Article 1: Architecture of the Zig Compiler
- Familiarity with tokenization and recursive-descent parsing concepts
As we saw in Article 1, the Zig compiler's frontend lives in lib/std/zig/ rather than src/. This placement is a deliberate design choice: it lets zig fmt, the Zig Language Server (ZLS), and other tools share the exact same tokenizer, parser, and AstGen code without pulling in Sema, codegen, or any linker. In this article, we'll trace how raw source bytes become ZIR (Zig Intermediate Representation), the flat instruction stream that feeds into semantic analysis.
Why the Frontend Lives in lib/std/zig/
The compiler frontend produces three artifacts in sequence: a token stream, an AST, and ZIR. All three representations are defined and produced within the standard library:
flowchart LR
subgraph "lib/std/zig/ (shared)"
TOK["tokenizer.zig"] --> PARSE["Parse.zig"]
PARSE --> AST["Ast.zig"]
AST --> AG["AstGen.zig"]
AG --> ZIR["Zir.zig"]
end
subgraph "Tools"
FMT["zig fmt"]
ZLS["ZLS"]
AC["zig ast-check"]
end
subgraph "src/ (compiler only)"
SEMA["Sema.zig"]
end
TOK -.->|"reuses"| FMT
AST -.->|"reuses"| FMT
AST -.->|"reuses"| ZLS
ZIR -->|"feeds into"| SEMA
This architecture means that zig fmt can parse and re-render source code using the same parser the compiler uses, guaranteeing format consistency. ZLS can build ASTs for IDE features without any compiler overhead. And zig ast-check can run AstGen to catch errors without invoking Sema.
The key insight: everything up through ZIR is untyped. No semantic analysis, no type resolution, no comptime evaluation. That's what makes it safe to share — these phases are pure syntax-level transformations.
Tokenization: Source Bytes to Token Stream
The tokenizer at lib/std/zig/tokenizer.zig converts raw UTF-8 source bytes into a stream of Token values. Each token is remarkably compact:
pub const Token = struct {
    tag: Tag,
    loc: Loc,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };
};
That's it — a tag identifying the token type and a byte range into the original source. The Tag enum at the top of the file includes all Zig keywords (mapped via a StaticStringMap at line 12), operators, literals, and punctuation.
flowchart TD
SRC["Source: 'const x = 5;'"] --> T1["const → keyword_const"]
SRC --> T2["x → identifier"]
SRC --> T3["= → equal"]
SRC --> T4["5 → number_literal"]
SRC --> T5["; → semicolon"]
The tokenizer is designed to be lazy — it doesn't allocate memory or build a list. The parser calls a next() method that advances through the source one token at a time. However, the full compilation path does tokenize everything upfront into a MultiArrayList for random access. This is because later phases (AstGen, error reporting) need to jump to arbitrary token positions.
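To make the lazy, allocation-free design concrete, here is a minimal sketch that drives the tokenizer by hand. It assumes the `std.zig.Tokenizer` API as found in recent Zig releases (`init` taking a null-terminated source and `next()` returning a `Token`); the helper name `countTokens` is made up for illustration.

```zig
const std = @import("std");

/// Counts the tokens in `source`, excluding the trailing `.eof` token.
/// Nothing is allocated: the tokenizer only walks the byte buffer.
pub fn countTokens(source: [:0]const u8) usize {
    var tokenizer = std.zig.Tokenizer.init(source);
    var count: usize = 0;
    while (true) {
        const token = tokenizer.next();
        if (token.tag == .eof) break;
        count += 1;
    }
    return count;
}

pub fn main() void {
    const source: [:0]const u8 = "const x = 5;";
    var tokenizer = std.zig.Tokenizer.init(source);
    while (true) {
        const token = tokenizer.next();
        if (token.tag == .eof) break;
        // A token holds only a tag and a byte range; the text is recovered by slicing.
        const text = source[token.loc.start..token.loc.end];
        std.debug.print("{s}: \"{s}\"\n", .{ @tagName(token.tag), text });
    }
}
```

Because a token never owns its text, the source buffer has to outlive every token taken from it.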
Tip: The keyword lookup uses std.StaticStringMap, a lookup table built at compile time. It groups keys by length and only compares candidates of matching length, so keyword detection needs no runtime hash table and no allocation.
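A small sketch of the same lookup technique, assuming the `std.StaticStringMap(V).initComptime` and `get` API from recent Zig releases. The three-entry `Keyword` enum and the `classify` helper are hypothetical stand-ins for the tokenizer's full keyword table.

```zig
const std = @import("std");

// Hypothetical three-keyword subset; the real table covers every Zig keyword.
const Keyword = enum { keyword_const, keyword_fn, keyword_return };

// The table is built at compile time, so no allocator is involved.
const keywords = std.StaticStringMap(Keyword).initComptime(.{
    .{ "const", .keyword_const },
    .{ "fn", .keyword_fn },
    .{ "return", .keyword_return },
});

/// Returns the keyword for `bytes`, or null if it is an ordinary identifier.
pub fn classify(bytes: []const u8) ?Keyword {
    return keywords.get(bytes);
}
```

A tokenizer would typically scan an identifier first and then call something like `classify` on the scanned bytes to decide between a keyword tag and `identifier`.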
Parsing: Tokens to AST
The parser at lib/std/zig/Parse.zig is a classic recursive-descent parser. It consumes the token stream and produces an AST defined in lib/std/zig/Ast.zig.
The AST representation is unconventional. Rather than heap-allocated tree nodes with pointers, Zig uses a flat multi-array-list:
pub const NodeList = std.MultiArrayList(Node);
// Each node has:
// tag: Node.Tag — what kind of syntax node
// main_token: u32 — index into the token list
// data: Data — two u32 fields (lhs/rhs or extra indices)
This structure stores all nodes in a contiguous array, with tag, main_token, and data as parallel arrays (the MultiArrayList layout). Node children are referenced by integer index rather than pointer. For nodes with more than two children, the data field stores an index into a separate extra_data: []u32 array.
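The layout can be tried out with `std.MultiArrayList` directly. The sketch below uses a hypothetical, heavily simplified `Node` (two tags only) and a made-up helper `buildAdd`; it is meant to show index-based children and per-field arrays, not the real `Ast.Node`.

```zig
const std = @import("std");

// Hypothetical, simplified node; the real Ast.Node has many more tags.
const Node = struct {
    tag: Tag,
    main_token: u32,
    data: Data,

    const Tag = enum(u8) { number_literal, add };
    const Data = struct { lhs: u32, rhs: u32 };
};

/// Appends the nodes for `1 + 2` and returns the index of the root node.
/// Tokens are assumed to be: 0 = `1`, 1 = `+`, 2 = `2`.
pub fn buildAdd(gpa: std.mem.Allocator, nodes: *std.MultiArrayList(Node)) !u32 {
    try nodes.append(gpa, .{ .tag = .number_literal, .main_token = 0, .data = .{ .lhs = 0, .rhs = 0 } });
    try nodes.append(gpa, .{ .tag = .number_literal, .main_token = 2, .data = .{ .lhs = 0, .rhs = 0 } });
    // Children are referenced by index (0 and 1), not by pointer.
    try nodes.append(gpa, .{ .tag = .add, .main_token = 1, .data = .{ .lhs = 0, .rhs = 1 } });
    const root: u32 = @intCast(nodes.len - 1);
    return root;
}

pub fn main() !void {
    const gpa = std.heap.page_allocator;
    var nodes: std.MultiArrayList(Node) = .{};
    defer nodes.deinit(gpa);

    const root = try buildAdd(gpa, &nodes);
    // Each field lives in its own contiguous array.
    const tags = nodes.items(.tag);
    const data = nodes.items(.data);
    std.debug.print("root={d} tag={s} lhs={d} rhs={d}\n", .{
        root, @tagName(tags[root]), data[root].lhs, data[root].rhs,
    });
}
```

Scanning `nodes.items(.tag)` touches only the tag bytes, which is the main reason this layout is friendlier to the cache than an array of full node structs.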
flowchart TD
subgraph "Flat AST Layout"
TAGS["tags: [fn_decl, block, return, ...]"]
TOKENS["tokens: [0, 5, 8, ...]"]
DATA["data: [{lhs:1, rhs:2}, ...]"]
EXTRA["extra_data: [3, 7, 9, ...]"]
end
DATA -->|"overflow"| EXTRA
The parser itself tracks state with just a few fields — a gpa allocator, the source bytes, a tokens slice, and the current token index tok_i. Functions like parseContainerDeclaration, parseExpr, and parseStatement follow the grammar rules directly.
The parser is error-recovering. Rather than aborting on the first syntax error, it records errors in an errors: std.ArrayList(AstError) and attempts to continue parsing. This is critical for IDE support where partial, malformed files are the norm.
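Error recovery is easy to observe from the outside. This sketch assumes the `std.zig.Ast.parse(gpa, source, .zig)` entry point and the `errors` slice on the returned tree, as found in recent Zig releases; `countParseErrors` is a hypothetical helper.

```zig
const std = @import("std");

/// Parses `source` and returns how many syntax errors the parser recorded.
/// Parsing itself only fails on allocation failure, not on bad syntax.
pub fn countParseErrors(gpa: std.mem.Allocator, source: [:0]const u8) !usize {
    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);
    return tree.errors.len;
}

pub fn main() !void {
    const gpa = std.heap.page_allocator;
    // The first declaration is malformed; the second one is fine.
    const source: [:0]const u8 = "const a = ;\nconst b = 2;";
    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);
    // A tree is returned even though the source contains a syntax error.
    std.debug.print("errors: {d}, nodes: {d}\n", .{ tree.errors.len, tree.nodes.len });
}
```

Tools such as a language server can inspect `tree.errors` for diagnostics while still walking whatever nodes the parser managed to build.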
AstGen: AST to ZIR
AstGen is where the tree-shaped AST gets flattened into ZIR — a linear instruction stream. This phase lives at lib/std/zig/AstGen.zig and is the most complex part of the frontend.
The AstGen struct carries substantial state during lowering:
gpa: Allocator,
tree: *const Ast,
nodes_need_rl: *const AstRlAnnotate.RlNeededSet,
instructions: std.MultiArrayList(Zir.Inst) = .{},
extra: ArrayList(u32) = .empty,
string_bytes: ArrayList(u8) = .empty,
source_offset: u32 = 0,
source_line: u32 = 0,
source_column: u32 = 0,
The source_offset, source_line, and source_column fields form a cursor that tracks position through the source file. This cursor is maintained throughout the entire lowering process to avoid O(N²) line scanning — a clever optimization for large files.
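The idea behind the cursor can be sketched in isolation. The `Cursor` type below is hypothetical and only mirrors the three fields above; it assumes positions are requested in non-decreasing order, which is what lets each byte be scanned at most once.

```zig
const std = @import("std");

// Hypothetical cursor mirroring source_offset, source_line, and source_column.
const Cursor = struct {
    source: []const u8,
    offset: u32 = 0,
    line: u32 = 0,
    column: u32 = 0,

    /// Advances to `target`, scanning only the bytes between the old and
    /// new offsets instead of rescanning from the start of the file.
    pub fn advanceTo(self: *Cursor, target: u32) void {
        std.debug.assert(target >= self.offset and target <= self.source.len);
        while (self.offset < target) : (self.offset += 1) {
            if (self.source[self.offset] == '\n') {
                self.line += 1;
                self.column = 0;
            } else {
                self.column += 1;
            }
        }
    }
};
```

Computing the line for each of N positions by rescanning from byte 0 costs O(N) per lookup; carrying the cursor forward makes the total work linear in the file size.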
AstGen introduces several key concepts that don't exist at the AST level:
- Result locations: Zig's result location semantics (where an expression "knows" where to write its result) are encoded into ZIR through the nodes_need_rl annotation set.
- Scope tracking: AstGen manages lexical scopes, tracking which names are in scope and how break/continue/return targets resolve.
- Comptime annotations: comptime blocks and expressions get marked in ZIR so Sema knows to evaluate them at compile time.
- Source hash computation: AstGen computes incremental hashes of declaration bodies, stored in ZIR for later use by the incremental compilation system.
flowchart TD
AST["AST Node: fn_decl"] --> SCOPE["Push function scope"]
SCOPE --> PARAMS["Generate ZIR for parameters"]
PARAMS --> BODY["Generate ZIR for body"]
BODY --> RET["Handle return type / result location"]
RET --> POP["Pop function scope"]
POP --> ZIR["ZIR instructions appended"]
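AstGen can also be run on its own, which is roughly what an ast-check style tool does. The sketch below assumes `std.zig.AstGen.generate(gpa, tree)` and `Zir.hasCompileErrors()` as they exist in recent Zig releases (signatures may differ between versions), and it only feeds syntactically valid source to AstGen. The helper name `astgenFindsErrors` is made up.

```zig
const std = @import("std");

/// Runs parse + AstGen only (no Sema) and reports whether the resulting
/// ZIR carries compile errors. `source` is assumed to be syntactically valid.
pub fn astgenFindsErrors(gpa: std.mem.Allocator, source: [:0]const u8) !bool {
    var tree = try std.zig.Ast.parse(gpa, source, .zig);
    defer tree.deinit(gpa);

    var zir = try std.zig.AstGen.generate(gpa, tree);
    defer zir.deinit(gpa);

    return zir.hasCompileErrors();
}

pub fn main() !void {
    const gpa = std.heap.page_allocator;
    // An unused local constant is diagnosed during AstGen, before any type analysis.
    const bad = try astgenFindsErrors(gpa, "pub fn f() void { const x = 1; }");
    const good = try astgenFindsErrors(gpa, "pub fn f() void {}");
    std.debug.print("bad: {}, good: {}\n", .{ bad, good });
}
```

The unused-local check is a good example of an error that needs scope tracking but no type information, which is why it can live in the frontend.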
ZIR Data Structure: Instructions, Extra, and String Bytes
ZIR's memory layout is the culmination of the frontend pipeline. Defined in lib/std/zig/Zir.zig, it consists of three parallel data stores:
instructions: std.MultiArrayList(Inst).Slice,
string_bytes: []u8,
extra: []u32,
instructions is the core array. Each Inst has a tag (an enum(u8) identifying the instruction kind) and a small fixed-size data payload, declared as an untagged 8-byte union. The tag determines how to interpret the data: it might hold a direct operand, a reference to another instruction, a string index, or an index into extra.
extra is a variable-length sidecar. Instructions that need more than fits in the fixed-size payload store an index into extra, which holds the additional fields. For example, a function call keeps its argument list in extra rather than in the instruction itself.
string_bytes pools all string data — identifiers, string literals, error messages. Instructions reference strings by index into this array.
flowchart LR
subgraph "ZIR Memory Layout"
direction TB
I["instructions\ntag: [alloc, load, call, ...]\ndata: [0, 3, 42, ...]"]
E["extra\n[arg0_ref, arg1_ref, ret_type, ...]"]
S["string_bytes\n['m','a','i','n',0,'x',0,...]"]
end
I -->|"data index"| E
I -->|"string index"| S
E -->|"string index"| S
When ZIR is cached to disk, it's prefixed with a Header that records the lengths of all three arrays plus file metadata (inode, size, mtime) for cache invalidation.
This flat representation has important performance properties. Unlike a tree, ZIR can be processed sequentially by Sema with excellent cache locality. There are no pointer chases — every reference is a u32 index. And the entire ZIR for a file can be serialized to and from disk as a contiguous blob.
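A toy model of the three stores shows how index-only references work. Everything here is hypothetical: `MiniZir`, its two tags, a single `u32` payload per instruction, and the assumed `block` payload layout (a length followed by instruction indices) are illustrative and are not the real `Zir` encoding.

```zig
const std = @import("std");

// Hypothetical miniature of the three stores; not the real Zir layout.
const Tag = enum(u8) { str, block };

const MiniZir = struct {
    tags: []const Tag, // instructions, tag column
    data: []const u32, // instructions, data column
    extra: []const u32,
    string_bytes: []const u8,

    /// Strings are stored null-terminated and referenced by start index.
    pub fn string(self: MiniZir, index: u32) []const u8 {
        return std.mem.sliceTo(self.string_bytes[index..], 0);
    }

    /// Assumed payload layout for `block`: body length, then that many
    /// instruction indices, all stored in `extra`.
    pub fn blockBody(self: MiniZir, inst: u32) []const u32 {
        std.debug.assert(self.tags[inst] == .block);
        const payload = self.data[inst];
        const len = self.extra[payload];
        return self.extra[payload + 1 ..][0..len];
    }
};

const example: MiniZir = .{
    .tags = &.{ .str, .str, .block },
    .data = &.{ 0, 5, 0 }, // string index, string index, extra index
    .extra = &.{ 2, 0, 1 }, // block body: 2 instructions, %0 and %1
    .string_bytes = "main\x00x\x00",
};
```

Since every cross-reference is an integer offset into one of the arrays, the whole structure can be written out and read back without fixing up any pointers.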
Tip: ZIR is self-contained. Once generated, it holds everything Sema needs — no references back to the AST, token list, or source bytes. This is stated explicitly in the file header comment: "machine code, without any memory access into the AST tree token list, node list, or source bytes."
Triggering AstGen: From File Update to ZIR
The connection between the frontend and the compiler happens in src/Zcu/PerThread.zig. The updateFile() function coordinates the entire frontend pipeline for a single source file:
sequenceDiagram
participant Comp as Compilation
participant PT as PerThread
participant FS as FileSystem
participant FE as Frontend (lib/std/zig/)
Comp->>PT: updateFile(file_index, file)
PT->>FS: stat + open source file
PT->>FE: tokenize (source → tokens)
FE-->>PT: Token list
PT->>FE: parse (tokens → AST)
FE-->>PT: Ast
PT->>FE: AstGen (AST → ZIR)
FE-->>PT: Zir
PT->>PT: Cache ZIR to disk
The updateFile function first checks the file's stat to see if the source has changed since the last compilation. If the file hasn't changed and cached ZIR exists, it can skip the entire frontend pipeline. This is the first layer of incremental compilation — before Sema even gets involved, unchanged files reuse their cached ZIR.
When AstGen does run, the generated ZIR is written to a ZIR cache directory tracked on the Zcu (global_zir_cache or local_zir_cache). On subsequent compilations, the cache check happens before tokenization, so unchanged files skip tokenization, parsing, and AstGen entirely and pay only for the stat and the cache load.
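The freshness check can be pictured as a plain comparison of file metadata. This is a sketch under stated assumptions, not the logic in updateFile: the `FileStat` struct, its field types, and `canReuseCachedZir` are hypothetical names built around the inode, size, and mtime fields mentioned earlier.

```zig
const std = @import("std");

// Hypothetical subset of the metadata stored alongside cached ZIR.
const FileStat = struct {
    inode: u64,
    size: u64,
    mtime: i128,
};

/// Returns true when the cached ZIR can be reused, i.e. a cache entry
/// exists and the source file's metadata is unchanged.
pub fn canReuseCachedZir(cached: ?FileStat, current: FileStat) bool {
    const prev = cached orelse return false;
    return prev.inode == current.inode and
        prev.size == current.size and
        prev.mtime == current.mtime;
}
```

A metadata comparison like this is cheap because it never reads the file contents; the trade-off is that it relies on the filesystem reporting changes faithfully.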
What's Next
We've now traced source bytes through tokenization, parsing, and AstGen into ZIR. In Article 3, we'll cross the boundary from lib/std/zig/ into src/ and explore Sema — the 37K-line heart of the compiler that transforms untyped ZIR into fully-typed AIR. We'll also dive deep into the InternPool, the universal store where every type and value in the entire compilation lives as a u32 index.