Semantic Analysis and the InternPool: The Heart of the Compiler
Prerequisites
- Article 1: Architecture of the Zig Compiler
- Article 2: From Source Code to ZIR
- Understanding of type systems and type inference concepts
- Familiarity with comptime evaluation in Zig
Sema, short for semantic analysis, is where untyped ZIR acquires types and meaning. This phase resolves types, evaluates comptime expressions, performs type checking, and produces AIR (Analyzed IR). At roughly 37,000 lines, src/Sema.zig is the single largest file in the compiler, and its opening comment says it plainly: "This is the heart of the Zig compiler."
But Sema doesn't work alone. It relies on the InternPool — a universal store for all types and values, both represented as u32 indices. This design decision, unifying types and values into a single interning pool, is one of the most distinctive architectural choices in the Zig compiler.
Sema's Role and Core State
Sema transforms untyped ZIR into typed AIR. It's instantiated per-function (or per-comptime block) and lives on the stack. Here are its key fields from src/Sema.zig#L41-L77:
pt: Zcu.PerThread, // thread-safe Zcu access
gpa: Allocator, // general purpose allocator
arena: Allocator, // temporary arena, cleared on Sema destroy
code: Zir, // input ZIR
air_instructions: std.MultiArrayList(Air.Inst) = .{}, // output AIR
air_extra: std.ArrayList(u32) = .empty,
inst_map: InstMap = .{}, // maps ZIR indices → AIR indices
owner: AnalUnit, // the root entity being analyzed
func_index: InternPool.Index, // current function being analyzed
fn_ret_ty: Type, // return type of current function
branch_quota: u32 = default_branch_quota,
branch_count: u32 = 0,
classDiagram
    class Sema {
        +pt: PerThread
        +code: Zir
        +air_instructions: MultiArrayList
        +inst_map: InstMap
        +owner: AnalUnit
        +func_index: Index
        +fn_ret_ty: Type
        +branch_quota: u32
        +analyzeBodyInner()
    }
    class Zir {
        +instructions: Slice
        +extra: []u32
        +string_bytes: []u8
    }
    class Air {
        +instructions: Slice
        +extra: ArrayList
    }
    Sema --> Zir : reads
    Sema --> Air : writes
The inst_map is crucial — it maps ZIR instruction indices to their corresponding AIR instruction references. When a ZIR instruction references another instruction (e.g., %5 references the result of %3), Sema uses inst_map to find the AIR equivalent.
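The lookup described above can be sketched with a toy map. This is an illustration only: `ToyInstMap`, `ZirIndex`, and `AirRef` are hypothetical names, and a hash map is used for brevity, which is not necessarily how the real `InstMap` is implemented.

```zig
const std = @import("std");

// Hypothetical stand-ins for a ZIR instruction index and an AIR reference.
const ZirIndex = u32;
const AirRef = u32;

/// Toy instruction map: remembers which AIR ref each analyzed ZIR
/// instruction produced. An assumption for illustration, not Sema's InstMap.
const ToyInstMap = struct {
    map: std.AutoHashMap(ZirIndex, AirRef),

    fn init(gpa: std.mem.Allocator) ToyInstMap {
        return .{ .map = std.AutoHashMap(ZirIndex, AirRef).init(gpa) };
    }

    fn deinit(self: *ToyInstMap) void {
        self.map.deinit();
    }

    /// Record the AIR result of analyzing ZIR instruction `zir`.
    fn put(self: *ToyInstMap, zir: ZirIndex, air: AirRef) !void {
        try self.map.put(zir, air);
    }

    /// Resolve an operand; null means `zir` has not been analyzed yet.
    fn resolve(self: *const ToyInstMap, zir: ZirIndex) ?AirRef {
        return self.map.get(zir);
    }
};
```

When a later instruction such as %5 names %3 as an operand, a handler would call something like `resolve(3)` to obtain the AIR reference to use.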
The owner: AnalUnit identifies what is being analyzed. It never changes during a Sema's lifetime, even if an inline function call causes analysis of a different function's body.
The analyzeBodyInner Dispatch Loop
The core of Sema is analyzeBodyInner, the main dispatch loop starting at src/Sema.zig#L1154. It iterates through ZIR instructions and dispatches each one to a handler:
flowchart TD
    LOOP["analyzeBodyInner loop"] --> READ["Read ZIR instruction tag"]
    READ --> SWITCH["inst: switch(tags[inst])"]
    SWITCH -->|".alloc"| A["zirAlloc()"]
    SWITCH -->|".bit_and"| B["zirBitwise()"]
    SWITCH -->|".bitcast"| C["zirBitcast()"]
    SWITCH -->|".call"| D["zirCall()"]
    SWITCH -->|".cmp_lt"| E["zirCmp()"]
    SWITCH -->|"..."| F["200+ other handlers"]
    A --> AIR["Produce AIR instruction"]
    B --> AIR
    C --> AIR
    D --> AIR
    E --> AIR
    AIR --> NEXT["i += 1; continue loop"]
The dispatch uses Zig's labeled switch pattern (inst: switch (tags[...])), which lets certain handlers re-dispatch with continue :inst without incrementing the index. This is used for control flow transformations where a single ZIR instruction may need multiple passes.
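The labeled switch pattern is easier to see in miniature. The sketch below is a toy interpreter, not Sema code: the `Tag` enum and `run` function are hypothetical, and it assumes a Zig version with labeled switch support (0.14 or later).

```zig
const std = @import("std");

/// Hypothetical mini instruction set, not the real ZIR tags.
const Tag = enum { push_one, double, double_twice, halt };

/// Runs a toy program and returns the accumulator.
/// The program must end with `.halt`.
fn run(program: []const Tag) u32 {
    var acc: u32 = 0;
    var i: usize = 0;
    inst: switch (program[i]) {
        .push_one => {
            acc += 1;
            i += 1;
            continue :inst program[i];
        },
        .double => {
            acc *= 2;
            i += 1;
            continue :inst program[i];
        },
        // Re-dispatch: do half the work here, then jump to the `.double`
        // prong for the rest without advancing `i` in between.
        .double_twice => {
            acc *= 2;
            continue :inst Tag.double;
        },
        .halt => return acc,
    }
}
```

The `.double_twice` prong shows the re-dispatch idea: `continue :inst` jumps straight to another prong with a new operand, so one source instruction is handled in two steps while the index moves only once.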
Each handler function follows a consistent pattern:
- Extract operands from the ZIR instruction (often via extraData)
- Resolve operand types (looking up inst_map entries)
- Perform type checking and coercions
- If comptime-evaluable, compute the result at compile time
- Otherwise, emit an AIR instruction
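The last two steps, folding at comptime versus emitting an instruction, can be sketched as follows. Everything here is hypothetical (`Operand`, `ToyAir`, `toyZirAdd`); it omits operand extraction, type checking, and coercion, and uses a fixed 16-slot buffer in place of a growable instruction list.

```zig
const std = @import("std");

/// A resolved operand: a comptime-known value or a reference to a
/// runtime instruction. Hypothetical, much simpler than Sema's types.
const Operand = union(enum) {
    comptime_value: i64,
    runtime_ref: u32,
};

/// Toy output buffer standing in for the AIR instruction list.
const ToyAir = struct {
    insts: [16]struct { lhs: Operand, rhs: Operand } = undefined,
    len: u32 = 0,
};

/// Toy handler for an integer add.
fn toyZirAdd(air: *ToyAir, lhs: Operand, rhs: Operand) Operand {
    // Comptime path: both operands are known, so fold and emit nothing.
    if (lhs == .comptime_value and rhs == .comptime_value) {
        return .{ .comptime_value = lhs.comptime_value + rhs.comptime_value };
    }
    // Runtime path: append an instruction and return a reference to it.
    const index = air.len;
    air.insts[index] = .{ .lhs = lhs, .rhs = rhs };
    air.len += 1;
    return .{ .runtime_ref = index };
}
```

Adding two comptime-known operands leaves the buffer untouched, while any runtime operand forces an instruction to be recorded.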
The handler names follow a convention: zirAlloc, zirBitwise, zirCall, etc. — always prefixed with zir and named after the ZIR instruction tag they handle. A quick grep for fn zir reveals over 200 handler functions.
Tip: When debugging a semantic analysis issue, find the ZIR instruction tag (you can dump ZIR with zig dump-zir) and search for the corresponding zir* handler function in Sema.zig.
The InternPool: Universal Type and Value Store
The InternPool at src/InternPool.zig is one of the most important data structures in the compiler. Its opening comment is concise: "All interned objects have both a value and a type. This data structure is self-contained."
The key design decision: types and values are the same thing. Both are u32 indices into the InternPool. The type u32 is an index. The value 42 is an index. The type *const u8 is an index. This unification simplifies the entire compiler — there's one lookup mechanism, one interning mechanism, one equality comparison (just compare indices).
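A minimal interning sketch makes the "compare indices" claim concrete. `ToyPool` and its two-variant `Key` are assumptions for illustration; the real InternPool has a far richer key type and storage layout.

```zig
const std = @import("std");

/// Toy index type: types and values share this one index space.
const Index = enum(u32) { _ };

/// Toy key describing what an index stands for (hypothetical variants).
const Key = union(enum) {
    int_type: u16, // bit width of an unsigned integer type
    int_value: struct { ty: Index, x: u64 },
};

const ToyPool = struct {
    items: std.AutoHashMap(Key, Index),

    fn init(gpa: std.mem.Allocator) ToyPool {
        return .{ .items = std.AutoHashMap(Key, Index).init(gpa) };
    }

    fn deinit(self: *ToyPool) void {
        self.items.deinit();
    }

    /// Returns the existing index for `key`, or assigns the next free one.
    fn get(self: *ToyPool, key: Key) !Index {
        const gop = try self.items.getOrPut(key);
        if (!gop.found_existing) {
            // The new entry is already counted, so count - 1 is its slot.
            gop.value_ptr.* = @enumFromInt(self.items.count() - 1);
        }
        return gop.value_ptr.*;
    }
};
```

Interning the same key twice yields the same index, so equality of two interned objects reduces to an integer comparison, and a value's key can refer to its type by index.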
The pool is sharded for concurrent access:
locals: []Local, // one per thread, indexed by tid
shards: []Shard, // power-of-two count for concurrent writers
Each Local has its own allocation arena, and the Shard array uses locking to allow multiple threads to intern values simultaneously. The tid_shift_30, tid_shift_31, and tid_shift_32 fields are cached bit-shift amounts that embed the thread ID into the top bits of indices, ensuring each thread produces globally-unique indices without coordination.
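The idea of embedding a thread ID in the top bits can be sketched with a fixed bit split. The 2-bit width and the helper names below are assumptions chosen for illustration; they are not taken from InternPool.zig.

```zig
const std = @import("std");

// Hypothetical split: top 2 bits hold the thread ID, low 30 bits hold
// the thread-local index.
const tid_bits = 2;
const tid_shift: u5 = 32 - tid_bits;
const local_mask: u32 = (@as(u32, 1) << tid_shift) - 1;

/// Combine a thread ID and a thread-local index into one global index.
fn packIndex(tid: u32, local_index: u32) u32 {
    std.debug.assert(tid < (1 << tid_bits));
    std.debug.assert(local_index <= local_mask);
    return (tid << tid_shift) | local_index;
}

/// Recover the thread ID from a global index.
fn tidOf(index: u32) u32 {
    return index >> tid_shift;
}

/// Recover the thread-local index from a global index.
fn localOf(index: u32) u32 {
    return index & local_mask;
}
```

Because the thread ID occupies its own bits, two threads that each hand out local index 5 still produce different global indices, with no shared counter between them.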
Pre-Interned Types
The Index enum starts with a block of pre-interned common types that don't require any lookup:
pub const Index = enum(u32) {
    u0_type, i0_type, u1_type,
    u8_type, i8_type, u16_type, i16_type,
    u32_type, i32_type, u64_type, i64_type,
    // ... many more ...
    bool_type, void_type, type_type,
    anyerror_type, comptime_int_type, noreturn_type,
    // ...
};
These are known at compile time and embedded directly in the enum. Checking if a type is bool is a single integer comparison: index == .bool_type. No hash lookup, no indirection.
Type and Value: Thin Wrappers Around InternPool.Index
To provide ergonomic APIs, the compiler wraps InternPool.Index in two newtype structs:
// Type (src/Type.zig)
ip_index: InternPool.Index,
pub fn zigTypeTag(ty: Type, zcu: *const Zcu) std.builtin.TypeId {
    return zcu.intern_pool.zigTypeTag(ty.toIntern());
}
// Value (src/Value.zig)
ip_index: InternPool.Index,
Both are exactly one u32 in size. They never copy data out of the pool — they just provide methods that look up properties via the InternPool. This is critical for performance: passing a Type around is passing a single integer, and two types are equal if and only if their indices are equal.
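The wrapper idea, combined with the pre-interned indices from the previous section, can be sketched as below. The three-entry `Index` enum and the `isBool` and `eql` helpers are hypothetical simplifications, not the compiler's actual API.

```zig
const std = @import("std");

/// Toy index enum with a few pre-interned entries (an assumption; the
/// real enum is much larger).
const Index = enum(u32) {
    u8_type,
    bool_type,
    void_type,
    _,
};

/// Toy wrapper in the spirit of the compiler's Type struct.
const Type = struct {
    ip_index: Index,

    fn fromInterned(i: Index) Type {
        return .{ .ip_index = i };
    }

    fn toIntern(ty: Type) Index {
        return ty.ip_index;
    }

    /// Two types are equal exactly when their indices are equal.
    fn eql(a: Type, b: Type) bool {
        return a.ip_index == b.ip_index;
    }

    /// Hypothetical helper: a single integer comparison, no lookup.
    fn isBool(ty: Type) bool {
        return ty.ip_index == .bool_type;
    }
};

comptime {
    // The wrapper adds no storage beyond the index itself.
    std.debug.assert(@sizeOf(Type) == @sizeOf(u32));
}
```

The comptime assertion checks the size claim at build time: if the wrapper ever grew beyond one u32, compilation would fail.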
classDiagram
    class InternPool {
        +locals: []Local
        +shards: []Shard
        +Index: enum(u32)
        +zigTypeTag(Index) TypeId
        +typeOf(Index) Index
        +indexToKey(Index) Key
    }
    class Type {
        +ip_index: Index
        +zigTypeTag(Zcu) TypeId
        +abiSize(Zcu) u64
    }
    class Value {
        +ip_index: Index
        +typeOf(Zcu) Type
        +isUndef(Zcu) bool
    }
    Type --> InternPool : wraps Index
    Value --> InternPool : wraps Index
Tip: When reading Sema code, you'll often see .toIntern() and .fromInterned(). These convert between the wrapper types and raw InternPool.Index values.
Navs and AnalUnits: Granularity of Compilation
Two concepts define the granularity of work in the compiler: Nav (Named Addressable Value) and AnalUnit (Analysis Unit).
An AnalUnit is a packed u64 combining a Kind and an id:
pub const AnalUnit = packed struct(u64) {
    kind: Kind,
    id: u32,
    pub const Kind = enum(u32) {
        @"comptime", nav_val, nav_ty, type, func, memoized_state,
    };
};
Each AnalUnit represents one unit of semantic analysis work. A function body analysis, a comptime block evaluation, a type resolution, a Nav value resolution — each is a distinct AnalUnit. These are the nodes in the dependency graph that powers incremental compilation (covered in Article 5).
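One consequence of the packed layout is that a whole AnalUnit can be viewed as a single 64-bit integer. The sketch below repeats the struct shown above and adds a `toInt` helper, which is a hypothetical name used only for this illustration.

```zig
const std = @import("std");

const AnalUnit = packed struct(u64) {
    kind: Kind,
    id: u32,

    const Kind = enum(u32) {
        @"comptime", nav_val, nav_ty, type, func, memoized_state,
    };

    /// Hypothetical helper: reinterpret the unit as one integer, which
    /// makes it cheap to compare or to use as a map key.
    fn toInt(unit: AnalUnit) u64 {
        return @bitCast(unit);
    }
};
```

In a packed struct the first field occupies the least significant bits, so here `kind` sits in the low 32 bits and `id` in the high 32 bits.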
A Nav represents a named declaration with a three-state lifecycle:
stateDiagram-v2
    [*] --> unresolved: Declaration discovered
    unresolved --> type_resolved: Type analysis complete
    type_resolved --> fully_resolved: Value analysis complete
    fully_resolved --> [*]: Sent to linker
The Nav struct stores the name, fully-qualified name, optional analysis info (namespace + ZIR index), and a status union that transitions from unresolved → type_resolved → fully_resolved. Only fully-resolved Navs with runtime types get sent to the linker.
This two-phase resolution (type first, then value) is important: it allows the compiler to resolve all types before codegen starts, breaking circular dependencies that would arise if type and value resolution were interleaved.
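The lifecycle described above maps naturally onto a tagged union. This is a toy model under stated assumptions: `ToyNav`, its methods, and the plain u32 payloads (standing in for InternPool indices) are hypothetical, and the real Nav stores more than this.

```zig
const std = @import("std");

/// Toy status union following the three states named in the text.
const Status = union(enum) {
    unresolved,
    type_resolved: struct { ty: u32 },
    fully_resolved: struct { ty: u32, val: u32 },
};

const ToyNav = struct {
    name: []const u8,
    status: Status = .unresolved,

    /// Phase one: record the declaration's type.
    fn resolveType(nav: *ToyNav, ty: u32) void {
        std.debug.assert(nav.status == .unresolved);
        nav.status = .{ .type_resolved = .{ .ty = ty } };
    }

    /// Phase two: record the value, keeping the already-resolved type.
    fn resolveValue(nav: *ToyNav, val: u32) void {
        // Accessing `.type_resolved` is safety-checked in safe builds:
        // it panics if the type has not been resolved yet.
        const ty = nav.status.type_resolved.ty;
        nav.status = .{ .fully_resolved = .{ .ty = ty, .val = val } };
    }
};
```

Code that only needs a declaration's type can proceed as soon as the status reaches `type_resolved`, without waiting for the value.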
AIR: The Typed Intermediate Representation
AIR is the output of Sema. Defined in src/Air.zig, it has the same flat structure as ZIR — instructions as a MultiArrayList, plus an extra array — but with critical differences:
- Every reference is typed. ZIR might say "add these two things"; AIR says "add these two i32 values, producing an i32."
- One AIR per function. ZIR covers an entire file; AIR is scoped to a single function.
- Comptime is resolved. if (comptime_condition) branches are evaluated; dead branches are eliminated.
- Generic instantiation is complete. Each concrete instantiation gets its own AIR.
The AIR instruction tags (at line 38 of src/Air.zig) are explicitly typed operations. Addition alone has five variants, each with precise semantics: add, add_safe, add_optimized, add_wrap, and add_sat. ZIR, by contrast, has fewer, more general instructions.
flowchart LR
    subgraph "ZIR (untyped)"
        Z1[".add %3, %5"]
    end
    subgraph "Sema"
        S["Type check\nCoerce operands\nChoose AIR tag"]
    end
    subgraph "AIR (typed)"
        A1[".add_safe %3:i32, %5:i32 → i32"]
    end
    Z1 --> S --> A1
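The "Choose AIR tag" step can be sketched as a small decision function. The selection rules below are assumptions made for illustration (the article does not spell out Sema's actual logic), and `SourceOp` and `chooseAddTag` are hypothetical names; only the five AIR tag names come from the text.

```zig
const std = @import("std");

/// The five AIR addition tags named in the text.
const AirAddTag = enum { add, add_safe, add_optimized, add_wrap, add_sat };

/// Hypothetical classification of the source-level addition.
const SourceOp = enum { plain, wrapping, saturating };

/// Assumed selection rules, not Sema's actual implementation.
fn chooseAddTag(op: SourceOp, float_optimized: bool, want_safety: bool) AirAddTag {
    return switch (op) {
        .wrapping => .add_wrap,
        .saturating => .add_sat,
        .plain => if (float_optimized)
            .add_optimized
        else if (want_safety)
            .add_safe
        else
            .add,
    };
}
```

The point of the sketch is that one general source operation fans out into several precise AIR tags once Sema knows the operand types and the build settings.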
AIR is also accompanied by liveness information, computed by a separate pass in Air/Liveness.zig, that tells codegen which values are still alive at each instruction. Backends can use it for register allocation without each running its own liveness analysis.
What's Next
We've now covered the transformation from ZIR to AIR — the core of the compiler. In Article 4, we'll follow AIR into the backend half: code generation (AIR → MIR → machine code) and linking. We'll examine the two-phase codegen approach, the AnyMir union that bridges backend-specific representations, and the self-hosted linkers that assemble the final binary.