Semantic Analysis and the InternPool: The Heart of the Compiler

Advanced

Prerequisites

  • Article 1: Architecture of the Zig Compiler
  • Article 2: From Source Code to ZIR
  • Understanding of type systems and type inference concepts
  • Familiarity with comptime evaluation in Zig

Sema — short for semantic analysis — is where ZIR becomes real. It's the phase that resolves types, evaluates comptime expressions, performs type checking, and produces AIR (Analyzed IR). At roughly 37,000 lines, src/Sema.zig is the single largest file in the compiler, and its opening comment says it plainly: "This is the heart of the Zig compiler."

But Sema doesn't work alone. It relies on the InternPool — a universal store for all types and values, both represented as u32 indices. This design decision, unifying types and values into a single interning pool, is one of the most distinctive architectural choices in the Zig compiler.

Sema's Role and Core State

Sema transforms untyped ZIR into typed AIR. It's instantiated per-function (or per-comptime block) and lives on the stack. Here are its key fields from src/Sema.zig#L41-L77:

pt: Zcu.PerThread,           // thread-safe Zcu access
gpa: Allocator,              // general purpose allocator
arena: Allocator,            // temporary arena, cleared on Sema destroy
code: Zir,                   // input ZIR
air_instructions: std.MultiArrayList(Air.Inst) = .{},  // output AIR
air_extra: std.ArrayList(u32) = .empty,
inst_map: InstMap = .{},     // maps ZIR indices → AIR indices
owner: AnalUnit,             // the root entity being analyzed
func_index: InternPool.Index, // current function being analyzed
fn_ret_ty: Type,             // return type of current function
branch_quota: u32 = default_branch_quota,
branch_count: u32 = 0,

classDiagram
    class Sema {
        +pt: PerThread
        +code: Zir
        +air_instructions: MultiArrayList
        +inst_map: InstMap
        +owner: AnalUnit
        +func_index: Index
        +fn_ret_ty: Type
        +branch_quota: u32
        +analyzeBodyInner()
    }
    class Zir {
        +instructions: Slice
        +extra: []u32
        +string_bytes: []u8
    }
    class Air {
        +instructions: Slice
        +extra: ArrayList
    }
    Sema --> Zir : reads
    Sema --> Air : writes

The inst_map is crucial — it maps ZIR instruction indices to their corresponding AIR instruction references. When a ZIR instruction references another instruction (e.g., %5 references the result of %3), Sema uses inst_map to find the AIR equivalent.
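
As a rough illustration of that lookup, the sketch below models inst_map as a plain array indexed by ZIR instruction number. The names ZirIndex, AirRef, and TinyInstMap are hypothetical stand-ins invented for this example; the real InstMap in Sema.zig is more involved.

```zig
const std = @import("std");

// Hypothetical stand-ins for Zir.Inst.Index and Air.Inst.Ref.
const ZirIndex = u32;
const AirRef = enum(u32) { none = std.math.maxInt(u32), _ };

// A toy inst_map: slot i holds the AIR ref produced for ZIR instruction %i.
const TinyInstMap = struct {
    slots: [8]AirRef = [_]AirRef{.none} ** 8,

    fn put(map: *TinyInstMap, zir: ZirIndex, air: AirRef) void {
        map.slots[zir] = air;
    }

    // Returns null when the ZIR instruction has not been analyzed yet.
    fn get(map: *const TinyInstMap, zir: ZirIndex) ?AirRef {
        const ref = map.slots[zir];
        return if (ref == .none) null else ref;
    }
};

test "resolve a ZIR operand through the map" {
    var map: TinyInstMap = .{};
    // Suppose analyzing %3 produced AIR instruction 10.
    map.put(3, @enumFromInt(10));
    // Later, %5 names %3 as an operand; its handler looks up the AIR ref.
    const operand = map.get(3).?;
    try std.testing.expectEqual(@as(u32, 10), @intFromEnum(operand));
}
```

The point of the sketch is only the direction of the mapping: handlers never pass ZIR indices downstream, they translate them first.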

The owner: AnalUnit identifies what is being analyzed. It never changes during a Sema's lifetime, even if an inline function call causes analysis of a different function's body.

The analyzeBodyInner Dispatch Loop

The core of Sema is analyzeBodyInner, the main dispatch loop starting at src/Sema.zig#L1154. It iterates through ZIR instructions and dispatches each one to a handler:

flowchart TD
    LOOP["analyzeBodyInner loop"] --> READ["Read ZIR instruction tag"]
    READ --> SWITCH["inst: switch(tags[inst])"]
    SWITCH -->|".alloc"| A["zirAlloc()"]
    SWITCH -->|".bit_and"| B["zirBitwise()"]
    SWITCH -->|".bitcast"| C["zirBitcast()"]
    SWITCH -->|".call"| D["zirCall()"]
    SWITCH -->|".cmp_lt"| E["zirCmp()"]
    SWITCH -->|"..."| F["200+ other handlers"]
    A --> AIR["Produce AIR instruction"]
    B --> AIR
    C --> AIR
    D --> AIR
    E --> AIR
    AIR --> NEXT["i += 1; continue loop"]

The dispatch uses Zig's labeled switch pattern (inst: switch (tags[...])) which allows certain handlers to continue :inst to re-dispatch without incrementing the index. This is used for control flow transformations where a single ZIR instruction may need multiple passes.
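
For readers who have not used labeled switches (available since Zig 0.14), here is a minimal sketch of the pattern. The Tag enum and the run function are hypothetical and unrelated to real ZIR tags; the example only shows how continue :label can either advance to the next instruction or re-dispatch on a different tag without advancing.

```zig
const std = @import("std");

// Hypothetical mini instruction set; these are not real ZIR tags.
const Tag = enum { push_one, double, redirect_to_double, halt };

// Interprets `body` with a labeled switch. The body must end with `.halt`.
fn run(body: []const Tag) u32 {
    var acc: u32 = 0;
    var i: usize = 0;
    const result: u32 = dispatch: switch (body[i]) {
        .push_one => {
            acc += 1;
            i += 1;
            continue :dispatch body[i];
        },
        .double => {
            acc *= 2;
            i += 1;
            continue :dispatch body[i];
        },
        // Re-dispatch as a different tag without advancing `i`.
        .redirect_to_double => continue :dispatch .double,
        .halt => acc,
    };
    return result;
}

test "labeled switch dispatch" {
    const body = [_]Tag{ .push_one, .redirect_to_double, .push_one, .halt };
    try std.testing.expectEqual(@as(u32, 3), run(&body));
}
```

In the test, the accumulator goes 1, then 2 (the redirect is handled by the .double prong, which also advances the cursor), then 3.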

Each handler function follows a consistent pattern:

  1. Extract operands from the ZIR instruction (often via extraData)
  2. Resolve operand types (looking up inst_map entries)
  3. Perform type checking and coercions
  4. If comptime-evaluable, compute the result at compile time
  5. Otherwise, emit an AIR instruction
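
Steps 4 and 5 are the part that most often surprises newcomers, so here is a small sketch of the fold-or-emit decision. Everything in it (Operand, AirInst, ToySema, analyzeAdd) is a hypothetical model made up for this example, not the compiler's actual API, and it skips type checking and coercion entirely.

```zig
const std = @import("std");

// Hypothetical operand model: either a comptime-known value or a reference
// to the runtime AIR instruction that produces the value.
const Operand = union(enum) {
    comptime_int: i64,
    runtime: u32,
};

const AirInst = struct { lhs: Operand, rhs: Operand };

const ToySema = struct {
    air: [16]AirInst = undefined,
    air_len: u32 = 0,

    // Fold when both operands are comptime-known, otherwise emit.
    fn analyzeAdd(sema: *ToySema, lhs: Operand, rhs: Operand) Operand {
        if (lhs == .comptime_int and rhs == .comptime_int) {
            return .{ .comptime_int = lhs.comptime_int + rhs.comptime_int };
        }
        const index = sema.air_len;
        sema.air[index] = .{ .lhs = lhs, .rhs = rhs };
        sema.air_len += 1;
        return .{ .runtime = index };
    }
};

test "comptime operands fold without emitting AIR" {
    var sema: ToySema = .{};
    const folded = sema.analyzeAdd(.{ .comptime_int = 2 }, .{ .comptime_int = 3 });
    try std.testing.expectEqual(@as(i64, 5), folded.comptime_int);
    try std.testing.expectEqual(@as(u32, 0), sema.air_len);
}
```

A mixed call, with one comptime operand and one runtime operand, takes the second path and appends one instruction.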

The handler names follow a convention: zirAlloc, zirBitwise, zirCall, etc. — always prefixed with zir and named after the ZIR instruction tag they handle. A quick grep for fn zir reveals over 200 handler functions.

Tip: When debugging a semantic analysis issue, find the ZIR instruction tag (you can dump ZIR with zig dump-zir, a debug command available in compiler builds with debug extensions enabled) and search for the corresponding zir* handler function in Sema.zig.

The InternPool: Universal Type and Value Store

The InternPool at src/InternPool.zig is one of the most important data structures in the compiler. Its opening comment is concise: "All interned objects have both a value and a type. This data structure is self-contained."

The key design decision: types and values share a single representation. Both are u32 indices into the InternPool. The type u32 is an index. The value 42 is an index. The type *const u8 is an index. This unification simplifies the entire compiler: there is one lookup mechanism, one interning mechanism, and one equality comparison. Because interning deduplicates entries, two objects are equal exactly when their indices are equal.
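
To make the interning idea concrete, the following sketch stores types and values in one array and hands back the same index for equal keys. Key, Index, and TinyPool are simplified, hypothetical versions written for this example; the linear scan stands in for the hashing a real pool would use.

```zig
const std = @import("std");

const Index = enum(u32) { _ };

// Hypothetical key type; the real InternPool.Key covers many more cases.
const Key = union(enum) {
    int_type: struct { signed: bool, bits: u16 },
    int_value: struct { ty: Index, value: i64 },
};

const TinyPool = struct {
    items: [32]Key = undefined,
    len: u32 = 0,

    // Returns the existing index for an equal key, or appends a new entry.
    fn intern(pool: *TinyPool, key: Key) Index {
        for (pool.items[0..pool.len], 0..) |existing, i| {
            if (std.meta.eql(existing, key)) {
                return @enumFromInt(@as(u32, @intCast(i)));
            }
        }
        pool.items[pool.len] = key;
        pool.len += 1;
        return @enumFromInt(pool.len - 1);
    }
};

test "equal keys intern to the same index" {
    var pool: TinyPool = .{};
    const a = pool.intern(.{ .int_type = .{ .signed = false, .bits = 32 } });
    const b = pool.intern(.{ .int_type = .{ .signed = false, .bits = 32 } });
    try std.testing.expect(a == b);
}
```

Note that the value key refers to its type by Index, so a value and its type live in the same pool and are compared the same way.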

The pool is sharded for concurrent access:

locals: []Local,   // one per thread, indexed by tid
shards: []Shard,   // power-of-two count for concurrent writers

Each Local has its own allocation arena, and the Shard array uses locking to allow multiple threads to intern values simultaneously. The tid_shift_30, tid_shift_31, and tid_shift_32 fields are cached bit-shift amounts that embed the thread ID into the top bits of indices, ensuring each thread produces globally-unique indices without coordination.
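
The bit-packing idea can be sketched in a few lines. The fixed 2-bit thread ID width below is an assumption chosen for illustration, and wrap, unwrapTid, and unwrapLocal are hypothetical helper names; the real pool derives its shift amounts from the number of threads.

```zig
const std = @import("std");

// Assumption for illustration: 2 bits of thread ID, leaving 30 bits for the
// per-thread item index.
const tid_width = 2;
const tid_shift = 32 - tid_width;

// Combines a thread ID and a thread-local index into one global index.
fn wrap(tid: u32, local_index: u32) u32 {
    std.debug.assert(tid < (1 << tid_width));
    std.debug.assert(local_index < (1 << tid_shift));
    return (tid << tid_shift) | local_index;
}

fn unwrapTid(index: u32) u32 {
    return index >> tid_shift;
}

fn unwrapLocal(index: u32) u32 {
    return index & ((1 << tid_shift) - 1);
}

test "same local index on different threads yields distinct indices" {
    try std.testing.expect(wrap(0, 7) != wrap(1, 7));
    try std.testing.expectEqual(@as(u32, 0x4000_0007), wrap(1, 7));
}
```

Because the thread ID occupies bits that no local index can reach, two threads can allocate indices independently and never collide.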

Pre-Interned Types

The Index enum starts with a block of pre-interned common types that don't require any lookup:

pub const Index = enum(u32) {
    u0_type, i0_type, u1_type,
    u8_type, i8_type, u16_type, i16_type,
    u32_type, i32_type, u64_type, i64_type,
    // ... many more ...
    bool_type, void_type, type_type,
    anyerror_type, comptime_int_type, noreturn_type,
    // ...
};

These are known at compile time and embedded directly in the enum. Checking if a type is bool is a single integer comparison: index == .bool_type. No hash lookup, no indirection.
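
A trimmed-down sketch shows why this is cheap. The four-entry Index enum and the isBool helper below are illustrative only; the real enum has many more fixed entries.

```zig
const std = @import("std");

// Stand-in for InternPool.Index: a few fixed slots, then open-ended values
// for everything interned while the compiler runs.
const Index = enum(u32) {
    u8_type,
    u32_type,
    bool_type,
    void_type,
    _,
};

fn isBool(ty: Index) bool {
    // Compares two u32 values; no pool access is needed.
    return ty == .bool_type;
}

test "pre-interned entries have fixed integer values" {
    try std.testing.expect(isBool(.bool_type));
    try std.testing.expectEqual(@as(u32, 2), @intFromEnum(Index.bool_type));
}
```

The trailing underscore makes the enum non-exhaustive, which is what lets the same type also name indices created at run time.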

Type and Value: Thin Wrappers Around InternPool.Index

To provide ergonomic APIs, the compiler wraps InternPool.Index in two newtype structs:

src/Type.zig:

ip_index: InternPool.Index,

pub fn zigTypeTag(ty: Type, zcu: *const Zcu) std.builtin.TypeId {
    return zcu.intern_pool.zigTypeTag(ty.toIntern());
}

src/Value.zig:

ip_index: InternPool.Index,

Both are exactly one u32 in size. They never copy data out of the pool — they just provide methods that look up properties via the InternPool. This is critical for performance: passing a Type around is passing a single integer, and two types are equal if and only if their indices are equal.
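
The wrapper shape is easy to reproduce in isolation. The sketch below is a hypothetical miniature, with a two-entry Index and an eql helper added for the example; only the ip_index field and the toIntern and fromInterned names come from the compiler source described above.

```zig
const std = @import("std");

const Index = enum(u32) { u32_type, bool_type, _ };

// Miniature version of the wrapper: one field, no copied data.
const Type = struct {
    ip_index: Index,

    fn fromInterned(i: Index) Type {
        return .{ .ip_index = i };
    }

    fn toIntern(ty: Type) Index {
        return ty.ip_index;
    }

    // Hypothetical helper: type equality is index equality.
    fn eql(a: Type, b: Type) bool {
        return a.ip_index == b.ip_index;
    }
};

comptime {
    // The wrapper adds no storage beyond the index itself.
    std.debug.assert(@sizeOf(Type) == @sizeOf(u32));
}

test "wrapping and unwrapping is lossless" {
    const ty = Type.fromInterned(.bool_type);
    try std.testing.expect(ty.toIntern() == .bool_type);
}
```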

classDiagram
    class InternPool {
        +locals: []Local
        +shards: []Shard
        +Index: enum(u32)
        +zigTypeTag(Index) TypeId
        +typeOf(Index) Index
        +indexToKey(Index) Key
    }
    class Type {
        +ip_index: Index
        +zigTypeTag(Zcu) TypeId
        +abiSize(Zcu) u64
    }
    class Value {
        +ip_index: Index
        +typeOf(Zcu) Type
        +isUndef(Zcu) bool
    }
    Type --> InternPool : wraps Index
    Value --> InternPool : wraps Index

Tip: When reading Sema code, you'll often see .toIntern() and .fromInterned() — these convert between the wrapper types and raw InternPool.Index values.

Nav and AnalUnit: The Units of Work

Two concepts define the granularity of work in the compiler: Nav (Named Addressable Value) and AnalUnit (Analysis Unit).

An AnalUnit is a packed u64 combining a Kind and an id:

pub const AnalUnit = packed struct(u64) {
    kind: Kind,
    id: u32,

    pub const Kind = enum(u32) {
        @"comptime", nav_val, nav_ty, type, func, memoized_state,
    };
};

Each AnalUnit represents one unit of semantic analysis work. A function body analysis, a comptime block evaluation, a type resolution, a Nav value resolution — each is a distinct AnalUnit. These are the nodes in the dependency graph that powers incremental compilation (covered in Article 5).
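
Because the struct is packed into exactly 64 bits, a unit can be treated as a single integer, which is convenient for hashing and comparison. The sketch below repeats the declaration and adds a hypothetical toKey helper to show the layout; in a packed struct the first field occupies the least significant bits.

```zig
const std = @import("std");

const AnalUnit = packed struct(u64) {
    kind: Kind,
    id: u32,

    const Kind = enum(u32) {
        @"comptime",
        nav_val,
        nav_ty,
        type,
        func,
        memoized_state,
    };

    // Hypothetical helper: view the whole unit as one integer key.
    fn toKey(unit: AnalUnit) u64 {
        return @bitCast(unit);
    }
};

test "kind sits in the low 32 bits, id in the high 32 bits" {
    const unit: AnalUnit = .{ .kind = .func, .id = 7 };
    // .func is the fifth tag, so its integer value is 4.
    try std.testing.expectEqual((@as(u64, 7) << 32) | 4, unit.toKey());
}
```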

A Nav represents a named declaration with a three-state lifecycle:

stateDiagram-v2
    [*] --> unresolved: Declaration discovered
    unresolved --> type_resolved: Type analysis complete
    type_resolved --> fully_resolved: Value analysis complete
    fully_resolved --> [*]: Sent to linker

The Nav struct stores the name, fully-qualified name, optional analysis info (namespace + ZIR index), and a status union that transitions from unresolved → type_resolved → fully_resolved. Only fully-resolved Navs with runtime types get sent to the linker.

This two-phase resolution (type first, then value) is important: other declarations can depend on a Nav's type without waiting for its value to be analyzed. That breaks dependency cycles that would arise if type and value resolution were interleaved, for example when an initializer refers back to the declaration it initializes.
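
The lifecycle maps naturally onto a tagged union. The sketch below is a hypothetical model: ToyNav, its u32 payloads, and the two resolve functions are invented for illustration, and the real Nav stores InternPool indices plus additional metadata.

```zig
const std = @import("std");

// Hypothetical payloads standing in for interned type and value indices.
const Status = union(enum) {
    unresolved,
    type_resolved: struct { ty: u32 },
    fully_resolved: struct { ty: u32, val: u32 },
};

const ToyNav = struct {
    status: Status = .unresolved,

    // Phase one: the type becomes known; the value is still pending.
    fn resolveType(nav: *ToyNav, ty: u32) void {
        std.debug.assert(nav.status == .unresolved);
        nav.status = .{ .type_resolved = .{ .ty = ty } };
    }

    // Phase two: requires phase one to have run. Accessing the inactive
    // union field is a checked error in safe build modes.
    fn resolveValue(nav: *ToyNav, val: u32) void {
        const ty = nav.status.type_resolved.ty;
        nav.status = .{ .fully_resolved = .{ .ty = ty, .val = val } };
    }
};

test "a nav moves through its states in order" {
    var nav: ToyNav = .{};
    nav.resolveType(11);
    try std.testing.expect(nav.status == .type_resolved);
    nav.resolveValue(42);
    try std.testing.expect(nav.status == .fully_resolved);
}
```

A dependent declaration that only needs the type can proceed as soon as the status reaches type_resolved.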

AIR: The Typed Intermediate Representation

AIR is the output of Sema. Defined in src/Air.zig, it has the same flat structure as ZIR — instructions as a MultiArrayList, plus an extra array — but with critical differences:

  1. Every reference is typed. ZIR might say "add these two things"; AIR says "add these two i32 values, producing an i32."
  2. One AIR per function. ZIR covers an entire file; AIR is scoped to a single function.
  3. Comptime is resolved. if (comptime_condition) branches are evaluated; dead branches are eliminated.
  4. Generic instantiation is complete. Each concrete instantiation gets its own AIR.

The AIR instruction tags (defined starting at line 38 of src/Air.zig) are explicitly typed operations. Addition alone has five: add, add_safe, add_optimized, add_wrap, and add_sat, each with precise semantics. ZIR, by contrast, has fewer, more general instructions.
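
Several of these tags correspond to distinctions that are visible in Zig source. The sketch below exercises the three integer addition operators. The mapping in the comments is an assumption inferred from the tag names, not something stated above: + presumably becomes add or add_safe depending on whether runtime safety is on, +% becomes add_wrap, and +| becomes add_sat.

```zig
const std = @import("std");

// Plain addition: overflow is a checked error in safe build modes.
// Assumed to lower to add or add_safe.
fn addPlain(a: u8, b: u8) u8 {
    return a + b;
}

// Wrapping addition: overflow wraps around. Assumed to lower to add_wrap.
fn addWrap(a: u8, b: u8) u8 {
    return a +% b;
}

// Saturating addition: overflow clamps to the maximum. Assumed to lower
// to add_sat.
fn addSat(a: u8, b: u8) u8 {
    return a +| b;
}

test "three addition operators, three behaviors at the boundary" {
    try std.testing.expectEqual(@as(u8, 5), addPlain(2, 3));
    try std.testing.expectEqual(@as(u8, 0), addWrap(255, 1));
    try std.testing.expectEqual(@as(u8, 255), addSat(255, 1));
}
```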

flowchart LR
    subgraph "ZIR (untyped)"
        Z1[".add %3, %5"]
    end
    subgraph "Sema"
        S["Type check\nCoerce operands\nChoose AIR tag"]
    end
    subgraph "AIR (typed)"
        A1[".add_safe %3:i32, %5:i32 → i32"]
    end
    Z1 --> S --> A1

AIR is accompanied by liveness information, computed in a separate pass over the finished AIR (Air/Liveness.zig), that tells codegen which values are still alive at each instruction. This lets each backend allocate registers without implementing its own liveness analysis.

What's Next

We've now covered the transformation from ZIR to AIR — the core of the compiler. In Article 4, we'll follow AIR into the backend half: code generation (AIR → MIR → machine code) and linking. We'll examine the two-phase codegen approach, the AnyMir union that bridges backend-specific representations, and the self-hosted linkers that assemble the final binary.