From Source to Meaning: The Parser and Semantic Analyzer
Prerequisites
- Understanding of AST concepts and recursive descent parsing
- Articles 1-2: Architecture and Arena/AST Design
- Familiarity with ECMAScript specification structure
A parser transforms a flat string of characters into a structured tree. A semantic analyzer gives that tree meaning — which variables are in scope, which names refer to which declarations, and where control flow can go. In Oxc, these two stages are deliberately separated: the parser focuses purely on syntax, producing an always-valid AST, while the SemanticBuilder handles the heavier work of scope resolution and symbol binding in a second pass. This separation is a performance choice, and understanding it is key to understanding why Oxc is fast.
Parser Architecture Overview
The parser lives in crates/oxc_parser and is a hand-written recursive descent parser — no parser generator involved. The architecture document explains the rationale: hand-written parsers produce faster code, give better error messages, and are easier to debug than generated ones. The trade-off is more implementation effort, but for a project maintaining long-term compatibility with the ever-evolving ECMAScript spec, this is acceptable.
The main entry point is the Parser struct and its parse() method, documented in crates/oxc_parser/src/lib.rs#L1-L66. The API is minimal:
```rust
let parser_return = Parser::new(&allocator, &source_text, source_type).parse();
```
Three inputs (allocator, source text, source type), one output struct. The ParserReturn at lib.rs#L144-L189 contains:
| Field | Type | Purpose |
|---|---|---|
| `program` | `Program<'a>` | The AST (always present, even with errors) |
| `module_record` | `ModuleRecord<'a>` | Import/export declarations |
| `errors` | `Vec<OxcDiagnostic>` | Syntax errors encountered |
| `tokens` | `Vec<'a, Token>` | Optional token list |
| `panicked` | `bool` | Whether the parser aborted early |
```mermaid
sequenceDiagram
    participant S as Source Text
    participant L as Lexer
    participant P as Parser (recursive descent)
    participant A as Allocator
    participant R as ParserReturn
    S->>L: tokenize
    loop For each grammar production
        L->>P: next token
        P->>A: allocate AST node
    end
    P->>R: Program + errors + ModuleRecord
```
Error Recovery
A critical design property: `program` always contains a structurally valid AST, even when there are syntax errors. The parser recovers from errors by skipping tokens until it finds a synchronization point (like a semicolon or closing brace). This means downstream tools like the linter can always operate on the AST — they just need to check `errors` to know whether the AST is semantically reliable.
When the parser encounters an unrecoverable error, it sets `panicked = true` and returns an empty program. But this is rare — most syntax errors are recoverable.
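The skip-to-synchronization-point strategy can be sketched in a few lines. This is a simplified illustration with a hypothetical token enum, not Oxc's actual recovery code:

```rust
// Hypothetical token kinds for illustration; Oxc's real `Kind` enum is far larger.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    Ident,
    Semicolon,
    RBrace,
    Garbage,
    Eof,
}

/// Skip tokens until a synchronization point (`;`, `}`, or EOF),
/// returning the index where parsing can safely resume.
fn recover(tokens: &[Kind], mut pos: usize) -> usize {
    while pos < tokens.len() {
        match tokens[pos] {
            Kind::Semicolon | Kind::RBrace | Kind::Eof => return pos,
            _ => pos += 1, // discard the offending token and keep scanning
        }
    }
    tokens.len()
}

fn main() {
    let tokens = [Kind::Ident, Kind::Garbage, Kind::Garbage, Kind::Semicolon, Kind::Ident];
    // An error at index 1 recovers at the semicolon (index 3),
    // so the statement after it can still be parsed normally.
    assert_eq!(recover(&tokens, 1), 3);
}
```

Because recovery resumes at a statement boundary, one bad statement costs at most the tokens up to the next synchronization point, and the rest of the file still produces real AST nodes.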
Tip: The maximum parseable file size is `u32::MAX` bytes (~4GB) because `Span` uses `u32` offsets. This constant is defined at lib.rs#L113-L119 and is further constrained to `isize::MAX` on 32-bit systems.
The Lexer and Token Cursor
The parser interacts with the lexer through a cursor abstraction defined in crates/oxc_parser/src/cursor.rs. The cursor provides several essential operations:
```rust
// Check current token kind
pub(crate) fn cur_kind(&self) -> Kind { ... }

// Check if current token matches
pub(crate) fn at(&self, kind: Kind) -> bool { ... }

// Advance to next token
pub(crate) fn advance(&mut self, kind: Kind) { ... }

// Get source text for current token
pub(crate) fn cur_src(&self) -> &'a str { ... }
```
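To make these operations concrete, here is a minimal, self-contained cursor over a token slice. This is a sketch of the pattern only: Oxc's real cursor wraps a live lexer rather than a pre-tokenized slice, and the `Kind` enum here is hypothetical:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind { Let, Ident, Eq, Number, Semicolon, Eof }

/// A minimal token cursor mirroring the operations above.
struct Cursor<'a> {
    tokens: &'a [Kind],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Kind of the current token, or EOF past the end.
    fn cur_kind(&self) -> Kind {
        self.tokens.get(self.pos).copied().unwrap_or(Kind::Eof)
    }

    /// Does the current token match `kind`?
    fn at(&self, kind: Kind) -> bool {
        self.cur_kind() == kind
    }

    /// Advance past the current token, asserting it is the expected kind.
    fn advance(&mut self, kind: Kind) {
        debug_assert!(self.at(kind));
        self.pos += 1;
    }
}

fn main() {
    let mut cur = Cursor { tokens: &[Kind::Let, Kind::Ident, Kind::Eq], pos: 0 };
    assert!(cur.at(Kind::Let));
    cur.advance(Kind::Let);
    assert_eq!(cur.cur_kind(), Kind::Ident);
}
```

Grammar-rule methods in a recursive descent parser are built almost entirely out of these three primitives: peek at the current token, branch on its kind, and advance.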
The cursor also supports checkpointing for backtracking. The ParserCheckpoint at cursor.rs#L14-L21 saves the lexer state, current token, span position, and error count — everything needed to rewind:
```rust
pub struct ParserCheckpoint<'a> {
    lexer: LexerCheckpoint<'a>,
    cur_token: Token,
    prev_span_end: u32,
    errors_pos: usize,
    fatal_error: Option<FatalError>,
}
```
Backtracking is necessary for TypeScript's ambiguous syntax. For example, <T> could be a type parameter, a JSX element, or a less-than comparison depending on context. The parser tries one interpretation, and if it fails, rewinds and tries another.
```mermaid
flowchart TD
    A["Source: <T>(x)"] --> B{Try TypeScript cast}
    B -->|Success| C[TypeAssertionExpression]
    B -->|Fail: checkpoint rewind| D{Try JSX element}
    D -->|Success| E[JSXElement]
    D -->|Fail: checkpoint rewind| F[BinaryExpression with <]
```
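The checkpoint/rewind mechanism reduces to saving a small amount of state and truncating the error list on failure. A minimal sketch, with a toy `Parser` in place of Oxc's real one:

```rust
/// Toy parser state: just a position and an error list.
/// (Oxc's checkpoint additionally saves lexer state and span position.)
struct Parser {
    pos: usize,
    errors: Vec<String>,
}

/// Everything needed to rewind a speculative parse.
struct Checkpoint {
    pos: usize,
    errors_pos: usize,
}

impl Parser {
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { pos: self.pos, errors_pos: self.errors.len() }
    }

    /// Restore position and discard errors recorded since the checkpoint,
    /// so a failed speculative parse leaves no trace.
    fn rewind(&mut self, cp: Checkpoint) {
        self.pos = cp.pos;
        self.errors.truncate(cp.errors_pos);
    }
}

fn main() {
    let mut p = Parser { pos: 3, errors: vec![] };
    let cp = p.checkpoint();

    // Speculatively try one interpretation (e.g. a TypeScript cast)...
    p.pos = 9;
    p.errors.push("expected '>'".into());

    // ...it failed, so rewind and try the next interpretation from pos 3.
    p.rewind(cp);
    assert_eq!(p.pos, 3);
    assert!(p.errors.is_empty());
}
```

Truncating the error list is the subtle part: diagnostics produced during a failed speculative parse must not leak into the final `ParserReturn`.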
JS, TypeScript, and JSX Parsing
The parser organizes its grammar rules into three subdirectories:
| Directory | Coverage |
|---|---|
| `crates/oxc_parser/src/js/` | Core ECMAScript (statements, expressions, functions, classes, modules) |
| `crates/oxc_parser/src/ts/` | TypeScript extensions (types, interfaces, enums, decorators) |
| `crates/oxc_parser/src/jsx/` | JSX/TSX syntax |
This organization mirrors the language layering: TypeScript extends JavaScript, and JSX extends both. The TypeScript parsing rules call into the JS rules for shared productions.
An important performance detail from the architecture doc: the parser deliberately avoids doing scope binding or symbol resolution. Things like checking whether a variable has been declared in the current scope, or resolving which declaration an identifier refers to, are all deferred to the semantic analyzer. This keeps the parser focused on one thing — building a syntactically correct AST — and keeps it fast.
SemanticBuilder: Scopes, Symbols, and References
After parsing, the next stage in the pipeline (as we saw in the CompilerInterface in Article 1) is semantic analysis. The SemanticBuilder at crates/oxc_semantic/src/builder.rs#L68-L120 traverses the parsed AST using the Visit trait and constructs three key data structures:
- Scope tree — A tree of lexical scopes with parent pointers and flags
- Symbol table — All declared names (variables, functions, classes, types) with their flags and locations
- Reference list — All identifier references, resolved to their corresponding symbols
```rust
pub struct SemanticBuilder<'a> {
    pub(crate) source_text: &'a str,
    pub(crate) source_type: SourceType,
    pub(crate) errors: RefCell<Vec<OxcDiagnostic>>,
    pub(crate) current_scope_id: ScopeId,
    pub(crate) nodes: AstNodes<'a>,
    pub(crate) scoping: Scoping,
    pub(crate) unresolved_references: UnresolvedReferences<'a>,
    // ...
}
```
The builder implements Visit (the read-only AST visitor we'll detail in Article 4), walking the AST top-down. At each scope-creating node (function, block, for loop, etc.), it pushes a new scope onto the scope tree. At each binding (variable declaration, function parameter, import specifier), it creates a symbol. At each identifier reference, it attempts to resolve the reference to a symbol in an enclosing scope.
```mermaid
sequenceDiagram
    participant AST as Parsed AST
    participant SB as SemanticBuilder
    participant ST as Scope Tree
    participant SYM as Symbol Table
    participant REF as References
    AST->>SB: visit(Program)
    SB->>ST: push_scope(Top)
    AST->>SB: visit(FunctionDeclaration)
    SB->>SYM: declare_symbol("myFunc")
    SB->>ST: push_scope(Function)
    AST->>SB: visit(BindingIdentifier "x")
    SB->>SYM: declare_symbol("x")
    AST->>SB: visit(IdentifierReference "x")
    SB->>REF: resolve("x") → SymbolId
    SB->>ST: pop_scope()
```
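The core of scope-chain resolution is a walk up parent pointers. Here is a compact sketch under simplifying assumptions: it uses `HashMap`s and plain `usize` ids, whereas Oxc uses flat index vectors and typed `ScopeId`/`SymbolId` newtypes:

```rust
use std::collections::HashMap;

/// Minimal scope tree: each scope has an optional parent and a
/// name-to-symbol-id map.
struct ScopeTree {
    parents: Vec<Option<usize>>,
    bindings: Vec<HashMap<String, usize>>,
}

impl ScopeTree {
    fn push_scope(&mut self, parent: Option<usize>) -> usize {
        self.parents.push(parent);
        self.bindings.push(HashMap::new());
        self.parents.len() - 1
    }

    fn declare(&mut self, scope: usize, name: &str, symbol_id: usize) {
        self.bindings[scope].insert(name.to_string(), symbol_id);
    }

    /// Resolve a reference by walking up the scope chain;
    /// `None` means the reference is unresolved (e.g. a global).
    fn resolve(&self, mut scope: usize, name: &str) -> Option<usize> {
        loop {
            if let Some(&id) = self.bindings[scope].get(name) {
                return Some(id);
            }
            scope = self.parents[scope]?;
        }
    }
}

fn main() {
    let mut tree = ScopeTree { parents: vec![], bindings: vec![] };
    let top = tree.push_scope(None);
    tree.declare(top, "myFunc", 0);
    let func = tree.push_scope(Some(top));
    tree.declare(func, "x", 1);

    assert_eq!(tree.resolve(func, "x"), Some(1));       // local binding
    assert_eq!(tree.resolve(func, "myFunc"), Some(0));  // found in parent scope
    assert_eq!(tree.resolve(func, "undeclared"), None); // unresolved reference
}
```

References that resolve to no symbol in any enclosing scope end up in `unresolved_references`, which is how globals like `console` are distinguished from locally declared names.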
The Scoping Output and Downstream Consumption
The key output of semantic analysis is the Scoping struct, defined in crates/oxc_semantic/src/scoping.rs#L88-L100. It uses a Struct-of-Arrays (SoA) layout for memory efficiency:
```rust
pub struct Scoping {
    /* Symbol Table - single allocation for all symbol-indexed flat fields */
    symbol_table: SymbolTable,
    pub(crate) references: IndexVec<ReferenceId, Reference>,
    pub(crate) no_side_effects: FxHashSet<SymbolId>,
    /* Scope Tree - single allocation for all scope-indexed flat fields */
    scope_table: ScopeTable,
    // ...
}
```
The ScopeTable and SymbolTable are generated using a multi_index_vec! macro that packs multiple parallel arrays into a single allocation. For example, the ScopeTable at scoping.rs#L42-L54 stores parent IDs, node IDs, and flags as three parallel arrays with a single length and capacity:
```rust
multi_index_vec! {
    struct ScopeTable<ScopeId> {
        parent_ids => parent_ids_mut: Option<ScopeId>,
        node_ids => node_ids_mut: NodeId,
        flags => flags_mut: ScopeFlags,
    }
}
```
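To see what the Struct-of-Arrays layout buys, here is a hand-rolled equivalent. Note one deliberate simplification: this sketch uses three separate `Vec`s (three allocations), while the real `multi_index_vec!` macro packs all columns into a single allocation with one shared length and capacity:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct ScopeFlags(u8); // stand-in for Oxc's bitflags type

/// Struct-of-Arrays layout: three parallel columns indexed by the
/// same scope id, instead of one Vec of three-field structs.
struct ScopeTable {
    parent_ids: Vec<Option<usize>>,
    node_ids: Vec<usize>,
    flags: Vec<ScopeFlags>,
}

impl ScopeTable {
    /// All columns grow in lockstep, so one index addresses a whole "row".
    fn push(&mut self, parent: Option<usize>, node: usize, flags: ScopeFlags) -> usize {
        self.parent_ids.push(parent);
        self.node_ids.push(node);
        self.flags.push(flags);
        self.parent_ids.len() - 1
    }
}

fn main() {
    let mut table = ScopeTable { parent_ids: vec![], node_ids: vec![], flags: vec![] };
    let top = table.push(None, 0, ScopeFlags(1));
    let child = table.push(Some(top), 5, ScopeFlags(2));

    // A pass that only walks parent pointers touches just the
    // parent_ids column, leaving the other columns' memory cold.
    assert_eq!(table.parent_ids[child], Some(top));
    assert_eq!(table.node_ids[child], 5);
}
```

The payoff is cache locality: a pass that only needs parent pointers (like scope-chain resolution) iterates one densely packed column rather than striding over unrelated fields.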
This Scoping struct flows through the entire pipeline. As we saw in the CompilerInterface in Article 1, it passes from semantic analysis to the transformer (which modifies it as it inserts/removes bindings), then to the inject/define plugins, and finally to the mangler and codegen.
```mermaid
flowchart LR
    SB[SemanticBuilder] -->|"into_scoping()"| S[Scoping]
    S -->|"build_with_scoping"| T[Transformer]
    T -->|"transformer_return.scoping"| S2[Updated Scoping]
    S2 --> I[InjectGlobalVariables]
    I --> D[ReplaceGlobalDefines]
    D --> M[Mangler]
    M --> CG[Codegen]
```
Optional Control Flow Graph
When the cfg feature is enabled, SemanticBuilder also constructs a control flow graph (CFG) during traversal. The CFG is defined in crates/oxc_cfg and is used by advanced lint rules that need to reason about reachability. This is gated behind a feature flag because not all consumers need it — the CFG adds overhead to semantic analysis.
The conditional compilation is handled with a macro at builder.rs#L43-L58:
```rust
#[cfg(feature = "cfg")]
macro_rules! control_flow {
    ($self:ident, |$cfg:tt| $body:expr) => {
        if let Some($cfg) = &mut $self.cfg { $body } else { Default::default() }
    };
}
```
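The pattern behind this macro can be exercised standalone. In this sketch, `Cfg` and `Builder` are hypothetical stand-ins; the point is that when the CFG is absent, the body never runs and the expression falls back to `Default::default()`:

```rust
/// Stand-in for a control flow graph under construction.
#[derive(Default)]
struct Cfg {
    blocks: usize,
}

/// Stand-in for SemanticBuilder: the CFG is optional.
struct Builder {
    cfg: Option<Cfg>,
}

// Run the closure body only when the CFG exists; otherwise
// produce a default value of whatever type the body returns.
macro_rules! control_flow {
    ($self:ident, |$cfg:tt| $body:expr) => {
        if let Some($cfg) = &mut $self.cfg { $body } else { Default::default() }
    };
}

fn main() {
    let mut with_cfg = Builder { cfg: Some(Cfg::default()) };
    let id: usize = control_flow!(with_cfg, |cfg| {
        cfg.blocks += 1;
        cfg.blocks
    });
    assert_eq!(id, 1); // the body ran and mutated the CFG

    let mut without_cfg = Builder { cfg: None };
    let id: usize = control_flow!(without_cfg, |cfg| {
        cfg.blocks += 1;
        cfg.blocks
    });
    assert_eq!(id, 0); // body skipped; usize::default() returned
}
```

This keeps every call site a single expression: builder code reads the same whether or not CFG construction is active, with no scattered `if let` boilerplate.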
Tip: If you're writing a lint rule that needs control flow information (like dead code detection), make sure the `cfg` feature is enabled on `oxc_semantic`. Otherwise, the CFG won't be built and your rule won't have data to work with.
The Separation of Concerns
The deliberate split between parser and semantic analyzer is worth emphasizing. In many tools, these concerns are tangled — the parser tries to resolve variables as it goes, making the code harder to maintain and harder to optimize.
In Oxc:
- The parser produces a syntactically valid tree. It knows nothing about scopes or symbols.
- The SemanticBuilder walks that tree and adds semantic meaning. It fills in `scope_id` and `symbol_id` fields via `Cell` (interior mutability), which is why those fields are `Cell<Option<ScopeId>>` in the AST.
This separation means the parser can be optimized independently of semantic analysis, and the semantic analyzer can be rerun cheaply after AST mutations (which the transformer does).
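The `Cell<Option<ScopeId>>` pattern is worth seeing in isolation. In this sketch, `BindingIdentifier` is a toy stand-in for an AST node, and a plain `u32` stands in for `ScopeId`; the key property is that the semantic pass can fill in the field through a shared `&` reference, without mutable access to the whole tree:

```rust
use std::cell::Cell;

/// Toy AST node: the parser leaves scope_id empty, and the
/// semantic pass fills it in later via interior mutability.
struct BindingIdentifier {
    name: String,
    scope_id: Cell<Option<u32>>,
}

fn main() {
    // Parser output: syntax only, no semantic information yet.
    let ident = BindingIdentifier {
        name: "x".into(),
        scope_id: Cell::new(None),
    };
    assert_eq!(ident.scope_id.get(), None);

    // The semantic pass writes through a shared reference:
    // Cell::set needs only &self, not &mut self.
    let shared: &BindingIdentifier = &ident;
    shared.scope_id.set(Some(2));

    assert_eq!(ident.scope_id.get(), Some(2));
    assert_eq!(shared.name, "x");
}
```

Because `Cell::set` takes `&self`, the semantic builder can annotate nodes while traversing the AST with the read-only `Visit` trait, rather than requiring a mutable traversal.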
What's Next
Now that we understand how the AST is produced and given semantic meaning, Article 4 will explore the two systems for walking that AST: the read-only Visit/VisitMut traits used by the semantic builder and lint rules, and the mutable Traverse system used by the transformer and minifier. We'll also see how the ast_tools code generator produces both from annotated AST definitions.