From Source to Meaning: The Parser and Semantic Analyzer
Prerequisites
- Understanding of AST concepts and recursive descent parsing
- Articles 1-2: Architecture and Arena/AST Design
- Familiarity with ECMAScript specification structure
A parser transforms a flat string of characters into a structured tree. A semantic analyzer gives that tree meaning — which variables are in scope, which names refer to which declarations, and where control flow can go. In Oxc, these two stages are deliberately separated: the parser focuses purely on syntax, producing an always-valid AST, while the SemanticBuilder handles the heavier work of scope resolution and symbol binding in a second pass. This separation is a performance choice, and understanding it is key to understanding why Oxc is fast.
Parser Architecture Overview
The parser lives in crates/oxc_parser and is a hand-written recursive descent parser — no parser generator involved. The architecture document explains the rationale: hand-written parsers produce faster code, give better error messages, and are easier to debug than generated ones. The trade-off is more implementation effort, but for a project maintaining long-term compatibility with the ever-evolving ECMAScript spec, this is acceptable.
The main entry point is the Parser struct and its parse() method, documented in crates/oxc_parser/src/lib.rs#L1-L66. The API is minimal:
```rust
let parser_return = Parser::new(&allocator, &source_text, source_type).parse();
```
Three inputs (allocator, source text, source type), one output struct. The ParserReturn at lib.rs#L144-L189 contains:
| Field | Type | Purpose |
|---|---|---|
| `program` | `Program<'a>` | The AST (always present, even with errors) |
| `module_record` | `ModuleRecord<'a>` | Import/export declarations |
| `errors` | `Vec<OxcDiagnostic>` | Syntax errors encountered |
| `tokens` | `Vec<'a, Token>` | Optional token list |
| `panicked` | `bool` | Whether the parser aborted early |
```mermaid
sequenceDiagram
    participant S as Source Text
    participant L as Lexer
    participant P as Parser (recursive descent)
    participant A as Allocator
    participant R as ParserReturn
    S->>L: tokenize
    loop For each grammar production
        L->>P: next token
        P->>A: allocate AST node
    end
    P->>R: Program + errors + ModuleRecord
```
Error Recovery
A critical design property: `program` always contains a structurally valid AST, even when there are syntax errors. The parser recovers from errors by skipping tokens until it finds a synchronization point (like a semicolon or closing brace). This means downstream tools like the linter can always operate on the AST — they just need to check `errors` to know whether the AST is semantically reliable.
When the parser encounters an unrecoverable error, it sets `panicked = true` and returns an empty program. But this is rare — most syntax errors are recoverable.
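The skip-to-synchronization-point strategy can be sketched in a few lines. This is a simplified illustration with a hypothetical token enum, not Oxc's actual recovery code:

```rust
// Hypothetical token kinds for illustration; Oxc's real `Kind` enum is far larger.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    Ident,
    Semicolon,
    RBrace,
    Garbage,
    Eof,
}

/// Skip tokens until a synchronization point (`;`, `}`, or EOF),
/// returning the index where parsing can safely resume.
fn recover(tokens: &[Kind], mut pos: usize) -> usize {
    while pos < tokens.len() {
        match tokens[pos] {
            Kind::Semicolon | Kind::RBrace | Kind::Eof => return pos,
            _ => pos += 1, // discard the offending token and keep scanning
        }
    }
    tokens.len()
}

fn main() {
    let tokens = [Kind::Ident, Kind::Garbage, Kind::Garbage, Kind::Semicolon, Kind::Ident];
    // An error at index 1 recovers at the semicolon (index 3),
    // so the statement after it can still be parsed normally.
    assert_eq!(recover(&tokens, 1), 3);
}
```

Because recovery resumes at a statement boundary, one bad statement costs at most the tokens up to the next synchronization point, and the rest of the file still produces real AST nodes.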
Tip: The maximum parseable file size is `u32::MAX` bytes (~4GB) because `Span` uses `u32` offsets. This constant is defined at lib.rs#L113-L119 and is further constrained to `isize::MAX` on 32-bit systems.
The Lexer and Token Cursor
The parser interacts with the lexer through a cursor abstraction defined in crates/oxc_parser/src/cursor.rs. The cursor provides several essential operations:
```rust
// Check current token kind
pub(crate) fn cur_kind(&self) -> Kind { ... }

// Check if current token matches
pub(crate) fn at(&self, kind: Kind) -> bool { ... }

// Advance to next token
pub(crate) fn advance(&mut self, kind: Kind) { ... }

// Get source text for current token
pub(crate) fn cur_src(&self) -> &'a str { ... }
```
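To make these operations concrete, here is a minimal, self-contained cursor over a token slice. This is a sketch of the pattern only: Oxc's real cursor wraps a live lexer rather than a pre-tokenized slice, and the `Kind` enum here is hypothetical:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind { Let, Ident, Eq, Number, Semicolon, Eof }

/// A minimal token cursor mirroring the operations above.
struct Cursor<'a> {
    tokens: &'a [Kind],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Kind of the current token, or EOF past the end.
    fn cur_kind(&self) -> Kind {
        self.tokens.get(self.pos).copied().unwrap_or(Kind::Eof)
    }

    /// Does the current token match `kind`?
    fn at(&self, kind: Kind) -> bool {
        self.cur_kind() == kind
    }

    /// Advance past the current token, asserting it is the expected kind.
    fn advance(&mut self, kind: Kind) {
        debug_assert!(self.at(kind));
        self.pos += 1;
    }
}

fn main() {
    let mut cur = Cursor { tokens: &[Kind::Let, Kind::Ident, Kind::Eq], pos: 0 };
    assert!(cur.at(Kind::Let));
    cur.advance(Kind::Let);
    assert_eq!(cur.cur_kind(), Kind::Ident);
}
```

Grammar-rule methods in a recursive descent parser are built almost entirely out of these three primitives: peek at the current token, branch on its kind, and advance.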
The cursor also supports checkpointing for backtracking. The ParserCheckpoint at cursor.rs#L14-L21 saves the lexer state, current token, span position, and error count — everything needed to rewind:
```rust
pub struct ParserCheckpoint<'a> {
    lexer: LexerCheckpoint<'a>,
    cur_token: Token,
    prev_span_end: u32,
    errors_pos: usize,
    fatal_error: Option<FatalError>,
}
```
Backtracking is necessary for TypeScript's ambiguous syntax. For example, <T> could be a type parameter, a JSX element, or a less-than comparison depending on context. The parser tries one interpretation, and if it fails, rewinds and tries another.
```mermaid
flowchart TD
    A["Source: <T>(x)"] --> B{Try TypeScript cast}
    B -->|Success| C[TypeAssertionExpression]
    B -->|Fail: checkpoint rewind| D{Try JSX element}
    D -->|Success| E[JSXElement]
    D -->|Fail: checkpoint rewind| F[BinaryExpression with <]
```
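The checkpoint/rewind mechanism reduces to saving a small amount of state and truncating the error list on failure. A minimal sketch, with a toy `Parser` in place of Oxc's real one:

```rust
/// Toy parser state: just a position and an error list.
/// (Oxc's checkpoint additionally saves lexer state and span position.)
struct Parser {
    pos: usize,
    errors: Vec<String>,
}

/// Everything needed to rewind a speculative parse.
struct Checkpoint {
    pos: usize,
    errors_pos: usize,
}

impl Parser {
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { pos: self.pos, errors_pos: self.errors.len() }
    }

    /// Restore position and discard errors recorded since the checkpoint,
    /// so a failed speculative parse leaves no trace.
    fn rewind(&mut self, cp: Checkpoint) {
        self.pos = cp.pos;
        self.errors.truncate(cp.errors_pos);
    }
}

fn main() {
    let mut p = Parser { pos: 3, errors: vec![] };
    let cp = p.checkpoint();

    // Speculatively try one interpretation (e.g. a TypeScript cast)...
    p.pos = 9;
    p.errors.push("expected '>'".into());

    // ...it failed, so rewind and try the next interpretation from pos 3.
    p.rewind(cp);
    assert_eq!(p.pos, 3);
    assert!(p.errors.is_empty());
}
```

Truncating the error list is the subtle part: diagnostics produced during a failed speculative parse must not leak into the final `ParserReturn`.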
JS, TypeScript, and JSX Parsing
The parser organizes its grammar rules into three subdirectories:
| Directory | Coverage |
|---|---|
| `crates/oxc_parser/src/js/` | Core ECMAScript (statements, expressions, functions, classes, modules) |
| `crates/oxc_parser/src/ts/` | TypeScript extensions (types, interfaces, enums, decorators) |
| `crates/oxc_parser/src/jsx/` | JSX/TSX syntax |
This organization mirrors the language layering: TypeScript extends JavaScript, and JSX extends both. The TypeScript parsing rules call into the JS rules for shared productions.
An important performance detail from the architecture doc: the parser deliberately avoids doing scope binding or symbol resolution. Things like checking whether a variable has been declared in the current scope, or resolving which declaration an identifier refers to, are all deferred to the semantic analyzer. This keeps the parser focused on one thing — building a syntactically correct AST — and keeps it fast.
SemanticBuilder: Scopes, Symbols, and References
After parsing, the next stage in the pipeline (as we saw in the CompilerInterface in Article 1) is semantic analysis. The SemanticBuilder at crates/oxc_semantic/src/builder.rs#L68-L120 traverses the parsed AST using the Visit trait and constructs three key data structures:
- Scope tree — A tree of lexical scopes with parent pointers and flags
- Symbol table — All declared names (variables, functions, classes, types) with their flags and locations
- Reference list — All identifier references, resolved to their corresponding symbols
```rust
pub struct SemanticBuilder<'a> {
    pub(crate) source_text: &'a str,
    pub(crate) source_type: SourceType,
    pub(crate) errors: RefCell<Vec<OxcDiagnostic>>,
    pub(crate) current_scope_id: ScopeId,
    pub(crate) nodes: AstNodes<'a>,
    pub(crate) scoping: Scoping,
    pub(crate) unresolved_references: UnresolvedReferences<'a>,
    // ...
}
```
The builder implements Visit (the read-only AST visitor we'll detail in Article 4), walking the AST top-down. At each scope-creating node (function, block, for loop, etc.), it pushes a new scope onto the scope tree. At each binding (variable declaration, function parameter, import specifier), it creates a symbol. At each identifier reference, it attempts to resolve the reference to a symbol in an enclosing scope.
```mermaid
sequenceDiagram
    participant AST as Parsed AST
    participant SB as SemanticBuilder
    participant ST as Scope Tree
    participant SYM as Symbol Table
    participant REF as References
    AST->>SB: visit(Program)
    SB->>ST: push_scope(Top)
    AST->>SB: visit(FunctionDeclaration)
    SB->>SYM: declare_symbol("myFunc")
    SB->>ST: push_scope(Function)
    AST->>SB: visit(BindingIdentifier "x")
    SB->>SYM: declare_symbol("x")
    AST->>SB: visit(IdentifierReference "x")
    SB->>REF: resolve("x") → SymbolId
    SB->>ST: pop_scope()
```
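The core of scope-chain resolution is a walk up parent pointers. Here is a compact sketch under simplifying assumptions: it uses `HashMap`s and plain `usize` ids, whereas Oxc uses flat index vectors and typed `ScopeId`/`SymbolId` newtypes:

```rust
use std::collections::HashMap;

/// Minimal scope tree: each scope has an optional parent and a
/// name-to-symbol-id map.
struct ScopeTree {
    parents: Vec<Option<usize>>,
    bindings: Vec<HashMap<String, usize>>,
}

impl ScopeTree {
    fn push_scope(&mut self, parent: Option<usize>) -> usize {
        self.parents.push(parent);
        self.bindings.push(HashMap::new());
        self.parents.len() - 1
    }

    fn declare(&mut self, scope: usize, name: &str, symbol_id: usize) {
        self.bindings[scope].insert(name.to_string(), symbol_id);
    }

    /// Resolve a reference by walking up the scope chain;
    /// `None` means the reference is unresolved (e.g. a global).
    fn resolve(&self, mut scope: usize, name: &str) -> Option<usize> {
        loop {
            if let Some(&id) = self.bindings[scope].get(name) {
                return Some(id);
            }
            scope = self.parents[scope]?;
        }
    }
}

fn main() {
    let mut tree = ScopeTree { parents: vec![], bindings: vec![] };
    let top = tree.push_scope(None);
    tree.declare(top, "myFunc", 0);
    let func = tree.push_scope(Some(top));
    tree.declare(func, "x", 1);

    assert_eq!(tree.resolve(func, "x"), Some(1));       // local binding
    assert_eq!(tree.resolve(func, "myFunc"), Some(0));  // found in parent scope
    assert_eq!(tree.resolve(func, "undeclared"), None); // unresolved reference
}
```

References that resolve to no symbol in any enclosing scope end up in `unresolved_references`, which is how globals like `console` are distinguished from locally declared names.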
The Scoping Output and Downstream Consumption
The key output of semantic analysis is the Scoping struct, defined in crates/oxc_semantic/src/scoping.rs#L88-L100. It uses a Struct-of-Arrays (SoA) layout for memory efficiency:
```rust
pub struct Scoping {
    /* Symbol Table - single allocation for all symbol-indexed flat fields */
    symbol_table: SymbolTable,
    pub(crate) references: IndexVec<ReferenceId, Reference>,
    pub(crate) no_side_effects: FxHashSet<SymbolId>,
    /* Scope Tree - single allocation for all scope-indexed flat fields */
    scope_table: ScopeTable,
    // ...
}
```
The ScopeTable and SymbolTable are generated using a multi_index_vec! macro that packs multiple parallel arrays into a single allocation. For example, the ScopeTable at scoping.rs#L42-L54 stores parent IDs, node IDs, and flags as three parallel arrays with a single length and capacity:
```rust
multi_index_vec! {
    struct ScopeTable<ScopeId> {
        parent_ids => parent_ids_mut: Option<ScopeId>,
        node_ids => node_ids_mut: NodeId,
        flags => flags_mut: ScopeFlags,
    }
}
```
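To see what the Struct-of-Arrays layout buys, here is a hand-rolled equivalent. Note one deliberate simplification: this sketch uses three separate `Vec`s (three allocations), while the real `multi_index_vec!` macro packs all columns into a single allocation with one shared length and capacity:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct ScopeFlags(u8); // stand-in for Oxc's bitflags type

/// Struct-of-Arrays layout: three parallel columns indexed by the
/// same scope id, instead of one Vec of three-field structs.
struct ScopeTable {
    parent_ids: Vec<Option<usize>>,
    node_ids: Vec<usize>,
    flags: Vec<ScopeFlags>,
}

impl ScopeTable {
    /// All columns grow in lockstep, so one index addresses a whole "row".
    fn push(&mut self, parent: Option<usize>, node: usize, flags: ScopeFlags) -> usize {
        self.parent_ids.push(parent);
        self.node_ids.push(node);
        self.flags.push(flags);
        self.parent_ids.len() - 1
    }
}

fn main() {
    let mut table = ScopeTable { parent_ids: vec![], node_ids: vec![], flags: vec![] };
    let top = table.push(None, 0, ScopeFlags(1));
    let child = table.push(Some(top), 5, ScopeFlags(2));

    // A pass that only walks parent pointers touches just the
    // parent_ids column, leaving the other columns' memory cold.
    assert_eq!(table.parent_ids[child], Some(top));
    assert_eq!(table.node_ids[child], 5);
}
```

The payoff is cache locality: a pass that only needs parent pointers (like scope-chain resolution) iterates one densely packed column rather than striding over unrelated fields.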
This Scoping struct flows through the entire pipeline. As we saw in the CompilerInterface in Article 1, it passes from semantic analysis to the transformer (which modifies it as it inserts/removes bindings), then to the inject/define plugins, and finally to the mangler and codegen.
```mermaid
flowchart LR
    SB[SemanticBuilder] -->|"into_scoping()"| S[Scoping]
    S -->|"build_with_scoping"| T[Transformer]
    T -->|"transformer_return.scoping"| S2[Updated Scoping]
    S2 --> I[InjectGlobalVariables]
    I --> D[ReplaceGlobalDefines]
    D --> M[Mangler]
    M --> CG[Codegen]
```
Optional Control Flow Graph
When the cfg feature is enabled, SemanticBuilder also constructs a control flow graph (CFG) during traversal. The CFG is defined in crates/oxc_cfg and is used by advanced lint rules that need to reason about reachability. This is gated behind a feature flag because not all consumers need it — the CFG adds overhead to semantic analysis.
The conditional compilation is handled with a macro at builder.rs#L43-L58:
```rust
#[cfg(feature = "cfg")]
macro_rules! control_flow {
    ($self:ident, |$cfg:tt| $body:expr) => {
        if let Some($cfg) = &mut $self.cfg { $body } else { Default::default() }
    };
}
```
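The pattern behind this macro can be exercised standalone. In this sketch, `Cfg` and `Builder` are hypothetical stand-ins; the point is that when the CFG is absent, the body never runs and the expression falls back to `Default::default()`:

```rust
/// Stand-in for a control flow graph under construction.
#[derive(Default)]
struct Cfg {
    blocks: usize,
}

/// Stand-in for SemanticBuilder: the CFG is optional.
struct Builder {
    cfg: Option<Cfg>,
}

// Run the closure body only when the CFG exists; otherwise
// produce a default value of whatever type the body returns.
macro_rules! control_flow {
    ($self:ident, |$cfg:tt| $body:expr) => {
        if let Some($cfg) = &mut $self.cfg { $body } else { Default::default() }
    };
}

fn main() {
    let mut with_cfg = Builder { cfg: Some(Cfg::default()) };
    let id: usize = control_flow!(with_cfg, |cfg| {
        cfg.blocks += 1;
        cfg.blocks
    });
    assert_eq!(id, 1); // the body ran and mutated the CFG

    let mut without_cfg = Builder { cfg: None };
    let id: usize = control_flow!(without_cfg, |cfg| {
        cfg.blocks += 1;
        cfg.blocks
    });
    assert_eq!(id, 0); // body skipped; usize::default() returned
}
```

This keeps every call site a single expression: builder code reads the same whether or not CFG construction is active, with no scattered `if let` boilerplate.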
Tip: If you're writing a lint rule that needs control flow information (like dead code detection), make sure the `cfg` feature is enabled on `oxc_semantic`. Otherwise, the CFG won't be built and your rule won't have data to work with.
The Separation of Concerns
The deliberate split between parser and semantic analyzer is worth emphasizing. In many tools, these concerns are tangled — the parser tries to resolve variables as it goes, making the code harder to maintain and harder to optimize.
In Oxc:
- The parser produces a syntactically valid tree. It knows nothing about scopes or symbols.
- The SemanticBuilder walks that tree and adds semantic meaning. It fills in `scope_id` and `symbol_id` fields via `Cell` (interior mutability), which is why those fields are `Cell<Option<ScopeId>>` in the AST.
This separation means the parser can be optimized independently of semantic analysis, and the semantic analyzer can be rerun cheaply after AST mutations (which the transformer does).
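The `Cell<Option<ScopeId>>` pattern is worth seeing in isolation. In this sketch, `BindingIdentifier` is a toy stand-in for an AST node, and a plain `u32` stands in for `ScopeId`; the key property is that the semantic pass can fill in the field through a shared `&` reference, without mutable access to the whole tree:

```rust
use std::cell::Cell;

/// Toy AST node: the parser leaves scope_id empty, and the
/// semantic pass fills it in later via interior mutability.
struct BindingIdentifier {
    name: String,
    scope_id: Cell<Option<u32>>,
}

fn main() {
    // Parser output: syntax only, no semantic information yet.
    let ident = BindingIdentifier {
        name: "x".into(),
        scope_id: Cell::new(None),
    };
    assert_eq!(ident.scope_id.get(), None);

    // The semantic pass writes through a shared reference:
    // Cell::set needs only &self, not &mut self.
    let shared: &BindingIdentifier = &ident;
    shared.scope_id.set(Some(2));

    assert_eq!(ident.scope_id.get(), Some(2));
    assert_eq!(shared.name, "x");
}
```

Because `Cell::set` takes `&self`, the semantic builder can annotate nodes while traversing the AST with the read-only `Visit` trait, rather than requiring a mutable traversal.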
What's Next
Now that we understand how the AST is produced and given semantic meaning, Article 4 will explore the two systems for walking that AST: the read-only Visit/VisitMut traits used by the semantic builder and lint rules, and the mutable Traverse system used by the transformer and minifier. We'll also see how the ast_tools code generator produces both from annotated AST definitions.