Read OSS

From Source Text to AST: The Scanner, Parser, and Node System

Advanced

Prerequisites

  • Article 1: Architecture & Codebase Map
  • Understanding of lexer/tokenizer concepts and recursive descent parsing
  • Familiarity with TypeScript syntax (generics, JSX, decorators, template literals)

Every TypeScript compilation begins with a string of source text and ends — at least for the frontend — with a SourceFile: a fully-formed abstract syntax tree. Getting there requires two cooperating modules: a Scanner that breaks text into tokens and a Parser that assembles those tokens into a tree. Together, they account for roughly 15,000 lines of code and handle one of the most complex grammars in the programming language world — TypeScript's syntax is a superset of JavaScript's, with generics, JSX, decorators, template literal types, and satisfies expressions layered on top.

This article dissects both modules, traces the SyntaxKind classification system that labels every node, and maps out the Node type hierarchy that organizes the AST.

SyntaxKind: The Universal Node Classification

Everything in the TypeScript AST is classified by a single const enum: SyntaxKind. Defined starting at src/compiler/types.ts#L40, it begins with trivia and tokens, progresses through keywords, and eventually covers every possible expression, statement, declaration, and JSDoc node.

flowchart TD
    SK["SyntaxKind (const enum)"]
    SK --> Trivia["Trivia (0-8)<br/>Comments, Whitespace, Shebang"]
    SK --> Literals["Literals (9-15)<br/>NumericLiteral, StringLiteral, RegExp"]
    SK --> Punctuation["Punctuation (16-80)<br/>Braces, Operators, Arrows"]
    SK --> Keywords["Keywords (81-165)<br/>if, class, const, type, ..."]
    SK --> TypeNodes["Type Nodes<br/>TypeReference, UnionType, ConditionalType"]
    SK --> Expressions["Expressions<br/>CallExpression, BinaryExpression, ..."]
    SK --> Statements["Statements<br/>IfStatement, ForStatement, ReturnStatement"]
    SK --> Declarations["Declarations<br/>ClassDeclaration, FunctionDeclaration, ..."]
    SK --> JSDoc["JSDoc Nodes<br/>JSDocComment, JSDocTag, JSDocTypeExpression"]

A key design rule: token > SyntaxKind.Identifier means the token is a keyword. This single comparison powers fast keyword detection throughout the parser and scanner.
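Because keywords occupy a contiguous range laid out after Identifier, keyword classification reduces to integer comparisons. A quick spot check against the published enum (the FirstKeyword/LastKeyword marker members are part of the public API; the exact numeric values shift between compiler versions, so we compare kinds rather than hardcode numbers):

```typescript
import * as ts from "typescript";

// Keywords occupy a contiguous range, so classification is integer math.
const isKeyword = (k: ts.SyntaxKind) =>
  k >= ts.SyntaxKind.FirstKeyword && k <= ts.SyntaxKind.LastKeyword;

// Keywords sort after Identifier; punctuation sorts before it.
console.log(ts.SyntaxKind.ConstKeyword > ts.SyntaxKind.Identifier); // true
console.log(isKeyword(ts.SyntaxKind.ConstKeyword));                 // true
console.log(isKeyword(ts.SyntaxKind.PlusToken));                    // false
```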

The enum is supplemented by union type aliases that group related kinds. TypeNodeSyntaxKind collects all type-position syntax kinds. TokenSyntaxKind covers everything the scanner can produce. JsxTokenSyntaxKind and JSDocSyntaxKind cover the specialized scanning modes. These types serve as constraints on what the scanner can return in different contexts.

Alongside SyntaxKind, the NodeFlags enum carries per-node metadata: whether a variable is let/const/using, whether the node was synthesized during transformation, whether it was parsed in an await or yield context, and flags that track errors and import presence.
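The let/const distinction is a good illustration: both parse to the same VariableDeclarationList kind and differ only in flags. A small sketch using the published API:

```typescript
import * as ts from "typescript";

const sf = ts.createSourceFile("flags.ts", "const x = 1; let y = 2;", ts.ScriptTarget.Latest);
const [first, second] = sf.statements.filter(ts.isVariableStatement);

// NodeFlags is a bit field; test membership with a mask.
const firstIsConst = (first.declarationList.flags & ts.NodeFlags.Const) !== 0; // true
const secondIsLet = (second.declarationList.flags & ts.NodeFlags.Let) !== 0;   // true
```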

The Scanner: Stateful Tokenization

The scanner is defined in src/compiler/scanner.ts — roughly 4,100 lines of code encapsulated in a single closure factory.

The Scanner Interface

The public Scanner interface reveals the design: it's a stateful cursor over text. You call scan() to advance to the next token, then query methods like getToken(), getTokenStart(), getTokenEnd(), and getTokenValue() to inspect the current position. There is no array of tokens — the parser pulls one token at a time.
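Driving the scanner by hand makes the pull model concrete. This uses the published compiler API; the parser issues essentially these same calls:

```typescript
import * as ts from "typescript";

const scanner = ts.createScanner(ts.ScriptTarget.Latest, /*skipTrivia*/ true);
scanner.setText("const answer = 42;");

// Pull tokens one at a time until end of file — there is no token array.
const tokens: string[] = [];
let kind = scanner.scan();
while (kind !== ts.SyntaxKind.EndOfFileToken) {
  tokens.push(scanner.getTokenText());
  kind = scanner.scan();
}
// tokens is ["const", "answer", "=", "42", ";"]
```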

The interface also exposes a family of reScan* methods: reScanGreaterToken(), reScanSlashToken(), reScanTemplateToken(), reScanJsxToken(), and others. These exist because TypeScript's grammar is context-sensitive — a > might be a greater-than operator or the end of a type argument list, a / might be division or the start of a regex, and } might close a block or open a template literal span.

createScanner: The Closure Pattern

flowchart LR
    CS["createScanner()"] --> Closure["Closure with var locals"]
    Closure --> pos["var pos: number"]
    Closure --> endVar["var end: number"]
    Closure --> token["var token: SyntaxKind"]
    Closure --> tokenValue["var tokenValue: string"]
    Closure --> scan["scan() → SyntaxKind"]
    Closure --> reScan["reScan*() methods"]

createScanner() creates a scanner by closing over var locals for position, token state, and error handling. The function opens with the now-familiar TDZ-avoidance comment:

// Why var? It avoids TDZ checks in the runtime which can be costly.
// See: https://github.com/microsoft/TypeScript/issues/52924

Using var instead of let inside a closure avoids the runtime cost of temporal dead zone checks. In a hot function called millions of times per compilation, this measurably improves performance. The entire compiler uses this pattern.

The core scan() function is a massive switch statement over the current character code. It handles everything from simple single-character tokens ({, }, (, )) to multi-character operators (===, >>>=), string literals with escape sequences, template literal spans, numeric literals (including 0x, 0o, 0b, and BigInt n suffixes), and regular expressions.
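The closure-factory shape and the character-code switch can be sketched in miniature. Everything here is hypothetical (MiniKind, createMiniScanner); the real scanner additionally handles trivia, Unicode identifiers, escapes, and well over a hundred more token kinds:

```typescript
// Hypothetical, drastically simplified sketch of the scanner's shape.
enum MiniKind { EOF, OpenBrace, CloseBrace, Equals, EqualsEquals, Number, Unknown }

function createMiniScanner(text: string) {
  var pos = 0;                 // var, not let: mirrors the TDZ-avoidance pattern
  var token = MiniKind.EOF;
  var tokenValue = "";

  function scan(): MiniKind {
    if (pos >= text.length) return token = MiniKind.EOF;
    const ch = text.charCodeAt(pos);
    switch (ch) {
      case 0x7b: pos++; return token = MiniKind.OpenBrace;   // {
      case 0x7d: pos++; return token = MiniKind.CloseBrace;  // }
      case 0x3d: // "=" or "=="
        if (text.charCodeAt(pos + 1) === 0x3d) { pos += 2; return token = MiniKind.EqualsEquals; }
        pos++; return token = MiniKind.Equals;
      default:
        if (ch >= 0x30 && ch <= 0x39) { // digits 0-9
          const start = pos;
          while (pos < text.length && text.charCodeAt(pos) >= 0x30 && text.charCodeAt(pos) <= 0x39) pos++;
          tokenValue = text.slice(start, pos);
          return token = MiniKind.Number;
        }
        pos++; return token = MiniKind.Unknown;
    }
  }
  return { scan, getToken: () => token, getTokenValue: () => tokenValue };
}
```

Note how multi-character operators are resolved by peeking one character ahead inside the switch arm, exactly as the real scanner does for `===` and `>>>=`.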

Scanning Modes

The scanner supports multiple modes for different syntactic contexts:

  • Normal mode: Standard TypeScript/JavaScript tokenization
  • JSX mode: scanJsxToken() and reScanJsxToken() handle JSX text content and element boundaries
  • Template literal mode: reScanTemplateToken() switches between template spans and embedded expressions
  • Regular expression mode: reScanSlashToken() re-interprets / as a regex delimiter, with full validation of regex syntax including named groups and Unicode property escapes
  • JSDoc mode: scanJsDocToken() handles the simplified tokenization needed for JSDoc comments

Tip: When the parser encounters an ambiguous token, it doesn't backtrack in the scanner — instead it calls the appropriate reScan* method to reinterpret the current token in context. This is far cheaper than maintaining scanner snapshots.
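The rescanning dance is observable through the public scanner API. Out of context, `/` scans as an operator; when the parser knows a regex is grammatically possible, it asks for a rescan of the same position:

```typescript
import * as ts from "typescript";

const scanner = ts.createScanner(ts.ScriptTarget.Latest, /*skipTrivia*/ true);
scanner.setText("/abc/g");

let token = scanner.scan();
console.log(token === ts.SyntaxKind.SlashToken); // true: scanned as an operator

// The parser decides a regex can start here and reinterprets the token.
token = scanner.reScanSlashToken();
console.log(token === ts.SyntaxKind.RegularExpressionLiteral); // true
console.log(scanner.getTokenText()); // "/abc/g"
```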

The Parser: Recursive Descent AST Construction

The parser lives in src/compiler/parser.ts — roughly 10,800 lines of recursive descent parsing. Its main public entry point is createSourceFile().

createSourceFile: The Entry Point

createSourceFile() takes a filename, source text, language version, and optional script kind, then delegates to Parser.parseSourceFile(). The function handles the distinction between JSON files (parsed with ScriptKind.JSON) and regular TypeScript/JavaScript files, and sets the impliedNodeFormat for Node.js module resolution (ESM vs CJS).
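Calling the entry point directly requires no Program or host — parsing is a pure function of the text and options:

```typescript
import * as ts from "typescript";

const sf = ts.createSourceFile(
  "demo.ts",
  "const x: number = 1;\nif (x) { console.log(x); }",
  ts.ScriptTarget.Latest,
  /*setParentNodes*/ true,
);

// Two top-level statements: a VariableStatement and an IfStatement.
const kinds = sf.statements.map(s => s.kind);
```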

The parsing is wrapped with performance marks (beforeParse/afterParse) and optional tracing — the same instrumentation pattern used throughout the compiler for profiling.

The Recursive Descent Structure

The parser follows standard recursive descent: each grammar production gets its own function. parseStatement() dispatches based on the current token to parseIfStatement(), parseReturnStatement(), parseVariableStatement(), etc. Expression parsing uses precedence climbing via parseBinaryExpressionOrHigher(). Type parsing has its own parallel hierarchy: parseType(), parseUnionTypeOrHigher(), parseIntersectionTypeOrHigher(), and so on.
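The precedence-climbing shape can be sketched in miniature. This is a hypothetical mini-grammar (numeric literals joined by + and *); the real parseBinaryExpressionOrHigher consults an operator-precedence table and handles dozens of operators, associativity, and special cases like `in` and `as`:

```typescript
// Toy illustration of precedence climbing over a pre-tokenized input.
type Expr = number | { op: string; left: Expr; right: Expr };

function parseExpression(tokens: string[]): Expr {
  let i = 0;
  const peek = () => tokens[i];
  const precedence = (op: string | undefined) => (op === "+" ? 1 : op === "*" ? 2 : 0);

  // The real parser handles literals, parens, unary operators, calls, ...
  const parsePrimary = (): Expr => Number(tokens[i++]);

  // Consume operators that bind tighter than the current level.
  function parseBinaryOrHigher(minPrecedence: number): Expr {
    let left = parsePrimary();
    while (precedence(peek()) > minPrecedence) {
      const op = tokens[i++];
      // Parse the right side at this operator's precedence so that
      // higher-precedence operators bind tighter.
      const right = parseBinaryOrHigher(precedence(op));
      left = { op, left, right };
    }
    return left;
  }
  return parseBinaryOrHigher(0);
}

// "*" binds tighter than "+", so 1 + 2 * 3 parses as 1 + (2 * 3).
const tree = parseExpression(["1", "+", "2", "*", "3"]);
```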

sequenceDiagram
    participant E as External Caller
    participant P as Parser
    participant S as Scanner
    participant F as NodeFactory

    E->>P: createSourceFile(fileName, text)
    P->>S: createScanner(...)
    P->>S: scan() → first token
    loop For each statement
        P->>S: getToken()
        P->>P: parseStatement()
        P->>F: factory.createIfStatement(...)
        P->>S: scan() → next token
    end
    P->>F: factory.createSourceFile(statements)
    P-->>E: SourceFile

Error Recovery and Speculation

The parser must be resilient — it needs to produce a usable AST even from broken code, since the language service depends on it for editor features. Several strategies make this work:

Missing tokens: When an expected token isn't found, the parser creates a "missing" node — a synthetic node with zero width at the current position — and reports an error. Parsing continues.

Lookahead/Speculation: For ambiguous syntax (is <T> a type argument list or a JSX element?), the parser uses lookAhead() and tryParse(). lookAhead() speculatively parses ahead and always restores the scanner state afterward, keeping only the result. tryParse() keeps the parsed state when the speculation succeeds and rolls back only on failure.

Error recovery via list parsing: The parser's list-parsing functions (used for statement lists, argument lists, parameter lists, etc.) know how to skip tokens that don't belong, using a set of expected token kinds to determine when to stop or resynchronize.
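The resilience is easy to observe with the published API — a parse never throws on bad input. (The parseDiagnostics field is internal in the published typings, hence the cast below.)

```typescript
import * as ts from "typescript";

// "const x = ;" is missing its initializer expression — but we still
// get a full SourceFile back rather than an exception.
const sf = ts.createSourceFile("broken.ts", "const x = ;", ts.ScriptTarget.Latest);

console.log(sf.statements.length); // 1: the VariableStatement survives
// Parse errors accumulate on the file instead of aborting the parse.
const diagnostics = (sf as any).parseDiagnostics as ts.Diagnostic[];
console.log(diagnostics.length > 0); // true
```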

The Node Hierarchy and SourceFile

With SyntaxKind classifying every node and the parser creating them, the remaining question is how the type system organizes AST nodes.

Sub-interfaces for Cross-Cutting Concerns

Not every Node has the same capabilities. Rather than a single interface with optional properties, the TypeScript AST uses sub-interfaces:

classDiagram
    class Node {
        +kind: SyntaxKind
        +flags: NodeFlags
        +parent: Node
        +pos: number
        +end: number
    }
    class JSDocContainer {
        +jsDoc: JSDoc[]
    }
    class LocalsContainer {
        +locals: SymbolTable
        +nextContainer: HasLocals
    }
    class FlowContainer {
        +flowNode: FlowNode
    }
    class Declaration {
        +symbol: Symbol
        +localSymbol: Symbol
    }
    Node <|-- JSDocContainer
    Node <|-- LocalsContainer
    Node <|-- FlowContainer
    Node <|-- Declaration

LocalsContainer nodes have a locals symbol table — these are the nodes that introduce a new scope (functions, blocks, modules). FlowContainer nodes participate in control flow analysis. Declaration nodes have an associated symbol.

This separation came from a deliberate refactoring that moved symbol, locals, and flowNode off the base Node interface and onto the sub-interfaces that actually use them — reducing memory waste for the many node types that never need these fields.

The SourceFile Interface

SourceFile extends both Declaration and LocalsContainer, making it both the root AST node and a container for file-level symbols. It carries:

  • statements — the top-level statement list
  • fileName and path — file identity
  • text — the original source text (needed for error reporting and the language service)
  • languageVersion and languageVariant — affecting parsing behavior
  • isDeclarationFile — whether this is a .d.ts file, which gets special treatment
  • impliedNodeFormat — ESM vs CJS for Node.js module resolution
  • Module indicators: externalModuleIndicator, commonJsModuleIndicator

The SourceFile is the primary unit of work throughout the compiler. The Program manages a set of SourceFile objects; the checker processes one SourceFile at a time; the emitter transforms and prints each SourceFile independently.
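A few of these fields can be inspected directly on a freshly parsed file:

```typescript
import * as ts from "typescript";

const source = "export declare const version: string;";
const sf = ts.createSourceFile("api.d.ts", source, ts.ScriptTarget.ES2022);

console.log(sf.fileName);          // "api.d.ts"
console.log(sf.isDeclarationFile); // true: inferred from the extension
console.log(sf.text === source);   // true: the original text is retained
console.log(sf.languageVersion === ts.ScriptTarget.ES2022); // true
```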

NodeFactory: Typed AST Construction

The parser doesn't call new to create nodes — it uses the NodeFactory, a comprehensive set of typed factory functions. factory.createIfStatement(expression, thenStatement, elseStatement) creates an IfStatement node with the correct kind, properly typed children, and TransformFlags pre-computed.

The factory is essential not just for the parser but also for the transformer pipeline (Article 5), which creates new synthetic nodes during downleveling. Having a single factory ensures nodes are consistently shaped and flagged, regardless of whether they originate from parsing or transformation.

Tip: When reading transform code, look for factory.create* and factory.update* calls. The update variants create a new node only if the children actually changed, enabling efficient structural sharing.
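Both behaviors — typed creation and identity-preserving updates — are visible through the public factory (printed output formatting may vary slightly by version):

```typescript
import * as ts from "typescript";

// Build `if (ready) { return; }` from factory calls alone.
const stmt = ts.factory.createIfStatement(
  ts.factory.createIdentifier("ready"),
  ts.factory.createBlock([ts.factory.createReturnStatement()], /*multiLine*/ true),
);

// Synthetic nodes print via the emitter's printer.
const printer = ts.createPrinter();
const dummy = ts.createSourceFile("out.ts", "", ts.ScriptTarget.Latest);
const printed = printer.printNode(ts.EmitHint.Unspecified, stmt, dummy);

// update* returns the original node when no child changed — structural sharing.
const same = ts.factory.updateIfStatement(stmt, stmt.expression, stmt.thenStatement, stmt.elseStatement);
console.log(same === stmt); // true
```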

What's Next

We now have source text turned into a complete AST — a tree of Node objects classified by SyntaxKind. But an AST alone doesn't tell you what names mean or how scopes nest. In Part 3, we'll follow the Binder as it walks this AST to create Symbols, populate symbol tables, and construct the control flow graph that makes type narrowing possible.