From Source to Opcodes: PHP's Lexer, Parser, AST, and Compiler
Prerequisites
- Articles 1-2: Architecture context and zval/data structure knowledge
- Basic compiler theory (lexer, parser, AST concepts)
- Understanding of bitfield flags and enumerations
In the previous articles, we mapped php-src's architecture and dissected its core data structures. Now we follow a PHP source file through the pipeline that transforms it into executable opcodes. This pipeline runs every time PHP encounters a require, include, or the main script — unless OPcache intervenes with a cached result.
The compilation pipeline is surprisingly traditional in design: a lexer feeds tokens to a parser, the parser builds an AST, and a compiler walks the AST to emit a flat array of opcodes. What makes it interesting is the scale (the parser grammar has hundreds of productions, the AST has ~130 node kinds, and the compiler is one of the largest files in the repository) and the practical engineering choices that keep it fast enough to run on every request.
Pipeline Overview
The compilation pipeline has four stages, orchestrated by zend_compile_file():
flowchart LR
SRC["PHP Source<br/>(.php file)"] --> LEX["Lexer<br/>(re2c)<br/>zend_language_scanner.l"]
LEX --> |"Token stream"| PARSE["Parser<br/>(Bison)<br/>zend_language_parser.y"]
PARSE --> |"AST"| COMP["Compiler<br/>zend_compile.c"]
COMP --> |"zend_op_array"| VM["VM Execution<br/>or OPcache storage"]
OPCACHE["OPcache Hook"] -.-> |"Replaces<br/>zend_compile_file"| SRC
OPCACHE -.-> |"Returns cached"| VM
The critical extensibility point is that zend_compile_file is a global function pointer. OPcache replaces it during startup, intercepting compilation calls to return cached op_arrays from shared memory. This is the "hookable function pointer" pattern we'll explore more in Article 4.
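To make the hook pattern concrete, here is a minimal sketch in plain C. Everything prefixed with toy_ is hypothetical and stands in for the real engine symbols; the one-entry "cache" is an assumption for brevity and bears no resemblance to OPcache's shared-memory storage. The point is only the shape: save the previous pointer, install your own, fall back on a miss.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for zend_op_array; the real struct is far larger. */
typedef struct { const char *source; int from_cache; } toy_op_array;

static toy_op_array *toy_default_compile(const char *filename);

/* Global hook, analogous in spirit to zend_compile_file. */
static toy_op_array *(*toy_compile_file)(const char *filename) = toy_default_compile;

static toy_op_array compiled_slot;  /* result of a "real" compile */
static toy_op_array cached_slot;    /* one-entry cache (assumption for brevity) */
static const char *cached_name;     /* NULL until something is cached */
static int compile_count;

static toy_op_array *toy_default_compile(const char *filename) {
    compile_count++;
    compiled_slot.source = filename;
    compiled_slot.from_cache = 0;
    return &compiled_slot;
}

/* The cache remembers the previous hook so it can fall back on a miss. */
static toy_op_array *(*toy_orig_compile_file)(const char *filename);

static toy_op_array *toy_cached_compile(const char *filename) {
    if (cached_name != NULL && strcmp(cached_name, filename) == 0) {
        return &cached_slot;        /* hit: no lexing, parsing, or compiling */
    }
    toy_op_array *fresh = toy_orig_compile_file(filename);
    cached_slot = *fresh;
    cached_slot.from_cache = 1;
    cached_name = filename;
    return fresh;
}

/* Called once at startup by the caching extension. */
static void toy_cache_install(void) {
    assert(toy_compile_file != toy_cached_compile);  /* install only once */
    toy_orig_compile_file = toy_compile_file;
    toy_compile_file = toy_cached_compile;
}
```

Callers never change: they keep calling through toy_compile_file and do not need to know whether a cache is installed.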
The compiler lives primarily in Zend/zend_compile.c — one of the largest source files in the repository at over 10,000 lines. Let's trace data through each stage.
The re2c Lexer
The lexer is defined in Zend/zend_language_scanner.l, a re2c input file. Unlike flex/lex, re2c generates a direct-coded DFA — no tables, just goto statements and character comparisons. This produces a scanner that's significantly faster than table-driven alternatives.
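For a feel of what "direct-coded" means, here is a hand-written sketch of a two-token scanner in that style. It is not re2c output and the token names are hypothetical; it only shows how each DFA state becomes a label and each transition a comparison plus a goto, with no transition table in sight.

```c
#include <assert.h>
#include <stddef.h>

enum toy_token { TOY_T_ERROR, TOY_T_LNUMBER, TOY_T_STRING };

/* Scan one token starting at *cursor and advance *cursor past it.
 * On TOY_T_ERROR the cursor is left unchanged. */
static enum toy_token toy_scan(const char **cursor) {
    const char *p = *cursor;
    assert(p != NULL);
    char c = *p;

    /* Start state: branch directly on the first character. */
    if (c >= '0' && c <= '9') goto in_number;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_') goto in_ident;
    return TOY_T_ERROR;

in_number:                       /* state: one or more digits seen */
    c = *++p;
    if (c >= '0' && c <= '9') goto in_number;
    *cursor = p;
    return TOY_T_LNUMBER;

in_ident:                        /* state: identifier characters seen */
    c = *++p;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '_') goto in_ident;
    *cursor = p;
    return TOY_T_STRING;
}
```

A table-driven scanner would instead index a state-by-character array on every input byte; here the control flow itself is the automaton.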
The scanner has multiple states (called "conditions" in re2c):
| State | Triggers | Scans for |
|---|---|---|
| INITIAL | Start of file, after ?> | Inline HTML up to the next opening tag |
| ST_IN_SCRIPTING | After <?php or <?= | PHP keywords, operators, identifiers |
| ST_LOOKING_FOR_PROPERTY | After -> or ?-> | Property/method names |
| ST_DOUBLE_QUOTES | " | Interpolated string content |
| ST_HEREDOC | <<<IDENTIFIER | Heredoc body with interpolation |
| ST_NOWDOC | <<<'IDENTIFIER' | Nowdoc body (no interpolation) |
| ST_BACKQUOTE | ` | Shell exec interpolation |
| ST_VAR_OFFSET | $var[ in string | Array offset inside string interpolation |
The scanner macros at the top of the file set up the re2c interface. SCNG() accesses scanner globals (the cursor and limit of the input buffer, the current state, and the state stack). Each rule in the .l file returns a token constant like T_FUNCTION, T_CLASS, or T_VARIABLE, along with the matched text stored as a zval.
String handling is where the lexer gets complex. Heredoc and nowdoc require tracking the delimiter, interpolation within double-quoted strings requires nested scanning states, and PHP's flexible syntax for string interpolation ("$obj->prop", "{$arr['key']}") creates intricate state transitions.
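The nesting can be pictured as a stack of scanner states. The sketch below is a deliberately simplified model with hypothetical names: it knows only two states, treats every quote and every {$ ... } pair as a push/pop, and ignores escapes, heredocs, and braces that belong to ordinary code. It is meant to show why a stack is needed, not how the real scanner is written.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_IN_SCRIPTING, TOY_DOUBLE_QUOTES };

#define TOY_MAX_DEPTH 16

typedef struct {
    enum toy_state current;
    enum toy_state saved[TOY_MAX_DEPTH];  /* states to return to */
    size_t depth;
} toy_scanner;

static void toy_push_state(toy_scanner *s, enum toy_state next) {
    assert(s->depth < TOY_MAX_DEPTH);
    s->saved[s->depth++] = s->current;
    s->current = next;
}

static void toy_pop_state(toy_scanner *s) {
    assert(s->depth > 0);
    s->current = s->saved[--s->depth];
}

/* Feed src through the state machine. Returns the number of characters
 * seen as literal string content (i.e. while in TOY_DOUBLE_QUOTES). */
static size_t toy_scan_states(toy_scanner *s, const char *src) {
    size_t literal_chars = 0;
    for (const char *p = src; *p != '\0'; p++) {
        if (s->current == TOY_IN_SCRIPTING) {
            if (*p == '"') toy_push_state(s, TOY_DOUBLE_QUOTES);
            else if (*p == '}' && s->depth > 0) toy_pop_state(s);
        } else {  /* TOY_DOUBLE_QUOTES */
            if (*p == '"') toy_pop_state(s);
            else if (p[0] == '{' && p[1] == '$') toy_push_state(s, TOY_IN_SCRIPTING);
            else literal_chars++;
        }
    }
    return literal_chars;
}
```

Scanning "a {$x} b" passes through scripting, string, scripting, string, and scripting again; without the saved states, the closing brace would have no way to know it should resume string content.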
Tip: If you want to see the tokens PHP produces for a given source file, use the token_get_all() function, which exposes the lexer directly to userspace. php -w, which prints the source with comments and whitespace stripped, is a quick way to see what the lexer discards.
The Bison Parser and Grammar
The parser is defined in Zend/zend_language_parser.y, a Bison grammar file. The generated parser is an LALR(1) parser — it reads tokens left to right, uses one token of lookahead, and reduces productions bottom-up.
The grammar's precedence declarations establish operator priority — from lowest (,, yield) to highest (unary operators, member access). The %left, %right, and %nonassoc declarations resolve ambiguity in expressions like $a + $b * $c or $a ?? $b ?? $c.
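To see what precedence and associativity actually decide, here is a small evaluator written with precedence climbing. This is a different technique from the LALR(1) tables Bison generates, swapped in because it fits in a few lines; the operator set and binding levels are assumptions chosen for illustration, with ^ standing in for a right-associative operator such as PHP's **.

```c
#include <assert.h>
#include <stddef.h>

/* Binding level and associativity for a few single-character operators. */
static int toy_prec(char op, int *right_assoc) {
    *right_assoc = 0;
    switch (op) {
        case '+': case '-': return 1;
        case '*':           return 2;
        case '^': *right_assoc = 1; return 3;  /* right-associative */
        default:            return 0;          /* not an operator */
    }
}

static long toy_apply(char op, long lhs, long rhs) {
    switch (op) {
        case '+': return lhs + rhs;
        case '-': return lhs - rhs;
        case '*': return lhs * rhs;
        default: {                   /* '^': integer power, rhs >= 0 */
            long r = 1;
            while (rhs-- > 0) r *= lhs;
            return r;
        }
    }
}

/* Parse and evaluate input made of digits and operators, no whitespace. */
static long toy_parse_expr(const char **p, int min_prec) {
    long lhs = 0;
    assert(**p >= '0' && **p <= '9');
    while (**p >= '0' && **p <= '9') lhs = lhs * 10 + (*(*p)++ - '0');

    for (;;) {
        int right_assoc;
        char op = **p;
        int prec = toy_prec(op, &right_assoc);
        if (prec == 0 || prec < min_prec) return lhs;
        (*p)++;
        /* Left-associative: the right side may only hold tighter operators.
         * Right-associative: it may hold the same operator again. */
        long rhs = toy_parse_expr(p, right_assoc ? prec : prec + 1);
        lhs = toy_apply(op, lhs, rhs);
    }
}
```

With these levels, 10-3-2 groups as (10-3)-2 while 2^3^2 groups as 2^(3^2), which is the same kind of decision a %left or %right declaration hands to Bison.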
flowchart TD
TOP["top_statement_list"] --> STMT["statement"]
TOP --> FUNC["function_declaration_statement"]
TOP --> CLASS["class_declaration_statement"]
STMT --> EXPR["expr ';'"]
STMT --> IF["if_statement"]
STMT --> WHILE["while_statement"]
STMT --> FOR["for_statement"]
STMT --> RETURN["T_RETURN expr ';'"]
EXPR --> ASSIGN["variable '=' expr"]
EXPR --> BINARY["expr '+' expr"]
EXPR --> CALL["function_call"]
EXPR --> NEW["T_NEW class_name_reference"]
CLASS --> MEMBERS["class_statement_list"]
MEMBERS --> METHOD["method_declaration"]
MEMBERS --> PROP["property_declaration"]
Each Bison action (the C code in { } after a production rule) constructs AST nodes using the zend_ast_create* family of functions. For example, a binary addition expr '+' expr creates a ZEND_AST_BINARY_OP node with two children. A function declaration creates a ZEND_AST_FUNC_DECL node with children for the parameters, body, and return type, and with the name stored in a dedicated field.
The parser doesn't emit opcodes directly — that was PHP 5's approach. PHP 7+ separates parsing from compilation via the AST, which enables multi-pass analysis and better optimization opportunities.
AST Node Types and Structure
The AST node system is defined in Zend/zend_ast.h. There are approximately 130 ZEND_AST_* node kinds, organized by their structure:
| Category | Child Count | Examples |
|---|---|---|
| Special nodes | n/a (own struct) | ZEND_AST_ZVAL (literal), ZEND_AST_ZNODE (compiler temp) |
| Declaration nodes | Fixed (4+) | ZEND_AST_FUNC_DECL, ZEND_AST_CLASS, ZEND_AST_METHOD |
| List nodes | Variable length | ZEND_AST_STMT_LIST, ZEND_AST_ARG_LIST, ZEND_AST_EXPR_LIST |
| 0-child nodes | 0 | ZEND_AST_MAGIC_CONST, ZEND_AST_TYPE |
| 1-child nodes | 1 | ZEND_AST_VAR, ZEND_AST_RETURN, ZEND_AST_UNARY_PLUS |
| 2-child nodes | 2 | ZEND_AST_ASSIGN, ZEND_AST_BINARY_OP, ZEND_AST_WHILE |
| 3-child nodes | 3 | ZEND_AST_CONDITIONAL (ternary), ZEND_AST_METHOD_CALL |
| 4-child nodes | 4 | ZEND_AST_FOR (init, cond, loop, body), ZEND_AST_FOREACH |
The base zend_ast struct has kind (the node type) and attr (node-specific flags — e.g., the specific binary operator for ZEND_AST_BINARY_OP, or access modifiers for method declarations). There are three concrete struct variants:
- zend_ast_zval: wraps a zval (for literal values)
- zend_ast_decl: for function/class/method declarations, with extra fields for name, flags, doc comment, and child ASTs
- zend_ast_list: for variable-length children (statement lists, argument lists)
The regular zend_ast struct with 1–4 child pointers covers everything else. This hierarchy keeps the AST allocation tight — most nodes are just a header plus a few pointers.
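One way to get "a header plus a few pointers" is to encode the child count in the kind value itself, so the allocator can size each node from its kind alone. The sketch below assumes that scheme with the count stored above bit 8; all toy_ names and numeric values are hypothetical, so check zend_ast.h for the real layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_AST_NUM_CHILDREN_SHIFT 8

/* Hypothetical kinds; the values only illustrate grouping by child count. */
enum {
    TOY_AST_VAR         = 1 << TOY_AST_NUM_CHILDREN_SHIFT,  /* 1 child */
    TOY_AST_RETURN,
    TOY_AST_ASSIGN      = 2 << TOY_AST_NUM_CHILDREN_SHIFT,  /* 2 children */
    TOY_AST_BINARY_OP,
    TOY_AST_CONDITIONAL = 3 << TOY_AST_NUM_CHILDREN_SHIFT   /* 3 children */
};

typedef struct toy_ast {
    uint16_t kind;            /* node type; high byte encodes the child count */
    uint16_t attr;            /* node-specific flags, e.g. which binary operator */
    struct toy_ast *child[];  /* exactly as many pointers as the kind needs */
} toy_ast;

static uint32_t toy_ast_num_children(uint16_t kind) {
    return (uint32_t)kind >> TOY_AST_NUM_CHILDREN_SHIFT;
}

/* Allocate a header plus one pointer per child; children start out NULL. */
static toy_ast *toy_ast_create(uint16_t kind, uint16_t attr) {
    uint32_t n = toy_ast_num_children(kind);
    toy_ast *ast = calloc(1, sizeof(toy_ast) + n * sizeof(toy_ast *));
    assert(ast != NULL);
    ast->kind = kind;
    ast->attr = attr;
    return ast;
}
```

A generic tree walker can then loop over toy_ast_num_children(ast->kind) children without a per-kind switch.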
The Compiler: AST to Opcodes
The compiler in Zend/zend_compile.c is a single-pass recursive tree walker. The central dispatch functions, zend_compile_stmt() for statements and zend_compile_expr() for expressions, switch on the AST node kind and call a specialized compilation function for each.
flowchart TD
ENTRY["zend_compile_top_stmt()"] --> SWITCH{"node->kind?"}
SWITCH -->|"ZEND_AST_FUNC_DECL"| CF["zend_compile_func_decl()"]
SWITCH -->|"ZEND_AST_CLASS"| CC["zend_compile_class_decl()"]
SWITCH -->|"ZEND_AST_STMT_LIST"| CL["iterate children,<br/>call zend_compile_stmt()"]
SWITCH -->|"ZEND_AST_RETURN"| CR["zend_compile_return()"]
SWITCH -->|"other"| CE["zend_compile_stmt()<br/>→ zend_compile_expr()"]
CE --> EMIT["zend_emit_op()<br/>zend_emit_op_tmp()"]
EMIT --> OP["zend_op appended<br/>to current op_array"]
CF --> NEW_OP["New zend_op_array<br/>for function body"]
CC --> NEW_CE["New zend_class_entry<br/>with method op_arrays"]
The compiler maintains two context structs defined in Zend/zend_compile.h:
- zend_file_context: per-file state (current namespace, import tables for use statements)
- zend_oparray_context: per-function state (current op_array, jump targets for loops/try-catch, variable allocation)
Intermediate results during expression compilation use znode (defined in Zend/zend_compile.h), which tracks whether a result is a constant, a temporary variable, a compiled variable, or unused. The znode is the bridge between AST compilation and opcode emission — each zend_compile_expr_* function populates a result znode, and the parent expression uses it as an operand.
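The result-passing idea is easier to see in miniature. The following sketch compiles a tiny expression tree of integer literals and binary operators into a flat instruction array. All types and names are hypothetical simplifications, opcodes are just the operator characters, and unlike the real compiler it performs no constant folding.

```c
#include <assert.h>
#include <stddef.h>

enum toy_operand_type { TOY_IS_CONST = 1, TOY_IS_TMP_VAR = 2 };

/* Where a compiled expression's value lives: a literal or a temporary slot. */
typedef struct { enum toy_operand_type type; int value; } toy_znode;

typedef struct toy_expr {
    char op;                         /* 0 for a literal, else '+' or '*' */
    int literal;
    const struct toy_expr *lhs, *rhs;
} toy_expr;

typedef struct { char opcode; toy_znode op1, op2, result; } toy_op;

#define TOY_MAX_OPS 32
typedef struct { toy_op ops[TOY_MAX_OPS]; size_t last; int next_tmp; } toy_op_array;

/* Post-order walk: compile both operands, then emit one op that uses
 * their znodes and writes its own result into a fresh temporary. */
static void toy_compile_expr(toy_op_array *oa, toy_znode *result, const toy_expr *e) {
    if (e->op == 0) {                /* literal: no instruction needed */
        result->type = TOY_IS_CONST;
        result->value = e->literal;
        return;
    }
    toy_znode left, right;
    toy_compile_expr(oa, &left, e->lhs);
    toy_compile_expr(oa, &right, e->rhs);

    assert(oa->last < TOY_MAX_OPS);
    toy_op *opline = &oa->ops[oa->last++];
    opline->opcode = e->op;
    opline->op1 = left;
    opline->op2 = right;
    result->type = TOY_IS_TMP_VAR;
    result->value = oa->next_tmp++;
    opline->result = *result;
}
```

Compiling (1 + 2) * 3 this way yields two instructions: an add whose result goes to temporary 0, then a multiply that takes temporary 0 and the constant 3 as operands.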
The zend_op Instruction Format
Each compiled instruction is a zend_op, defined in Zend/zend_compile.h:
flowchart LR
subgraph zend_op["zend_op struct"]
direction TB
HANDLER["handler: void*<br/>Function pointer to VM handler"]
OP1["op1: znode_op<br/>First operand (32-bit)"]
OP2["op2: znode_op<br/>Second operand (32-bit)"]
RESULT["result: znode_op<br/>Result location (32-bit)"]
EXT["extended_value: uint32_t<br/>Extra data (operator type, flags)"]
LINE["lineno: uint32_t"]
OPCODE["opcode: uint8_t"]
TYPES["op1_type, op2_type, result_type: uint8_t"]
end
Each operand has a type tag that determines how the 32-bit znode_op value is interpreted:
| Operand Type | Value | Meaning |
|---|---|---|
| IS_UNUSED | 0 | Operand not used (or carries a jump target) |
| IS_CONST | 1 (1<<0) | Index into the literals table (compile-time constants) |
| IS_TMP_VAR | 2 (1<<1) | Temporary variable slot (expression intermediates) |
| IS_VAR | 4 (1<<2) | Variable slot (can hold references) |
| IS_CV | 8 (1<<3) | Compiled variable: a named local variable ($foo) |
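As a rough model of how the type tag steers operand access, the sketch below resolves an operand against three hypothetical storage areas. The constant values mirror the table above; the toy_frame layout and the use of plain long values instead of zvals are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_IS_UNUSED  0
#define TOY_IS_CONST   (1 << 0)
#define TOY_IS_TMP_VAR (1 << 1)
#define TOY_IS_VAR     (1 << 2)
#define TOY_IS_CV      (1 << 3)

typedef struct {
    const long *literals;  /* compile-time constants */
    long *temps;           /* TMP_VAR and VAR slots */
    long *cvs;             /* named locals */
} toy_frame;

/* Interpret the same 32-bit slot number differently depending on the tag. */
static long toy_fetch_operand(const toy_frame *f, uint8_t type, uint32_t slot) {
    switch (type) {
        case TOY_IS_CONST:   return f->literals[slot];
        case TOY_IS_TMP_VAR:
        case TOY_IS_VAR:     return f->temps[slot];
        case TOY_IS_CV:      return f->cvs[slot];
        default:
            assert(type == TOY_IS_UNUSED);
            return 0;
    }
}

/* One-bit-per-type encoding lets a single mask test cover several types. */
static int toy_is_temporary(uint8_t type) {
    return (type & (TOY_IS_TMP_VAR | TOY_IS_VAR)) != 0;
}
```

The mask test in toy_is_temporary is the practical payoff of giving each operand type its own bit rather than numbering them 0 to 4.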
The extended_value field is overloaded per-opcode. For ZEND_ASSIGN_OP, it holds the specific operator (ZEND_ADD, ZEND_SUB, etc.). For the ZEND_FETCH_OBJ_* opcodes, it holds cache slot offsets. For ZEND_CAST, it holds the target type.
The opcode constants are defined in Zend/zend_vm_opcodes.h. There are roughly 200 opcodes covering arithmetic, control flow, function calls, object operations, array operations, and more. Each zend_op has a handler field: a pointer set during compilation that refers to the correct type-specialized handler in the VM. We'll explore the handler system in Article 4.
zend_op_array and zend_function
The output of compilation is a zend_op_array — the compiled representation of a function, method, or top-level script. It's defined in Zend/zend_compile.h and contains:
- opcodes: the flat array of zend_op instructions
- literals: the constants table (zvals for literal values in the source)
- vars: the compiled variable names ($param1, $localVar)
- arg_info: parameter type/default information
- try_catch_array: exception handling ranges
- static_variables: static $var initializers
- filename, line_start, line_end: source location
- num_args, required_num_args: argument count info
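The vars table is what turns a variable name into a slot number at compile time. A minimal sketch of that lookup, assuming a fixed-size table and a linear search (both simplifications, with hypothetical names throughout):

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_VARS 16

typedef struct {
    const char *vars[TOY_MAX_VARS];  /* names of compiled variables */
    int last_var;                    /* number of slots in use */
} toy_var_table;

/* Return the slot for name, allocating a new one on first use. */
static int toy_lookup_cv(toy_var_table *t, const char *name) {
    for (int i = 0; i < t->last_var; i++) {
        if (strcmp(t->vars[i], name) == 0) return i;
    }
    assert(t->last_var < TOY_MAX_VARS);
    t->vars[t->last_var] = name;
    return t->last_var++;
}
```

Every later mention of the same name compiles to the same slot number, so at runtime an IS_CV operand is an array index rather than a hash lookup by name.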
The zend_function union (in Zend/zend_compile.h) unifies user functions and internal (C) functions:
classDiagram
class zend_function {
<<union>>
+uint8_t type
+common: zend_function_common
+op_array: zend_op_array
+internal_function: zend_internal_function
}
class zend_function_common {
+uint8_t type
+uint32_t fn_flags
+zend_string *function_name
+zend_class_entry *scope
+zend_function *prototype
+zend_arg_info *arg_info
+uint32_t num_args
+uint32_t required_num_args
}
class zend_op_array {
+common fields...
+zend_op *opcodes
+zval *literals
+...compiled PHP function
}
class zend_internal_function {
+common fields...
+handler: zif_handler
+...C function
}
zend_function --> zend_function_common
zend_function --> zend_op_array
zend_function --> zend_internal_function
The common fields are laid out identically at the beginning of both zend_op_array and zend_internal_function. This means code that only needs the function name, flags, or argument info can access them through func->common.* without knowing whether it's a user or internal function.
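The layout trick is ordinary C: a union of structs that share an initial sequence of fields. Here is a reduced sketch with three shared fields; the toy_ types are hypothetical and far smaller than the real ones, but the access pattern through common is the same idea.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INTERNAL_FUNCTION 1
#define TOY_USER_FUNCTION     2

/* Both variants start with the same three fields, in the same order. */
typedef struct {
    uint8_t type;
    const char *function_name;
    uint32_t num_args;
    const int *opcodes;          /* user functions carry compiled code */
} toy_op_array;

typedef struct {
    uint8_t type;
    const char *function_name;
    uint32_t num_args;
    long (*handler)(long);       /* internal functions carry a C handler */
} toy_internal_function;

typedef union {
    uint8_t type;                /* always valid: first byte of every variant */
    struct {
        uint8_t type;
        const char *function_name;
        uint32_t num_args;
    } common;
    toy_op_array op_array;
    toy_internal_function internal_function;
} toy_function;

/* Works for either variant because it touches only the shared prefix. */
static const char *toy_function_name(const toy_function *f) {
    return f->common.function_name;
}

static long toy_double(long x) { return 2 * x; }  /* sample "C function" */

/* Variant-specific fields are only read after checking the type tag. */
static long toy_call_internal(const toy_function *f, long arg) {
    assert(f->type == TOY_INTERNAL_FUNCTION);
    return f->internal_function.handler(arg);
}
```

The cost of this design is discipline: any field added to the shared prefix has to be added to every variant at the same position.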
Class, Method, and Property Flags
The compilation system uses an extensive bitfield for class and member modifiers, defined in Zend/zend_compile.h. The ZEND_ACC_* constants serve multiple roles:
| Flag | Bit | Hex | Applies to |
|---|---|---|---|
| ZEND_ACC_PUBLIC | 1 << 0 | 0x00000001 | Methods, properties, constants |
| ZEND_ACC_PROTECTED | 1 << 1 | 0x00000002 | Methods, properties, constants |
| ZEND_ACC_PRIVATE | 1 << 2 | 0x00000004 | Methods, properties, constants |
| ZEND_ACC_STATIC | 1 << 4 | 0x00000010 | Methods, properties |
| ZEND_ACC_FINAL | 1 << 5 | 0x00000020 | Classes, methods, properties, constants |
| ZEND_ACC_ABSTRACT | 1 << 6 | 0x00000040 | Classes, methods, properties |
| ZEND_ACC_READONLY | 1 << 7 | 0x00000080 | Properties |
| ZEND_ACC_INTERFACE | 1 << 0 | 0x00000001 | Classes (shares a bit with PUBLIC; context-dependent) |
| ZEND_ACC_TRAIT | 1 << 1 | 0x00000002 | Classes (shares a bit with PROTECTED; context-dependent) |
| ZEND_ACC_ENUM | 1 << 28 | 0x10000000 | Classes |
Notice that some bit positions are shared between contexts. ZEND_ACC_INTERFACE and ZEND_ACC_PUBLIC use the same bit (1 << 0), and ZEND_ACC_TRAIT shares its bit with ZEND_ACC_PROTECTED (1 << 1). This works because a given flags word describes either a class entry or a function/property, never both, so the interpretation follows from where the flags are stored. The compiler validates that combinations are legal (you can't have abstract final, for example) and sets these flags during class/method declaration compilation.
Tip: When debugging flag values in GDB, print fn_flags or ce->ce_flags in hex and cross-reference with the ZEND_ACC_* table. For example, 0x00000051 means PUBLIC | STATIC | ABSTRACT (0x01 + 0x10 + 0x40). Bit positions may vary between versions, so always check the header.
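The decoding in the tip can be automated with a small lookup table. This sketch uses the member-context bit positions from the table above under hypothetical TOY_ names; it is a debugging aid under those assumptions, not code from php-src, and it would misread class flags such as INTERFACE because those reuse the same bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_ACC_FINAL    (1u << 5)
#define TOY_ACC_ABSTRACT (1u << 6)

/* Member-context flags only (methods, properties, constants). */
static const struct { uint32_t bit; const char *name; } toy_member_flags[] = {
    { 1u << 0, "PUBLIC" },  { 1u << 1, "PROTECTED" }, { 1u << 2, "PRIVATE" },
    { 1u << 4, "STATIC" },  { TOY_ACC_FINAL, "FINAL" },
    { TOY_ACC_ABSTRACT, "ABSTRACT" }, { 1u << 7, "READONLY" },
};

/* Write a "A | B | C" description of flags into buf (truncating if needed). */
static void toy_describe_member_flags(uint32_t flags, char *buf, size_t len) {
    assert(len > 0);
    buf[0] = '\0';
    size_t count = sizeof toy_member_flags / sizeof toy_member_flags[0];
    for (size_t i = 0; i < count; i++) {
        if ((flags & toy_member_flags[i].bit) == 0) continue;
        size_t used = strlen(buf);
        snprintf(buf + used, len - used, "%s%s",
                 used ? " | " : "", toy_member_flags[i].name);
    }
}

/* The kind of legality check the compiler performs: abstract and final
 * cannot be combined. */
static int toy_flags_conflict(uint32_t flags) {
    return (flags & TOY_ACC_ABSTRACT) != 0 && (flags & TOY_ACC_FINAL) != 0;
}
```

Calling toy_describe_member_flags(0x51, buf, sizeof buf) fills buf with PUBLIC | STATIC | ABSTRACT, matching the manual decoding above.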
What's Next
We've traced the journey from source text to compiled opcodes. In Article 4, we'll enter the virtual machine itself — the component that actually executes those opcodes. We'll explore the unique template-based code generation system that produces 123,000 lines of type-specialized handlers, the five dispatch modes (including the new TAILCALL mode), global register pinning for performance, and the SSA-based optimizer that transforms opcodes before they execute. The zend_op format and zend_op_array structure from this article are the VM's direct inputs.