Inside the zval: PHP's Type System and Memory Model
Prerequisites
- Article 1: Architecture and lifecycle context
- C unions, bitfields, and pointer arithmetic
- Basic understanding of reference counting and garbage collection
In Part 1, we mapped the four architectural layers of php-src and traced the request lifecycle. Now we descend into the Zend Engine's foundation: the data structures that represent every PHP value at runtime. Whether you're reading the compiler, the VM, an extension, or the garbage collector, you'll encounter zval, zend_string, HashTable, and zend_object on nearly every line. Understanding their memory layout, the reference counting protocol, and the allocator that manages them is prerequisite knowledge for the rest of this series.
The key fact driving PHP's type system design is that every PHP value is represented by a zval of exactly 16 bytes on 64-bit platforms. This size constraint shapes everything else.
The zval: PHP's Universal Value
The _zval_struct is defined in Zend/zend_types.h. It has three components packed into 16 bytes:
flowchart LR
subgraph zval["zval (16 bytes)"]
direction TB
subgraph value["zend_value (8 bytes)"]
V["lval: zend_long<br/>dval: double<br/>str: *zend_string<br/>arr: *zend_array<br/>obj: *zend_object<br/>ref: *zend_reference<br/>..."]
end
subgraph u1["u1 (4 bytes)"]
U1["type_info:<br/> type (uint8)<br/> type_flags (uint8)<br/> extra (uint16)"]
end
subgraph u2["u2 (4 bytes)"]
U2["next: uint32 (hash chain)<br/>cache_slot: uint32<br/>lineno: uint32<br/>num_args: uint32<br/>..."]
end
end
The zend_value union (8 bytes) holds the actual value. For simple types like IS_LONG and IS_DOUBLE, the value is stored inline — an integer or floating-point number sits right in the zval, no heap allocation needed. For complex types (IS_STRING, IS_ARRAY, IS_OBJECT), the union holds a pointer to a heap-allocated structure.
The u1 union (4 bytes) contains type_info, which packs the type tag, type flags, and an extra field. The type byte is one of the IS_* constants. The type_flags byte encodes whether the value is reference-counted (IS_TYPE_REFCOUNTED) and whether it can participate in garbage collection cycles (IS_TYPE_COLLECTABLE). Simple types like IS_LONG, IS_DOUBLE, IS_NULL, IS_TRUE, and IS_FALSE have no flags set — they're never refcounted because their value is inline.
The u2 union (4 bytes) is the clever part. The 8-byte zend_value forces 8-byte alignment on the struct, so the 12 bytes of value and type information would be padded to 16 anyway. The engine reuses those 4 bytes as "piggyback" storage for different contexts. When a zval lives in a HashTable bucket, u2.next holds the index of the next bucket in the collision chain. When it's a literal in an op_array, u2.cache_slot holds the runtime cache offset. When it's the This slot of a call frame, u2.num_args stores the number of arguments passed.
Tip: The u2 reuse pattern is a recurring theme in php-src: rather than wasting alignment padding, the engine stores context-dependent metadata in the slack bytes. You'll see this in zend_string, Bucket, and zend_object as well.
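To make the size argument concrete, here is a minimal sketch of the layout described above. The `model_*` names are hypothetical stand-ins, not the real php-src definitions, and the sizes assume a typical 64-bit platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for zend_value: 8 bytes, one member active at a time. */
typedef union {
    int64_t lval;   /* integer payload, stored inline */
    double  dval;   /* float payload, stored inline */
    void   *ptr;    /* pointer to a heap structure (string, array, object) */
} model_value;

/* Simplified stand-in for zval: value + type word + context-dependent word. */
typedef struct {
    model_value value;
    union {
        uint32_t type_info;
        struct {
            uint8_t  type;        /* IS_* tag */
            uint8_t  type_flags;  /* refcounted / collectable bits */
            uint16_t extra;
        } v;
    } u1;
    union {
        uint32_t next;        /* hash collision chain */
        uint32_t cache_slot;  /* runtime cache offset */
        uint32_t num_args;    /* argument count */
    } u2;
} model_zval;

/* The same struct without u2: the compiler pads it back up to 16 bytes. */
typedef struct {
    model_value value;
    uint32_t    type_info;
} model_zval_no_u2;

size_t model_zval_size(void)       { return sizeof(model_zval); }
size_t model_zval_no_u2_size(void) { return sizeof(model_zval_no_u2); }

/* u2 starts right after the 8-byte value and the 4-byte type word. */
size_t model_u2_offset(void) {
    size_t off = offsetof(model_zval, u2);
    assert(off + sizeof(uint32_t) <= sizeof(model_zval));
    return off;
}
```

Dropping u2 saves nothing under these assumptions, which is why the engine treats those four bytes as free storage.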
PHP Type Tags and the Type System
PHP's type system is represented by a set of integer constants in Zend/zend_types.h:
| Type Constant | Value | Refcounted? | Collectable? | Value Location |
|---|---|---|---|---|
| IS_UNDEF | 0 | No | No | None (uninitialized) |
| IS_NULL | 1 | No | No | None (no value needed) |
| IS_FALSE | 2 | No | No | None (type IS the value) |
| IS_TRUE | 3 | No | No | None (type IS the value) |
| IS_LONG | 4 | No | No | Inline in zend_value.lval |
| IS_DOUBLE | 5 | No | No | Inline in zend_value.dval |
| IS_STRING | 6 | Yes | No | Pointer to zend_string |
| IS_ARRAY | 7 | Yes | Yes | Pointer to zend_array |
| IS_OBJECT | 8 | Yes | Yes | Pointer to zend_object |
| IS_RESOURCE | 9 | Yes | No | Pointer to zend_resource |
| IS_REFERENCE | 10 | Yes | No | Pointer to zend_reference |
Notice that IS_FALSE and IS_TRUE are separate types rather than a single boolean type with a value. This eliminates a branch: testing a boolean's truth value is a single comparison against the type tag, not a type check followed by a value check.
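The tag-plus-flags encoding can be sketched in a few lines. The tag values below come from the table; the flag bit positions and the helper names are assumptions made for illustration, so check Zend/zend_types.h for the real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tag values copied from the table above. */
enum { T_UNDEF = 0, T_NULL = 1, T_FALSE = 2, T_TRUE = 3,
       T_LONG = 4, T_DOUBLE = 5, T_STRING = 6 };

/* Hypothetical flag bits for this sketch. */
enum { F_REFCOUNTED = 1u << 0, F_COLLECTABLE = 1u << 1 };

/* Pack the tag into the low byte and the flags into the next byte. */
uint32_t make_type_info(uint8_t type, uint8_t flags) {
    assert(type <= T_STRING);   /* only the tags modelled in this sketch */
    return (uint32_t)type | ((uint32_t)flags << 8);
}

uint8_t type_of(uint32_t type_info)  { return (uint8_t)(type_info & 0xff); }
uint8_t flags_of(uint32_t type_info) { return (uint8_t)((type_info >> 8) & 0xff); }

/* A boolean test needs only the tag: there is no payload to load. */
bool is_true(uint32_t type_info) { return type_of(type_info) == T_TRUE; }

/* Refcount handling is gated by one flag test, not a switch over types. */
bool is_refcounted(uint32_t type_info) {
    return (flags_of(type_info) & F_REFCOUNTED) != 0;
}
```

With this shape, code that copies or destroys a value can ask "is this refcounted?" without knowing which concrete type it holds.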
The zend_type struct (used for typed properties and function parameters) is a separate concern. It encodes union types, intersection types, and nullable types as a bitmask of type tags combined with pointers to class entries. We'll see it again in Article 3 when we cover the compiler.
Reference Counting and Copy-on-Write
All heap-allocated types share a common header: zend_refcounted_h. It contains a refcount (32-bit) and a type_info field that encodes the type and GC flags.
When you assign a variable in PHP ($b = $a), the engine doesn't copy the underlying data. It increments the refcount on the pointed-to structure. Both $a and $b now point to the same zend_string, zend_array, or zend_object.
flowchart TD
A["$a = 'hello'"] --> |"refcount=1"| STR["zend_string<br/>'hello'"]
B["$b = $a"] --> |"refcount=2"| STR
C["$b .= ' world'"] --> |"refcount > 1?<br/>Yes → separate!"| SEP{{"Copy on Write"}}
SEP --> STR2["zend_string<br/>'hello world'<br/>refcount=1"]
SEP --> STR3["zend_string<br/>'hello'<br/>refcount=1"]
The copy-on-write (COW) path is critical for performance. When the engine needs to mutate a string or array (e.g., appending to a string, pushing to an array), it first checks the refcount. If refcount > 1, it separates: it makes a copy, decrements the original's refcount, and mutates the copy. If refcount == 1, it mutates in place. Objects are the exception: they have handle semantics, so a write through $b is visible through $a and no separation takes place.
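The separation check can be sketched on a toy refcounted string. `rc_string` and its helpers are hypothetical names for this example and use plain malloc; they are not the zend_string API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted string; a stand-in for zend_string, not its real layout. */
typedef struct {
    uint32_t refcount;
    size_t   len;
    char     val[];   /* flexible array member, NUL-terminated */
} rc_string;

rc_string *rc_string_new(const char *src, size_t len) {
    rc_string *s = malloc(sizeof(rc_string) + len + 1);
    assert(s != NULL);
    s->refcount = 1;
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}

/* "$b = $a": share the buffer and bump the count. */
rc_string *rc_string_share(rc_string *s) { s->refcount++; return s; }

void rc_string_release(rc_string *s) {
    if (--s->refcount == 0) free(s);
}

/* "$b .= suffix": mutate in place when unshared, otherwise separate first. */
rc_string *rc_string_append(rc_string *s, const char *suffix) {
    size_t extra = strlen(suffix);
    rc_string *out;
    if (s->refcount > 1) {
        out = rc_string_new(s->val, s->len);   /* separation: private copy */
        s->refcount--;                          /* original keeps its other owners */
    } else {
        out = s;                                /* sole owner: reuse the buffer */
    }
    out = realloc(out, sizeof(rc_string) + out->len + extra + 1);
    assert(out != NULL);
    memcpy(out->val + out->len, suffix, extra + 1);   /* copies the NUL too */
    out->len += extra;
    return out;
}
```

After sharing and appending, the two handles point at different buffers, each with a refcount of 1, which matches the diagram above.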
PHP references ($b = &$a) use a different mechanism: a zend_reference wrapper. The zend_reference struct contains its own refcount header plus an inner zval. Both $a and $b become IS_REFERENCE zvals pointing to the same zend_reference, which itself holds the real value. This adds one level of indirection but preserves the COW semantics for non-reference values.
Tip: This is why PHP references (&$var) often hurt performance rather than help. They force a zend_reference indirection and prevent copy-on-write optimizations. In modern PHP, you almost never need them.
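The extra hop that a reference introduces can be modelled with a small box type. Everything here (`slot`, `ref_box`, `bind_reference`) is a hypothetical simplification restricted to integer values, and it assumes the target slot holds nothing that needs releasing first.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { SLOT_LONG, SLOT_REFERENCE } slot_type;

typedef struct ref_box ref_box;

/* A variable slot: either an inline integer or a pointer to a shared box. */
typedef struct {
    slot_type type;
    union { int64_t lval; ref_box *ref; } value;
} slot;

/* Stand-in for zend_reference: its own refcount plus the wrapped value. */
struct ref_box {
    uint32_t refcount;
    int64_t  inner;
};

/* "$b = &$a": wrap a's value in a box if needed, then point both slots at it. */
void bind_reference(slot *a, slot *b) {
    if (a->type != SLOT_REFERENCE) {
        ref_box *box = malloc(sizeof *box);
        assert(box != NULL);
        box->refcount = 1;
        box->inner = a->value.lval;
        a->type = SLOT_REFERENCE;
        a->value.ref = box;
    }
    a->value.ref->refcount++;
    b->type = SLOT_REFERENCE;
    b->value.ref = a->value.ref;
}

/* Reads and writes pay one extra pointer hop when the slot is a reference. */
int64_t slot_get(const slot *s) {
    return s->type == SLOT_REFERENCE ? s->value.ref->inner : s->value.lval;
}

void slot_set(slot *s, int64_t v) {
    if (s->type == SLOT_REFERENCE) s->value.ref->inner = v;
    else s->value.lval = v;
}
```

A write through either slot lands in the shared box, so the other slot observes it immediately.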
zend_string: Interned and Persistent Strings
The zend_string struct (defined in Zend/zend_types.h) is more than a refcounted character buffer:
flowchart LR
subgraph zend_string
direction TB
RC["zend_refcounted_h (8 bytes)<br/>refcount + type_info + GC flags"]
HASH["h: zend_ulong (8 bytes)<br/>Cached hash value"]
LEN["len: size_t (8 bytes)<br/>String length"]
VAL["val[1]: char (flexible array)<br/>Actual string data, NUL-terminated"]
end
The cached hash value (h) is computed once and stored. Since strings are used as hash keys constantly (variable names, function names, array keys), this avoids recomputing the hash on every lookup. The ZSTR_H(), ZSTR_VAL(), and ZSTR_LEN() macros in Zend/zend_string.h provide access.
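A rough model of the cached hash follows. The struct mirrors the diagram above, but the names, the "0 means not computed" convention, and the DJB-style hash function are assumptions chosen for this sketch rather than a description of the engine's exact code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the layout above: header, cached hash, length, bytes. */
typedef struct {
    uint32_t refcount;
    uint32_t type_info;
    uint64_t h;        /* 0 means "not computed yet" in this sketch */
    size_t   len;
    char     val[];    /* one allocation holds header and characters */
} model_string;

static unsigned hash_computations = 0;  /* instrumentation for the demo only */

model_string *model_string_init(const char *src) {
    size_t len = strlen(src);
    model_string *s = malloc(sizeof(model_string) + len + 1);
    assert(s != NULL);
    s->refcount = 1;
    s->type_info = 0;
    s->h = 0;
    s->len = len;
    memcpy(s->val, src, len + 1);
    return s;
}

/* DJB-style "times 33" hash, used here as a stand-in hash function. */
uint64_t compute_hash(const char *p, size_t len) {
    uint64_t h = 5381;
    for (size_t i = 0; i < len; i++) h = h * 33 + (unsigned char)p[i];
    hash_computations++;
    return h | (UINT64_C(1) << 63);   /* force non-zero so 0 can mean "unset" */
}

/* Compute the hash on first use, then serve it from the header. */
uint64_t model_string_hash(model_string *s) {
    if (s->h == 0) s->h = compute_hash(s->val, s->len);
    return s->h;
}
```

Asking for the hash twice walks the bytes only once; every later lookup with the same string is a single field read.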
Interned strings are a special category. Common strings like function names, class names, and string literals from source code are "interned": stored once in a global table and never freed during the request. They are marked with the IS_STR_INTERNED flag, and zvals that point to them do not set IS_TYPE_REFCOUNTED, so the refcount macros skip increment/decrement operations entirely. OPcache takes this further by interning strings into shared memory so they're shared across all FPM worker processes.
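One way to make refcount operations a no-op for interned strings is a flag test in the add/release helpers. The sketch below uses a hypothetical header and flag bit to show the idea; it is not the engine's macro set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_INTERNED (1u << 0)   /* hypothetical flag bit for this sketch */

typedef struct {
    uint32_t refcount;
    uint32_t flags;
} str_header;

bool is_interned(const str_header *s) { return (s->flags & FLAG_INTERNED) != 0; }

/* Interned strings are shared by everyone, so counting owners is pointless. */
void str_addref(str_header *s) {
    if (!is_interned(s)) s->refcount++;
}

/* Returns true when the caller should free the string. */
bool str_delref(str_header *s) {
    if (is_interned(s)) return false;
    assert(s->refcount > 0);
    return --s->refcount == 0;
}
```

Because the interned case never writes to the header, such strings can also live in read-mostly shared memory without contention on the counter.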
Persistent strings are allocated with the system allocator (malloc) instead of the per-request allocator (emalloc). They survive across requests and are used for data that lives at the module level — extension names, INI entry names, and class names registered during MINIT.
HashTable: The Dual-Mode Array
PHP's array type is one of the most versatile data structures in any programming language — it's simultaneously a list, a dictionary, an ordered map, a stack, a queue, and a set. The implementation needs to be fast for all these use cases.
The zend_array (aliased as HashTable) in Zend/zend_types.h operates in two distinct modes:
Packed arrays are used when keys are sequential integers starting from 0 (the common $arr[] = value pattern). The data is stored as a flat zval array via arPacked, with no hash computation, no buckets, and no collision chains. A lookup is a bounds check followed by an index into the array, much like a C array access.
Hash arrays are used when keys are strings or non-sequential integers. The data is stored as an array of Bucket structs (each containing a key, hash, and zval), with a hash index table for O(1) lookups.
flowchart TD
subgraph packed["Packed Array Mode"]
direction LR
P_HT["HashTable header<br/>nTableSize, nNumUsed, nNumOfElements"]
P_DATA["arPacked: zval[]<br/>[0]: zval<br/>[1]: zval<br/>[2]: zval<br/>..."]
P_HT --> P_DATA
end
subgraph hash["Hash Array Mode"]
direction LR
H_HT["HashTable header<br/>nTableSize, nNumUsed, nNumOfElements"]
H_IDX["Hash Index Table<br/>(grows BACKWARDS from arData)<br/>[-1]: idx<br/>[-2]: idx<br/>..."]
H_DATA["arData: Bucket[]<br/>[0]: {h, key, val}<br/>[1]: {h, key, val}<br/>[2]: {h, key, val}<br/>..."]
H_HT --> H_DATA
H_IDX -.-> H_DATA
end
The hash array's memory layout is unusual. The hash index table and the bucket array share a single allocation. The bucket array (arData) grows forward from a base pointer, while the hash index table grows backward from the same pointer. Viewed as an array of uint32_t, the memory at negative offsets from arData ([-1], [-2], and so on) holds hash index slots, while arData[0], arData[1], etc., are buckets. A single emalloc call allocates both, and a single efree releases both.
Collision resolution uses chaining via Bucket.val.u2.next, that "free" u2 field in the zval. Each bucket's u2.next holds the index of the next bucket in the chain, or HT_INVALID_IDX (the uint32_t value of -1) to terminate the chain.
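The shared allocation and the index-based chains can be sketched as follows. This is a simplified model with integer keys only, a fixed capacity, and no duplicate-key handling; `table`, `bucket`, and the slot arithmetic are assumptions for illustration and differ in detail from zend_hash.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INVALID_IDX UINT32_MAX   /* plays the role of HT_INVALID_IDX */

typedef struct {
    uint64_t h;      /* key hash (integer keys only in this sketch) */
    int64_t  val;
    uint32_t next;   /* index of the next bucket in the collision chain */
} bucket;

typedef struct {
    uint32_t size;   /* hash slots and bucket capacity; power of two */
    uint32_t used;   /* buckets filled so far */
    bucket  *data;   /* points just past the hash slots, at bucket 0 */
} table;

/* Hash slots live at negative offsets from data, viewed as uint32_t[]. */
uint32_t *slot_for(const table *t, uint64_t h) {
    int32_t idx = -1 - (int32_t)(h & (t->size - 1));   /* in [-size, -1] */
    return &((uint32_t *)t->data)[idx];
}

table table_new(uint32_t size) {
    assert(size >= 2 && (size & (size - 1)) == 0);
    size_t slots_bytes = (size_t)size * sizeof(uint32_t);
    char *block = malloc(slots_bytes + (size_t)size * sizeof(bucket));
    assert(block != NULL);
    memset(block, 0xff, slots_bytes);   /* every slot = INVALID_IDX */
    table t = { size, 0, (bucket *)(block + slots_bytes) };
    return t;
}

/* One free releases slots and buckets: step back to the block start. */
void table_free(table *t) {
    free((char *)t->data - (size_t)t->size * sizeof(uint32_t));
}

void table_insert(table *t, uint64_t h, int64_t val) {
    assert(t->used < t->size);
    uint32_t *slot = slot_for(t, h);
    bucket *b = &t->data[t->used];
    b->h = h;
    b->val = val;
    b->next = *slot;      /* new bucket becomes the head of the chain */
    *slot = t->used++;
}

int64_t *table_find(const table *t, uint64_t h) {
    for (uint32_t i = *slot_for(t, h); i != INVALID_IDX; i = t->data[i].next) {
        if (t->data[i].h == h) return &t->data[i].val;
    }
    return NULL;
}
```

Storing chain links as 32-bit indexes rather than pointers keeps buckets small and means the whole block can be moved or copied without fixing up any links.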
The HashTable flags (defined in Zend/zend_hash.h) track the array's state: HASH_FLAG_PACKED for packed mode, HASH_FLAG_UNINITIALIZED for lazily-allocated tables, HASH_FLAG_STATIC_KEYS when all keys are integers or interned strings.
zend_object and Class Instances
When you write new Foo() in PHP, the engine allocates a zend_object whose layout is determined at compile time based on the class declaration:
classDiagram
class zend_object {
+zend_refcounted_h gc
+uint32_t handle
+zend_class_entry *ce
+zend_object_handlers *handlers
+HashTable *properties
+zval properties_table[]
}
class zend_class_entry {
+char type
+zend_string *name
+HashTable function_table
+HashTable properties_info
+int default_properties_count
+zval *default_properties_table
}
class zend_object_handlers {
+read_property()
+write_property()
+get_method()
+call_method()
+clone_obj()
+compare()
+cast_object()
+...26 more handlers
}
zend_object --> zend_class_entry : ce
zend_object --> zend_object_handlers : handlers
The properties_table is a flexible array member at the end of the struct. The number of slots is determined by ce->default_properties_count. Declared properties get fixed slots (compiled to numeric offsets), so $obj->name is a direct indexed access, not a hash lookup. Dynamic properties (added at runtime) spill into a separate HashTable.
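Slot-based property access can be sketched with a flexible array member sized from the class. `model_class` and `model_object` are hypothetical, and the slots hold plain integers instead of zvals to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical class descriptor: only the slot count matters here. */
typedef struct {
    const char *name;
    uint32_t    declared_slots;
} model_class;

/* Hypothetical object: fixed header followed by one slot per declared property. */
typedef struct {
    uint32_t           refcount;
    const model_class *ce;
    int64_t            slots[];   /* flexible array member sized by the class */
} model_object;

model_object *model_object_new(const model_class *ce) {
    model_object *obj = calloc(1, sizeof(model_object)
                                  + ce->declared_slots * sizeof(int64_t));
    assert(obj != NULL);
    obj->refcount = 1;
    obj->ce = ce;
    return obj;
}

/* The property name was resolved to slot_index ahead of time. */
int64_t *model_object_slot(model_object *obj, uint32_t slot_index) {
    assert(slot_index < obj->ce->declared_slots);
    return &obj->slots[slot_index];
}
```

Reading a declared property is then pointer arithmetic on the object itself: no key hashing and no second allocation to follow.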
The handlers vtable (zend_object_handlers in Zend/zend_object_handlers.h) is how PHP's object system supports operator overloading, property access interception, and custom comparison. Extensions can replace individual handlers to customize object behavior — this is how ArrayObject, SplFixedArray, and PDO statement objects work.
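The copy-then-override pattern behind custom handlers looks roughly like this. The two-entry table and all `model_*` names are assumptions for the sketch; the real handler table is much larger and its signatures differ.

```c
#include <assert.h>
#include <stdint.h>

typedef struct model_obj model_obj;

/* A handler table: one function pointer per overridable operation. */
typedef struct {
    int64_t (*read_property)(model_obj *obj, uint32_t slot);
    int     (*compare)(model_obj *a, model_obj *b);
} model_handlers;

struct model_obj {
    const model_handlers *handlers;
    int64_t               slots[2];
};

int64_t std_read_property(model_obj *obj, uint32_t slot) { return obj->slots[slot]; }

int std_compare(model_obj *a, model_obj *b) {
    return (a->slots[0] > b->slots[0]) - (a->slots[0] < b->slots[0]);
}

/* An "extension" handler that intercepts reads and doubles the stored value. */
int64_t doubling_read_property(model_obj *obj, uint32_t slot) {
    return 2 * obj->slots[slot];
}

static const model_handlers std_handlers = { std_read_property, std_compare };

/* Copy the defaults, then replace only the handler being customized. */
model_handlers make_doubling_handlers(void) {
    model_handlers h = std_handlers;
    h.read_property = doubling_read_property;
    return h;
}

/* Callers never name a concrete function; they always go through the table. */
int64_t read_prop(model_obj *obj, uint32_t slot) {
    assert(slot < 2);
    return obj->handlers->read_property(obj, slot);
}
```

Two objects with identical storage can therefore behave differently on the same operation, depending only on which table their handlers pointer selects.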
Memory Allocator: emalloc and Friends
PHP's per-request memory allocator, defined in Zend/zend_alloc.h and implemented in Zend/zend_alloc.c, is a key reason PHP handles request-based workloads so efficiently.
The allocator uses a three-tier strategy:
flowchart TD
REQ["emalloc(size)"] --> CHECK{"size category?"}
CHECK -->|"≤ 3072 bytes"| SMALL["Small allocation<br/>30 size-class bins<br/>Pre-allocated pages<br/>Free-list per bin"]
CHECK -->|"3073 bytes to just under 2 MB"| LARGE["Large allocation<br/>Runs of 4 KB pages in 2 MB chunks<br/>Best-fit search"]
CHECK -->|"larger than fits in a 2 MB chunk"| HUGE["Huge allocation<br/>Direct mmap()<br/>Tracked separately"]
SHUTDOWN["Request Shutdown"] --> BULK["Bulk deallocation<br/>Free all chunks at once<br/>No per-object free needed"]
Small allocations (≤ 3072 bytes, covering the vast majority of allocations) use a pool allocator with 30 size-class bins. Each bin has pre-allocated pages divided into fixed-size slots. Allocation is O(1): pop from the free list. Deallocation is O(1): push to the free list.
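A single size-class bin can be sketched as a page carved into equal slots with an intrusive free list. The `bin` type, the use of malloc for the page, and the out-of-slots behavior are simplifications for this example, not the zend_alloc.c implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A free slot stores the pointer to the next free slot in its own bytes. */
typedef struct free_slot { struct free_slot *next; } free_slot;

/* One bin: a page carved into equal slots plus a singly linked free list. */
typedef struct {
    size_t     slot_size;
    void      *page;
    free_slot *free_list;
} bin;

bin bin_new(size_t slot_size, size_t page_size) {
    assert(slot_size >= sizeof(free_slot) && slot_size % sizeof(void *) == 0);
    bin b = { slot_size, malloc(page_size), NULL };
    assert(b.page != NULL);
    /* Thread every slot onto the free list, last slot first. */
    for (size_t off = (page_size / slot_size) * slot_size;
         off >= slot_size; off -= slot_size) {
        free_slot *s = (free_slot *)((char *)b.page + off - slot_size);
        s->next = b.free_list;
        b.free_list = s;
    }
    return b;
}

/* O(1) allocation: pop the head of the free list. */
void *bin_alloc(bin *b) {
    free_slot *s = b->free_list;
    if (s == NULL) return NULL;   /* a real allocator would fetch another page */
    b->free_list = s->next;
    return s;
}

/* O(1) deallocation: push the slot back onto the free list. */
void bin_free(bin *b, void *p) {
    free_slot *s = p;
    s->next = b->free_list;
    b->free_list = s;
}
```

Since freed slots go to the head of the list, the most recently freed slot is the next one handed out, which tends to keep hot memory in cache.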
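Large allocations (above 3072 bytes and up to just under the 2 MB chunk size) are served as runs of contiguous 4 KB pages within pre-allocated chunks. Huge allocations, which do not fit in a chunk, go directly to mmap().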
The real trick is at request shutdown: instead of individually freeing every allocation, the engine can release entire memory chunks in bulk. This amortizes the cost of cleanup and means that even if PHP code "leaks" memory (doesn't explicitly unset() variables), the per-request allocator reclaims everything.
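Bulk release is easiest to see in a tiny arena allocator. The sketch below is an illustration of the idea only: the `arena` type, the 4 KB chunk size, and the bump-pointer strategy are assumptions made for brevity (php-src uses 2 MB chunks and the bin/page scheme described above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_CAPACITY 4096u   /* arbitrary for this sketch */

typedef struct chunk {
    struct chunk *next;
    size_t        used;
    unsigned char data[CHUNK_CAPACITY];
} chunk;

typedef struct {
    chunk *head;
    size_t live_chunks;
} arena;

/* Bump-allocate from the newest chunk, adding a chunk when it is full. */
void *arena_alloc(arena *a, size_t size) {
    size = (size + 15) & ~(size_t)15;   /* round up to a 16-byte multiple */
    assert(size > 0 && size <= CHUNK_CAPACITY);
    if (a->head == NULL || a->head->used + size > CHUNK_CAPACITY) {
        chunk *c = malloc(sizeof *c);
        assert(c != NULL);
        c->next = a->head;
        c->used = 0;
        a->head = c;
        a->live_chunks++;
    }
    void *p = a->head->data + a->head->used;
    a->head->used += size;
    return p;
}

/* "Request shutdown": walk the chunk list, not the individual allocations. */
void arena_release_all(arena *a) {
    while (a->head != NULL) {
        chunk *next = a->head->next;
        free(a->head);
        a->head = next;
        a->live_chunks--;
    }
}
```

A thousand small allocations here cost only a handful of free() calls at teardown, and nothing allocated during the "request" can outlive it.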
The API mirrors the standard C allocator: emalloc(), efree(), erealloc(), ecalloc(), and estrdup(). The pemalloc(size, persistent) variant switches between emalloc (per-request) and malloc (persistent/cross-request) based on the second argument.
Garbage Collector: Cycle Collection
Reference counting handles most memory management in PHP. But it has a classic weakness: circular references. If object A references object B and B references A, their refcounts never reach zero even when they're unreachable.
The cycle collector in Zend/zend_gc.c (with types in Zend/zend_gc.h) uses a variation of the Bacon-Rajan algorithm:
- Root collection: When a refcounted value's refcount is decremented but doesn't reach zero, it's a potential cycle root. The engine adds it to a root buffer.
- Trigger: When the number of buffered roots reaches the threshold (about 10,000 entries by default; since PHP 7.3 the threshold is adjusted dynamically), the GC runs.
- Mark gray: Starting from each root, recursively decrement refcounts of all reachable children. If a root's refcount reaches zero through this simulated collection, it might be garbage.
- Mark white: Scan again — values with zero simulated refcount are white (garbage). Values still referenced from outside the cycle are restored to black.
- Collect white: Free all white values.
Only types with the IS_TYPE_COLLECTABLE flag (arrays and objects) are considered as potential cycle roots. Strings and resources can be reference-counted but can't form cycles.
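The mark-gray, scan, and collect phases listed above can be sketched on a toy object graph. This follows the textbook synchronous form of the algorithm for a single root; the `node` type, the fixed child array, and the `freed` flag are assumptions for illustration, and zend_gc.c differs in many details (root buffer, iterative traversal, destructors).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { BLACK, GRAY, WHITE } gc_color;

#define MAX_CHILDREN 2

typedef struct node {
    int          refcount;
    gc_color     mark;
    bool         freed;                    /* stands in for actually freeing */
    struct node *children[MAX_CHILDREN];   /* NULL entries are unused */
} node;

/* Phase 1: trial-delete the references reachable from the root. */
void mark_gray(node *n) {
    if (n->mark == GRAY) return;
    n->mark = GRAY;
    for (int i = 0; i < MAX_CHILDREN; i++) {
        node *c = n->children[i];
        if (c != NULL) {
            assert(c->refcount > 0);
            c->refcount--;
            mark_gray(c);
        }
    }
}

/* Undo the trial deletion for a subgraph that is still referenced externally. */
void scan_black(node *n) {
    n->mark = BLACK;
    for (int i = 0; i < MAX_CHILDREN; i++) {
        node *c = n->children[i];
        if (c != NULL) {
            c->refcount++;
            if (c->mark != BLACK) scan_black(c);
        }
    }
}

/* Phase 2: gray nodes with a zero trial count turn white, the rest black. */
void scan(node *n) {
    if (n->mark != GRAY) return;
    if (n->refcount > 0) { scan_black(n); return; }
    n->mark = WHITE;
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (n->children[i] != NULL) scan(n->children[i]);
    }
}

/* Phase 3: everything still white is unreachable garbage. */
int collect_white(node *n) {
    if (n->mark != WHITE || n->freed) return 0;
    n->freed = true;
    int count = 1;
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (n->children[i] != NULL) count += collect_white(n->children[i]);
    }
    return count;
}

int collect_cycles(node *root) {
    mark_gray(root);
    scan(root);
    return collect_white(root);
}
```

For a two-node cycle with no outside references, both nodes end up white and are collected; if one node is still held from outside, the scan phase restores every refcount and nothing is freed.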
Tip: You can monitor GC activity with gc_status(). If you see thousands of roots being collected frequently, you likely have a loop creating circular references; consider breaking the cycle with WeakReference or explicit unset().
What's Next
We've now covered the foundation: how PHP represents values in memory, manages that memory, and reclaims it. In Article 3, we'll follow a PHP source file through the compilation pipeline — from raw characters to a token stream, through the AST, and into the opcodes that the VM will execute. The zval knowledge from this article will be essential: literals become IS_CONST zvals in the op_array, and every compiled variable slot holds a zval.