μpb: The Lightweight C Runtime Powering Python, Ruby, and PHP
Prerequisites
- Article 4: Serialization, Arena, and TcTable (for comparison with C++ runtime)
- Basic C programming knowledge
When you pip install protobuf in Python, gem install google-protobuf in Ruby, or pecl install protobuf in PHP, you're not getting the C++ protobuf runtime. You're getting μpb — a fast, small C protobuf implementation that's been purpose-built to serve as the backend for dynamic language runtimes.
μpb (micro protobuf, often written "upb") occupies a unique position in the protobuf ecosystem. It offers parsing speed comparable to the C++ runtime with roughly an order of magnitude less code. In this article, we'll explore why upb exists, how its MiniTable schema representation achieves compactness, how arena fusing solves a problem unique to upb's design, and how each dynamic language wraps upb through C extensions.
Why μpb Exists
The upb readme is refreshingly direct about upb's positioning. It highlights three advantages over the C++ runtime that are specifically relevant for embedding:
- No global state: No pre-main() registration, no global descriptor pools. This is critical for dynamic languages where the protobuf library might be loaded and unloaded.
- Optional reflection: Generated messages work the same whether reflection is linked in or not. In C++ protobuf, reflection and descriptors are deeply intertwined.
- Fast reflection-based parsing: Messages loaded at runtime (via reflection) parse just as fast as compiled-in messages. In C++ protobuf, there's a performance penalty for DynamicMessage.
graph LR
subgraph "C++ Protobuf"
CPPSIZE["Large code size"]
CPPGLOBAL["Global state required"]
CPPREF["Reflection always present"]
end
subgraph "μpb"
UPBSIZE["Small code size"]
UPBNOGLOB["No global state"]
UPBOPTREF["Reflection optional"]
end
subgraph "Consumers"
PY["Python"] --> UPBSIZE
RB["Ruby"] --> UPBNOGLOB
PHP["PHP"] --> UPBOPTREF
HPB["HPB (C++)"] --> UPBSIZE
end
The practical consequence is dramatic: Python's protobuf library (with upb) is a few hundred KB of native code. If it used the C++ runtime, it would be multiple megabytes — and would bring along global state that doesn't play well with Python's module system.
MiniTable: Compact Schema Representation
Where the C++ runtime uses Descriptor objects with rich methods and string names, upb uses MiniTable — a minimal, flat representation optimized purely for field access and parsing.
A upb_MiniTable is a struct containing:
- An array of upb_MiniTableField entries (one per field)
- Pointers to sub-message MiniTables (for message-typed fields)
- Offsets for field data within the upb_Message struct
- Hasbit and oneof case information
classDiagram
class upb_MiniTable {
+upb_MiniTable_FindFieldByNumber()
+upb_MiniTable_GetFieldByIndex()
+upb_MiniTable_FieldCount()
+upb_MiniTable_SubMessage()
+upb_MiniTable_MapKey()
+upb_MiniTable_MapValue()
}
class upb_MiniTableField {
+field number
+field type
+offset in message
+presence (hasbit/oneof)
}
class upb_MiniTableEnum {
+value validation
}
upb_MiniTable --> upb_MiniTableField : fields array
upb_MiniTable --> upb_MiniTable : sub-message links
upb_MiniTableField --> upb_MiniTableEnum : enum validation
The API is designed around integer-indexed access. Rather than looking up fields by name (which requires hash tables and string comparisons), upb provides upb_MiniTable_FindFieldByNumber() for field-number-based lookup and upb_MiniTable_GetFieldByIndex() for positional access. The field-number lookup uses a dense or sparse representation depending on the field number distribution.
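The dense-or-sparse idea can be illustrated with a small, self-contained sketch. Everything here (ToyField, ToyTable, dense_below, ToyTable_FindFieldByNumber) is a hypothetical stand-in invented for illustration, not upb's actual layout or code. It assumes the fields are sorted by number, that a leading run numbered 1, 2, 3, ... can be indexed directly, and that the rest are found by binary search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are not upb's real structs. */
typedef struct {
  uint32_t number; /* protobuf field number */
  uint16_t offset; /* byte offset of the field's data in the message */
} ToyField;

typedef struct {
  const ToyField* fields; /* sorted by ascending field number */
  size_t field_count;
  size_t dense_below; /* fields[i].number == i + 1 for every i < dense_below */
} ToyTable;

/* Returns the field with the given number, or NULL if the table has none. */
static const ToyField* ToyTable_FindFieldByNumber(const ToyTable* t,
                                                  uint32_t number) {
  assert(t != NULL);

  /* Dense case: index directly. For number == 0 the subtraction wraps to
     SIZE_MAX, so the comparison fails and we fall through. */
  size_t i = (size_t)number - 1;
  if (i < t->dense_below) return &t->fields[i];

  /* Sparse case: binary search over the remaining sorted entries. */
  size_t lo = t->dense_below;
  size_t hi = t->field_count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (t->fields[mid].number == number) return &t->fields[mid];
    if (t->fields[mid].number < number) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return NULL;
}
```

With fields numbered 1, 2, 3, 10, and 1000 and dense_below set to 3, lookups for 1 through 3 cost a single array index, while 10 and 1000 take the binary-search path. No strings or hash tables are involved in either case.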
Crucially, MiniTables can be built at runtime from serialized data — this is how Python's descriptor_pool.c works. When you call pool.Add(file_descriptor_proto) in Python, the serialized descriptor is used to build a MiniTable that enables full-speed parsing for that message type. This is the "fast reflection-based parsing" feature the README highlights.
Tip: If you're writing a C library that needs to handle protobuf, upb's MiniTable approach is far more appropriate than linking the full C++ runtime. The generated C API gives you typed accessors, and the MiniTable can be built from descriptors at runtime if you need dynamic schemas.
upb Arena and Fusing
Like the C++ runtime, upb uses arena allocation. But upb's arena has a unique feature: arena fusing.
The problem fusing solves is this: what happens when a message on arena A references data on arena B? In C++ protobuf, this is managed by the generated code's copy semantics. But in upb (and the dynamic languages wrapping it), messages are often manipulated more freely — you might set a sub-message from one arena as a field of a message on another arena.
upb_Arena_Fuse() links two arenas' lifetimes together so that neither is freed until both have been released:
// Fuses the lifetime of two arenas, such that no arenas that have been
// transitively fused together will be freed until all of them have reached a
// zero refcount.
UPB_API bool upb_Arena_Fuse(const upb_Arena* a, const upb_Arena* b);
flowchart TD
subgraph "Before Fuse"
A1["Arena A<br/>(msg1)"]
A2["Arena B<br/>(msg2)"]
end
subgraph "After msg1.sub = msg2"
A3["Arena A<br/>(msg1)"]
A4["Arena B<br/>(msg2)"]
A3 ---|"Fused"| A4
end
subgraph "Lifetime"
A5["Neither freed until<br/>both released"]
end
A3 --> A5
A4 --> A5
Fusing is transitive: if A is fused with B, and B is fused with C, then all three share a lifetime. The implementation uses a union-find data structure, and the operation is thread-safe. The header also documents constraints on explicit arena-to-arena references, which are a separate mechanism from fusing: you must not create reference cycles between arenas, and you must not create such a reference between arenas that are fused.
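To make the union-find idea concrete, here is a minimal single-threaded model. It is a sketch under stated assumptions, not upb's implementation: the ToyArena type and its functions are hypothetical, no real memory is managed, and the atomics that make the real operation thread-safe are left out. Each group of fused arenas shares one reference count kept on the group's root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy arena: hypothetical model, single-threaded, owns no real memory. */
typedef struct ToyArena {
  struct ToyArena* parent; /* union-find link; NULL means this is a root */
  int refcount;            /* used on the root only: references on the group */
  bool freed;              /* used on the root only */
} ToyArena;

static void ToyArena_Init(ToyArena* a) {
  a->parent = NULL;
  a->refcount = 1; /* the creator holds one reference */
  a->freed = false;
}

/* Finds the root of the arena's group, halving the path along the way. */
static ToyArena* ToyArena_Root(ToyArena* a) {
  while (a->parent != NULL) {
    if (a->parent->parent != NULL) a->parent = a->parent->parent;
    a = a->parent;
  }
  return a;
}

/* Joins the two groups so they share one refcount and one lifetime. */
static void ToyArena_Fuse(ToyArena* a, ToyArena* b) {
  ToyArena* ra = ToyArena_Root(a);
  ToyArena* rb = ToyArena_Root(b);
  if (ra == rb) return; /* same group: nothing to merge in this toy */
  assert(!ra->freed && !rb->freed);
  rb->parent = ra;
  ra->refcount += rb->refcount;
}

/* Drops one reference; the group is freed only when none remain. */
static void ToyArena_Release(ToyArena* a) {
  ToyArena* r = ToyArena_Root(a);
  assert(r->refcount > 0);
  if (--r->refcount == 0) r->freed = true;
}

static bool ToyArena_IsFreed(ToyArena* a) { return ToyArena_Root(a)->freed; }
```

Fusing A with B and then B with C leaves a single group holding three references. Releasing A and B does not free anything; only the final release of C marks the whole group as freed, which is the transitive lifetime described above.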
The arena also supports upb_Arena_IncRefFor() and upb_Arena_DecRefFor() for reference counting, and upb_Arena_SetAllocCleanup() for registering a cleanup function that runs after all of the arena's blocks are freed.
upb_Message and Wire Format
The upb_Message is upb's core message representation. Unlike C++ protobuf, where each message type is its own generated class, upb_Message is a generic type: all messages have the same C type, differentiated only by their associated upb_MiniTable.
Creating a message is straightforward:
upb_Message* msg = upb_Message_New(mini_table, arena);
The message stores field data at MiniTable-specified offsets, with separate storage for extensions and unknown fields. Unknown field iteration uses a segment-based API:
uintptr_t iter = kUpb_Message_UnknownBegin;
upb_StringView data;
while (upb_Message_NextUnknown(msg, &data, &iter)) {
// Process unknown field data
}
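The idea that every message shares one C type, with field data stored at table-specified offsets, can be sketched in a few lines. The names below (ToyFieldLayout, ToyMessage, and the accessors) are hypothetical, and the layout, with presence bits at the start of the block and values after them, is an assumption made for illustration rather than upb's real message layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-field layout entry; upb's real upb_MiniTableField differs. */
typedef struct {
  uint32_t number; /* protobuf field number */
  uint16_t offset; /* where the value lives inside the message block */
  uint16_t hasbit; /* index of the field's presence bit */
} ToyFieldLayout;

/* Every toy message has the same C type: an untyped block of bytes. */
typedef struct {
  unsigned char data[32];
} ToyMessage;

/* Stores the value at the field's offset and sets its presence bit. */
static void ToyMessage_SetInt32(ToyMessage* msg, const ToyFieldLayout* f,
                                int32_t v) {
  assert(f->offset + sizeof v <= sizeof msg->data);
  memcpy(msg->data + f->offset, &v, sizeof v);
  msg->data[f->hasbit / 8] |= (unsigned char)(1u << (f->hasbit % 8));
}

/* Reports whether the field's presence bit is set. */
static bool ToyMessage_Has(const ToyMessage* msg, const ToyFieldLayout* f) {
  return (msg->data[f->hasbit / 8] >> (f->hasbit % 8)) & 1u;
}

/* Reads the value back from the field's offset. */
static int32_t ToyMessage_GetInt32(const ToyMessage* msg,
                                   const ToyFieldLayout* f) {
  int32_t v;
  memcpy(&v, msg->data + f->offset, sizeof v);
  return v;
}
```

The accessors know nothing about a particular message type. Swapping in a different set of layout entries is enough to reinterpret the same block of bytes as a different message, which is why a table built at runtime can drive the same code paths as a compiled-in one.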
Wire format decoding lives in upb/wire/decode.h and offers several decode options as flags:
| Flag | Purpose |
|---|---|
| kUpb_DecodeOption_AliasString | Alias input buffer instead of copying strings |
| kUpb_DecodeOption_CheckRequired | Fail if required fields are missing |
| kUpb_DecodeOption_AlwaysValidateUtf8 | Enforce UTF-8 even for proto2 |
| kUpb_DecodeOption_DisableFastTable | Disable fast-table parser (debugging) |
The AliasString option is particularly important for dynamic languages. When parsing a message from a Python bytes object, the strings in the parsed message can point directly into the input buffer rather than being copied. This is only safe if the input buffer outlives the message, so the caller must keep the buffer alive for as long as the message is in use. One way to guarantee that is to place the input buffer on the same arena as the message.
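The difference between aliasing and copying can be shown with a toy decoder step. This is a sketch with hypothetical names (ToyStringView, ToyDecodeString), not upb's decoder: it assumes the caller has already located a length-delimited string at a known position in the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type: a (pointer, length) pair that owns nothing. */
typedef struct {
  const char* data;
  size_t size;
} ToyStringView;

/* Reads `len` bytes at `buf + pos` as a string field.
   alias == true:  the view points into buf, so buf must outlive the view.
   alias == false: the bytes are copied into `storage`, owned by the caller. */
static ToyStringView ToyDecodeString(const char* buf, size_t pos, size_t len,
                                     bool alias, char* storage) {
  ToyStringView out;
  out.size = len;
  if (alias) {
    out.data = buf + pos; /* no copy: shares memory with the input */
  } else {
    assert(storage != NULL);
    memcpy(storage, buf + pos, len);
    out.data = storage; /* independent of the input buffer's lifetime */
  }
  return out;
}
```

In the aliasing case the returned view becomes dangling as soon as the input buffer is released, which is exactly the lifetime requirement described above. In the copying case the view stays valid for as long as the destination storage does.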
Language Bindings: Python, Ruby, and PHP
Each dynamic language wraps upb through C extension modules. Python's implementation is the most mature and serves as the best example.
python/message.c reveals the bridging pattern through its includes. The file pulls in both the Python binding's own headers (python/message.h, python/convert.h) and upb headers (upb/message/message.h, upb/wire/decode.h, upb/reflection/message.h). The Python message object wraps a upb_Message* with a reference to its arena.
python/descriptor_pool.c shows how the schema side works. The PyUpb_DescriptorPool struct wraps a upb_DefPool* — upb's reflection-level type registry (the "def" system):
typedef struct {
PyObject_HEAD
upb_DefPool* symtab;
PyObject* db; // The DescriptorDatabase underlying this pool
} PyUpb_DescriptorPool;
flowchart TD
subgraph "Python Layer"
PyMsg["PyUpb_Message<br/>(Python object)"]
PyPool["PyUpb_DescriptorPool<br/>(Python object)"]
end
subgraph "upb Layer"
UMsg["upb_Message*"]
UArena["upb_Arena*"]
UDef["upb_DefPool*"]
UMini["upb_MiniTable*"]
end
PyMsg --> UMsg
PyMsg --> UArena
PyPool --> UDef
UDef --> UMini
UMsg -.->|"field access via"| UMini
When Python code does msg.SerializeToString(), the C extension calls upb_Encode() with the message's upb_Message* and upb_MiniTable*. When it does msg.ParseFromString(data), it calls upb_Decode(). The Python wrapper handles type conversion (Python int ↔ upb int32/int64, Python str ↔ upb string), error reporting, and arena lifecycle management.
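As a rough illustration of the type-conversion step, the sketch below narrows a wide host-language integer to an int32 field value and reports failure instead of silently truncating. ToyConvertToInt32 is a hypothetical helper written for this article, not a function from the Python extension, and the real binding's error handling goes through the Python C API rather than a boolean return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts a wide integer to an int32 field value.
   Returns false and leaves *out untouched when the value does not fit. */
static bool ToyConvertToInt32(long long value, int32_t* out) {
  assert(out != NULL);
  if (value < INT32_MIN || value > INT32_MAX) return false;
  *out = (int32_t)value;
  return true;
}
```

A binding would typically turn the false return into a language-level exception, so that assigning an out-of-range value to an int32 field fails loudly at the assignment rather than corrupting the serialized output later.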
Ruby and PHP follow the same pattern. Ruby's extension in ruby/ext/google/protobuf_c/ and PHP's in php/ext/google/protobuf/ both wrap upb in the same way, adapted to each language's C extension API conventions.
Tip: When debugging a protobuf issue in Python, Ruby, or PHP, the most common problems are arena-related — messages referencing data on already-freed arenas. Arena fusing should prevent this, but if you're manipulating message internals directly (e.g., through the C extension API), be aware of arena lifetime boundaries.
upb's Place in the Ecosystem
upb is more than just a lightweight C library — it's the shared foundation for three major language runtimes. This architectural decision has significant implications: improvements to upb's parser immediately benefit Python, Ruby, and PHP. A bugfix in upb's arena management fixes all three languages at once.
In Article 6, we'll examine code generation patterns through two lenses: Rust's innovative dual-kernel architecture (using both C++ protobuf and upb as backends) with its proxy-based API, and the C++ code generator's strategy pattern for field type specialization. We'll also look at HPB, the new C++ API that builds on upb to offer a middle path between the full C++ runtime and raw upb.