μpb: The Lightweight C Runtime Powering Python, Ruby, and PHP

Intermediate

Prerequisites

  • Article 4: Serialization, Arena, and TcTable (for comparison with C++ runtime)
  • Basic C programming knowledge

When you pip install protobuf in Python, gem install google-protobuf in Ruby, or pecl install protobuf in PHP, you're not getting the C++ protobuf runtime. You're getting μpb — a fast, small C protobuf implementation that's been purpose-built to serve as the backend for dynamic language runtimes.

μpb (micro protobuf, often written "upb") occupies a unique position in the protobuf ecosystem. It provides comparable parsing speed to the C++ runtime but in an order of magnitude less code. In this article, we'll explore why upb exists, how its MiniTable schema representation achieves compactness, how arena fusing solves a problem unique to upb's design, and how each dynamic language wraps upb through C extensions.

Why μpb Exists

The upb README is refreshingly direct about upb's positioning. It highlights three advantages over the C++ runtime that are specifically relevant for embedding:

  1. No global state: No pre-main() registration, no global descriptor pools. This is critical for dynamic languages where the protobuf library might be loaded and unloaded.
  2. Optional reflection: Generated messages work the same whether reflection is linked in or not. In C++ protobuf, reflection and descriptors are deeply intertwined.
  3. Fast reflection-based parsing: Messages loaded at runtime (via reflection) parse just as fast as compiled-in messages. In C++ protobuf, there's a performance penalty for DynamicMessage.

graph LR
    subgraph "C++ Protobuf"
        CPPSIZE["Large code size"]
        CPPGLOBAL["Global state required"]
        CPPREF["Reflection always present"]
    end
    
    subgraph "μpb"
        UPBSIZE["Small code size"]
        UPBNOGLOB["No global state"]
        UPBOPTREF["Reflection optional"]
    end
    
    subgraph "Consumers"
        PY["Python"] --> UPBSIZE
        RB["Ruby"] --> UPBNOGLOB
        PHP["PHP"] --> UPBOPTREF
        HPB["HPB (C++)"] --> UPBSIZE
    end

The practical consequence is dramatic: Python's protobuf library (with upb) is a few hundred KB of native code. If it used the C++ runtime, it would be multiple megabytes — and would bring along global state that doesn't play well with Python's module system.

MiniTable: Compact Schema Representation

Where the C++ runtime uses Descriptor objects with rich methods and string names, upb uses MiniTable — a minimal, flat representation optimized purely for field access and parsing.

A upb_MiniTable is a struct containing:

  • An array of upb_MiniTableField entries (one per field)
  • Pointers to sub-message MiniTables (for message-typed fields)
  • Offsets for field data within the upb_Message struct
  • Hasbit and oneof case information

classDiagram
    class upb_MiniTable {
        +upb_MiniTable_FindFieldByNumber()
        +upb_MiniTable_GetFieldByIndex()
        +upb_MiniTable_FieldCount()
        +upb_MiniTable_SubMessage()
        +upb_MiniTable_MapKey()
        +upb_MiniTable_MapValue()
    }
    
    class upb_MiniTableField {
        +field number
        +field type
        +offset in message
        +presence (hasbit/oneof)
    }
    
    class upb_MiniTableEnum {
        +value validation
    }
    
    upb_MiniTable --> upb_MiniTableField : fields array
    upb_MiniTable --> upb_MiniTable : sub-message links
    upb_MiniTableField --> upb_MiniTableEnum : enum validation

The API is designed around integer-indexed access. Rather than looking up fields by name (which requires hash tables and string comparisons), upb provides upb_MiniTable_FindFieldByNumber() for field-number-based lookup and upb_MiniTable_GetFieldByIndex() for positional access. The field-number lookup uses a dense or sparse representation depending on the field number distribution.
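
To make the dense-versus-sparse idea concrete, here is a minimal sketch of a field-number lookup. It assumes a sorted field array whose first few entries are numbered 1, 2, 3, and so on (the dense part), with a binary search over the rest. The names ToyField, ToyTable, and toy_find_field are hypothetical stand-ins for illustration, not upb's actual structs or layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not upb's real structs. */
typedef struct {
  uint32_t number; /* protobuf field number */
  uint16_t offset; /* byte offset of the field's data in the message */
} ToyField;

typedef struct {
  const ToyField* fields; /* sorted by field number */
  size_t field_count;
  size_t dense_below; /* fields[i].number == i + 1 for all i < dense_below */
} ToyTable;

/* Returns the field with the given number, or NULL if the schema has none. */
static const ToyField* toy_find_field(const ToyTable* t, uint32_t number) {
  assert(t != NULL);
  /* Dense path: numbers 1..dense_below map straight to an array index. */
  if (number >= 1 && number <= t->dense_below) {
    return &t->fields[number - 1];
  }
  /* Sparse path: binary search over the remaining sorted entries. */
  size_t lo = t->dense_below;
  size_t hi = t->field_count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (t->fields[mid].number < number) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  if (lo < t->field_count && t->fields[lo].number == number) {
    return &t->fields[lo];
  }
  return NULL;
}
```

With fields numbered 1, 2, 3, 10, and 1000 and dense_below set to 3, lookups for 1 through 3 are a single array index, while 10 and 1000 go through the binary search. No strings or hash tables are involved in either path.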

Crucially, MiniTables can be built at runtime from serialized data — this is how Python's descriptor_pool.c works. When you call pool.Add(file_descriptor_proto) in Python, the serialized descriptor is used to build a MiniTable that enables full-speed parsing for that message type. This is the "fast reflection-based parsing" feature the README highlights.

Tip: If you're writing a C library that needs to handle protobuf, upb's MiniTable approach is far more appropriate than linking the full C++ runtime. The generated C API gives you typed accessors, and the MiniTable can be built from descriptors at runtime if you need dynamic schemas.

upb Arena and Fusing

Like the C++ runtime, upb uses arena allocation. But upb's arena has a unique feature: arena fusing.

The problem fusing solves is this: what happens when a message on arena A references data on arena B? In C++ protobuf, this is managed by the generated code's copy semantics. But in upb (and the dynamic languages wrapping it), messages are often manipulated more freely — you might set a sub-message from one arena as a field of a message on another arena.

upb_Arena_Fuse() links two arenas' lifetimes together so that neither is freed until both have been released:

// Fuses the lifetime of two arenas, such that no arenas that have been
// transitively fused together will be freed until all of them have reached a
// zero refcount.
UPB_API bool upb_Arena_Fuse(const upb_Arena* a, const upb_Arena* b);

flowchart TD
    subgraph "Before Fuse"
        A1["Arena A<br/>(msg1)"]
        A2["Arena B<br/>(msg2)"]
    end
    
    subgraph "After msg1.sub = msg2"
        A3["Arena A<br/>(msg1)"]
        A4["Arena B<br/>(msg2)"]
        A3 ---|"Fused"| A4
    end
    
    subgraph "Lifetime"
        A5["Neither freed until<br/>both released"]
    end
    
    A3 --> A5
    A4 --> A5

Fusing is transitive: if A is fused with B, and B is fused with C, then all three share a lifetime. The implementation uses a union-find data structure, and the operation is thread-safe. Because upb_Arena_Fuse() returns a bool, callers should check the result rather than assume the fuse succeeded. The header also documents constraints on how arenas may be linked, including that you must not create reference cycles between arenas.
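
The lifetime-sharing idea can be sketched with a toy union-find. This is a single-threaded illustration under simplifying assumptions, not upb's implementation, which is thread-safe and manages real memory blocks. ToyArena and the toy_arena_* functions are hypothetical names. Each group's root carries the combined reference count, and the group is released only when that count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy arena: tracks lifetime only, owns no memory blocks. */
typedef struct ToyArena {
  struct ToyArena* parent; /* NULL when this arena is the root of its group */
  int refcount;            /* meaningful on the root: refs held by the group */
  bool freed;              /* set on the root once the group is released */
} ToyArena;

static void toy_arena_init(ToyArena* a) {
  a->parent = NULL;
  a->refcount = 1; /* the creator holds one reference */
  a->freed = false;
}

/* Finds the root of the group, shortening the path along the way. */
static ToyArena* toy_arena_root(ToyArena* a) {
  while (a->parent != NULL) {
    if (a->parent->parent != NULL) a->parent = a->parent->parent;
    a = a->parent;
  }
  return a;
}

/* Merges the groups of a and b so that they share one lifetime. */
static void toy_arena_fuse(ToyArena* a, ToyArena* b) {
  ToyArena* ra = toy_arena_root(a);
  ToyArena* rb = toy_arena_root(b);
  if (ra == rb) return; /* same group: nothing to merge in this toy */
  rb->parent = ra;
  ra->refcount += rb->refcount; /* merged group owes all outstanding refs */
}

/* Drops one reference; returns true if this released the whole group. */
static bool toy_arena_release(ToyArena* a) {
  ToyArena* root = toy_arena_root(a);
  assert(root->refcount > 0);
  root->refcount -= 1;
  if (root->refcount == 0) {
    root->freed = true;
    return true;
  }
  return false;
}

static bool toy_arena_is_freed(ToyArena* a) {
  return toy_arena_root(a)->freed;
}
```

After fusing A with B and B with C, releasing A and B leaves the group alive; only the final release of C frees all three. That is the transitive behavior described above, modeled with plain integers instead of atomics.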

The arena also supports upb_Arena_IncRefFor() and upb_Arena_DecRefFor() for reference counting, and upb_Arena_SetAllocCleanup() for registering cleanup functions that run after all blocks are freed.

upb_Message and Wire Format

The upb_Message is upb's core message representation. Unlike C++ protobuf's generated classes with typed accessors, upb_Message is a generic type — all messages have the same C type, differentiated only by their associated upb_MiniTable.
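
As a rough illustration of "one C type, many schemas", the sketch below stores values in a raw byte buffer and lets a small layout record say where each field lives. ToyMessage, ToyFieldLayout, and the toy_* accessors are hypothetical and much simpler than upb's real message layout; the point is only that the accessor code is generic and the layout data carries the schema.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout record: where one field lives inside the message. */
typedef struct {
  uint16_t offset; /* byte offset of the value */
  uint16_t hasbit; /* index of the presence bit (stored in byte 0 here) */
} ToyFieldLayout;

/* One C type for every message; the layout records give the bytes meaning. */
typedef struct {
  unsigned char bytes[16];
} ToyMessage;

static void toy_set_int32(ToyMessage* m, const ToyFieldLayout* f, int32_t v) {
  assert((size_t)f->offset + sizeof v <= sizeof m->bytes);
  memcpy(m->bytes + f->offset, &v, sizeof v);
  m->bytes[f->hasbit / 8] |= (unsigned char)(1u << (f->hasbit % 8));
}

static bool toy_has(const ToyMessage* m, const ToyFieldLayout* f) {
  return ((m->bytes[f->hasbit / 8] >> (f->hasbit % 8)) & 1u) != 0;
}

static int32_t toy_get_int32(const ToyMessage* m, const ToyFieldLayout* f) {
  int32_t v;
  assert((size_t)f->offset + sizeof v <= sizeof m->bytes);
  memcpy(&v, m->bytes + f->offset, sizeof v);
  return v;
}
```

A caller would zero a ToyMessage, then read and write fields purely through layout records such as {4, 0} or {8, 1}. Two different "message types" differ only in which layout records they use, never in the C type of the message.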

Creating a message is straightforward:

upb_Message* msg = upb_Message_New(mini_table, arena);

The message stores field data at MiniTable-specified offsets, with separate storage for extensions and unknown fields. Unknown field iteration uses a segment-based API:

uintptr_t iter = kUpb_Message_UnknownBegin;
upb_StringView data;
while (upb_Message_NextUnknown(msg, &data, &iter)) {
    // Process unknown field data
}
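
The cursor-style loop above can be modeled in a few lines. The sketch below assumes unknown data is kept as a list of byte segments and that the iterator state is a plain index; both are simplifications for illustration, and ToySegment, ToyUnknowns, and toy_next_unknown are hypothetical names rather than upb internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: unknown fields kept as a list of byte segments. */
typedef struct {
  const char* data;
  size_t size;
} ToySegment;

typedef struct {
  const ToySegment* segments;
  size_t count;
} ToyUnknowns;

#define TOY_UNKNOWN_BEGIN ((uintptr_t)0)

/* Writes the next segment to *out and advances *iter; false when exhausted. */
static bool toy_next_unknown(const ToyUnknowns* u, ToySegment* out,
                             uintptr_t* iter) {
  assert(u != NULL && out != NULL && iter != NULL);
  if (*iter >= u->count) return false;
  *out = u->segments[*iter];
  *iter += 1;
  return true;
}

/* Example consumer: total bytes of unknown data across all segments. */
static size_t toy_unknown_bytes(const ToyUnknowns* u) {
  size_t total = 0;
  uintptr_t iter = TOY_UNKNOWN_BEGIN;
  ToySegment seg;
  while (toy_next_unknown(u, &seg, &iter)) total += seg.size;
  return total;
}
```

Keeping the iterator state in a caller-owned integer means the message itself stays untouched during iteration, and several independent iterations can run over the same message.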

Wire format decoding lives in upb/wire/decode.h and offers several decode options as flags:

Flag                                   Purpose
kUpb_DecodeOption_AliasString          Alias input buffer instead of copying strings
kUpb_DecodeOption_CheckRequired        Fail if required fields are missing
kUpb_DecodeOption_AlwaysValidateUtf8   Enforce UTF-8 even for proto2
kUpb_DecodeOption_DisableFastTable     Disable fast-table parser (debugging)

The AliasString option is particularly important for dynamic languages. When parsing a message from a Python bytes object, the strings in the parsed message can point directly into the input buffer rather than copying. This only works if the input buffer outlives the message, so the caller must keep the buffer alive for as long as the message is in use. That holds automatically when the input bytes were themselves allocated on the message's arena.
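
The trade-off between aliasing and copying can be shown with a toy string reader. This is a sketch of the general technique only: ToyStringView and toy_read_string are hypothetical, and the function takes an explicit offset and size instead of decoding the wire format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  const char* data;
  size_t size;
} ToyStringView;

/* Reads `size` bytes at `offset` of the input as a string field.
 * alias == true:  the view points into `input` (no copy; the input must
 *                 outlive the view).
 * alias == false: the bytes are copied into caller-owned `storage`. */
static ToyStringView toy_read_string(const char* input, size_t offset,
                                     size_t size, bool alias, char* storage) {
  ToyStringView sv;
  sv.size = size;
  if (alias) {
    sv.data = input + offset; /* zero-copy: shares memory with the input */
  } else {
    assert(storage != NULL);
    memcpy(storage, input + offset, size);
    sv.data = storage; /* independent of the input from here on */
  }
  return sv;
}
```

If the input buffer is later modified or freed, the aliased view sees the change (or dangles), while the copied view is unaffected. That is exactly why aliasing is only safe when the input is guaranteed to outlive the message.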

Language Bindings: Python, Ruby, and PHP

Each dynamic language wraps upb through C extension modules. Python's implementation is the most mature and serves as the best example.

python/message.c reveals the bridging pattern through its includes. The file pulls in both the extension's own Python-facing headers (python/message.h, python/convert.h) and upb headers (upb/message/message.h, upb/wire/decode.h, upb/reflection/message.h). The Python message object wraps a upb_Message* with a reference to its arena.

python/descriptor_pool.c shows how the schema side works. The PyUpb_DescriptorPool struct wraps a upb_DefPool* — upb's reflection-level type registry (the "def" system):

typedef struct {
    PyObject_HEAD
    upb_DefPool* symtab;
    PyObject* db;  // The DescriptorDatabase underlying this pool
} PyUpb_DescriptorPool;

flowchart TD
    subgraph "Python Layer"
        PyMsg["PyUpb_Message<br/>(Python object)"]
        PyPool["PyUpb_DescriptorPool<br/>(Python object)"]
    end
    
    subgraph "upb Layer"
        UMsg["upb_Message*"]
        UArena["upb_Arena*"]
        UDef["upb_DefPool*"]
        UMini["upb_MiniTable*"]
    end
    
    PyMsg --> UMsg
    PyMsg --> UArena
    PyPool --> UDef
    UDef --> UMini
    UMsg -.->|"field access via"| UMini

When Python code does msg.SerializeToString(), the C extension calls upb_Encode() with the message's upb_Message* and upb_MiniTable*. When it does msg.ParseFromString(data), it calls upb_Decode(). The Python wrapper handles type conversion (Python int ↔ upb int32/int64, Python str ↔ upb string), error reporting, and arena lifecycle management.

Ruby and PHP follow the same pattern. Ruby's extension in ruby/ext/google/protobuf_c/ and PHP's in php/ext/google/protobuf/ both wrap upb in the same way, adapted to each language's C extension API conventions.

Tip: When debugging a protobuf issue in Python, Ruby, or PHP, check early for arena-related problems, such as messages referencing data on already-freed arenas. Arena fusing should prevent this, but if you're manipulating message internals directly (e.g., through the C extension API), be aware of arena lifetime boundaries.

upb's Place in the Ecosystem

upb is more than just a lightweight C library — it's the shared foundation for three major language runtimes. This architectural decision has significant implications: improvements to upb's parser immediately benefit Python, Ruby, and PHP. A bugfix in upb's arena management fixes all three languages at once.

In Article 6, we'll examine code generation patterns through two lenses: Rust's innovative dual-kernel architecture (using both C++ protobuf and upb as backends) with its proxy-based API, and the C++ code generator's strategy pattern for field type specialization. We'll also look at HPB, the new C++ API that builds on upb to offer a middle path between the full C++ runtime and raw upb.