The Rust Integration and C++ Code Generator: A Study in Code Generation Patterns

Advanced

Prerequisites

  • Article 4: Serialization internals (TcTable, arena)
  • Article 5: μpb runtime architecture
  • Familiarity with Rust ownership model and C++ template patterns

Code generation is where language design meets pragmatic engineering. Each target language imposes different constraints on how generated protobuf code should look, and the generator must navigate these constraints while producing code that's both ergonomic and performant.

In this article, we'll study two generators that represent opposite ends of the design spectrum: Rust's generator, which sits on a dual-kernel architecture and must solve novel ownership problems with proxy types, and the C++ generator, which uses a mature strategy pattern to handle the complexity of 10+ field type variants. We'll also look at HPB, a new C++ API that represents a third path forward.

Rust's Dual-Kernel Architecture

Rust's protobuf support is architecturally unique: it supports two completely different runtime backends. The rust/cpp_kernel/mod.rs module provides a backend powered by the full C++ protobuf runtime, while rust/upb_kernel/mod.rs provides one powered by upb.

Why two backends? The answer is pragmatic. Google's internal systems use C++ protobuf extensively, and Rust code running inside Google needs to interop with existing C++ message instances. External users, however, benefit from upb's smaller footprint. Rather than choosing one and forcing the other camp to adapt, the Rust team built an abstraction layer.

flowchart TD
    subgraph "User-Facing API"
        API["Rust Protobuf API<br/>View&lt;'msg, T&gt;, Mut&lt;'msg, T&gt;"]
    end
    
    subgraph "Kernel Abstraction"
        TRAIT["Kernel trait implementations<br/>(Message, Serialize, etc.)"]
    end
    
    subgraph "cpp_kernel"
        CPP["C++ protobuf FFI<br/>message.rs, repeated.rs,<br/>map.rs, string.rs"]
    end
    
    subgraph "upb_kernel"
        UPB["upb FFI<br/>message.rs, repeated.rs,<br/>map.rs, string.rs"]
    end
    
    API --> TRAIT
    TRAIT --> CPP
    TRAIT --> UPB

Both kernels export the same set of sub-modules: message, repeated, map, string, and raw. The cpp_kernel imports from std::ffi::{c_int, c_void} and works with raw C++ pointers through FFI. The upb_kernel imports upb's arena and mini-table types. The user never sees which kernel is active — the generated code and the runtime library present a unified API.
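To make the idea of one API over two backends concrete, here is a minimal sketch. Everything in it is hypothetical: the Kernel trait, the CppRaw and UpbRaw stand-ins, and the Item wrapper are illustration names, not types from the protobuf crate. The sketch models the kernel choice with a generic parameter purely so both variants can be exercised side by side; it makes no claim about how the real crate selects its kernel.

```rust
/// Operations a kernel must provide. Hypothetical and heavily simplified.
pub trait Kernel {
    /// Backend-specific handle to one message instance.
    type RawMessage;
    const NAME: &'static str;
    fn new_message() -> Self::RawMessage;
    fn get_id(msg: &Self::RawMessage) -> i32;
    fn set_id(msg: &mut Self::RawMessage, id: i32);
}

/// Stand-in for the C++-backed kernel; a real one would hold a raw pointer.
pub struct CppKernel;
pub struct CppRaw {
    id: i32,
}

impl Kernel for CppKernel {
    type RawMessage = CppRaw;
    const NAME: &'static str = "cpp";
    fn new_message() -> CppRaw {
        CppRaw { id: 0 }
    }
    fn get_id(msg: &CppRaw) -> i32 {
        msg.id
    }
    fn set_id(msg: &mut CppRaw, id: i32) {
        msg.id = id;
    }
}

/// Stand-in for the upb-backed kernel; a real one would allocate on an arena.
pub struct UpbKernel;
pub struct UpbRaw {
    slots: [i32; 1],
}

impl Kernel for UpbKernel {
    type RawMessage = UpbRaw;
    const NAME: &'static str = "upb";
    fn new_message() -> UpbRaw {
        UpbRaw { slots: [0] }
    }
    fn get_id(msg: &UpbRaw) -> i32 {
        msg.slots[0]
    }
    fn set_id(msg: &mut UpbRaw, id: i32) {
        msg.slots[0] = id;
    }
}

/// What generated code might expose: one API, independent of the kernel.
pub struct Item<K: Kernel> {
    raw: K::RawMessage,
}

impl<K: Kernel> Item<K> {
    pub fn new() -> Self {
        Item { raw: K::new_message() }
    }
    pub fn id(&self) -> i32 {
        K::get_id(&self.raw)
    }
    pub fn set_id(&mut self, id: i32) {
        K::set_id(&mut self.raw, id)
    }
    pub fn kernel_name(&self) -> &'static str {
        K::NAME
    }
}
```

Code written against Item calls new(), id(), and set_id() identically for Item::<CppKernel> and Item::<UpbKernel>; only the storage behind the handle differs.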

The Proxy Pattern: View and Mut Types

The most intellectually interesting aspect of Rust protobuf is its proxy type system. The rationale is explained in remarkable detail in rust/proxied.rs.

The problem: Rust references (&T and &mut T) don't work for protobuf field access. There are two reasons:

  1. Memory representation mismatch: A field's in-memory representation may differ from what the user expects. For example, rarely accessed int64 fields might use packed 32-bit storage, or presence information might be stored in a centralized bitset rather than inline. You can't form a &i64 reference if the value doesn't exist as a contiguous i64 in memory.

  2. Arena lifetime coupling: In upb, mutation operations (adding to a repeated field, setting a string) require an arena parameter. Rust's &mut T has no way to carry this extra context. Worse, &mut T allows mem::swap(), which could silently corrupt arena-owned data by swapping pointers across arenas.
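The first reason can be illustrated with a small sketch. The PackedMsg layout below is hypothetical (it is not how any real kernel lays out messages): presence is tracked in a shared bitset and a logical int64 field is stored in 32 bits. Under those assumptions the getter has to compute its result, because no i64 exists in memory to borrow.

```rust
use std::convert::TryFrom;

/// Hypothetical layout: presence lives in a shared bitset and the logical
/// int64 field `count` is stored in only 32 bits.
pub struct PackedMsg {
    hasbits: u32,
    count_packed: i32,
}

impl PackedMsg {
    const COUNT_BIT: u32 = 1 << 0;

    pub fn new() -> Self {
        PackedMsg { hasbits: 0, count_packed: 0 }
    }

    /// Returns the value by copy. Neither an `i64` nor an `Option<i64>`
    /// exists in memory, so a `&i64` could not be handed out here.
    pub fn count(&self) -> Option<i64> {
        if self.hasbits & Self::COUNT_BIT != 0 {
            Some(i64::from(self.count_packed))
        } else {
            None
        }
    }

    /// Stores the value if it fits the packed slot. This sketch simply
    /// rejects larger values and leaves the message unchanged.
    pub fn set_count(&mut self, value: i64) -> bool {
        match i32::try_from(value) {
            Ok(v) => {
                self.count_packed = v;
                self.hasbits |= Self::COUNT_BIT;
                true
            }
            Err(_) => false,
        }
    }
}
```

An accessor with the signature fn count(&self) -> &i64 has nothing to point at in this layout, which is why by-value or proxy-returning accessors are needed.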

The solution is proxy types:

pub trait Proxied: SealedInternal + AsView<Proxied = Self> + Sized + 'static {
    type View<'msg>: AsView<Proxied = Self> + IntoView<'msg>;
}

pub trait MutProxied: SealedInternal + Proxied + AsMut<MutProxied = Self> + 'static {
    type Mut<'msg>: AsMut<MutProxied = Self> + IntoMut<'msg> + IntoView<'msg>;
}

classDiagram
    class Proxied {
        <<trait>>
        +View~'msg~ : AsView
    }
    
    class MutProxied {
        <<trait>>
        +Mut~'msg~ : AsMut + IntoView
    }
    
    class ViewT["View&lt;'msg, T&gt;"] {
        "Type alias for T::View&lt;'msg&gt;"
        "Like &'msg T but can carry arena"
    }
    
    class MutT["Mut&lt;'msg, T&gt;"] {
        "Type alias for T::Mut&lt;'msg&gt;"
        "Like &'msg mut T but arena-safe"
    }
    
    Proxied <|-- MutProxied
    Proxied --> ViewT : defines
    MutProxied --> MutT : defines

View<'msg, T> is the read proxy (like &'msg T) and Mut<'msg, T> is the write proxy (like &'msg mut T). These are not plain references — they're smart wrappers that can carry an arena pointer, handle packed storage, and prevent unsafe swaps. The 'msg lifetime parameter ensures the proxy doesn't outlive the message it borrows from.
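As a rough illustration of what "a wrapper that can carry an arena" means, the sketch below hand-writes a view and a mut proxy for a single message type. All names (Arena, Person, PersonView, PersonMut) are hypothetical, the arena only counts bytes rather than allocating, and a String stands in for arena-owned data. It is meant to show the shape of the pattern, not the generated code.

```rust
use std::cell::Cell;

/// Stand-in arena: it only records how many bytes were requested.
pub struct Arena {
    bytes_used: Cell<usize>,
}

impl Arena {
    fn record_alloc(&self, len: usize) {
        self.bytes_used.set(self.bytes_used.get() + len);
    }
}

/// Owned message. It owns its arena, as an arena-backed message would.
pub struct Person {
    name: String, // stands in for arena-owned bytes
    arena: Arena,
}

/// Read proxy: cheap to copy, tied to the message by 'msg.
#[derive(Clone, Copy)]
pub struct PersonView<'msg> {
    name: &'msg str,
}

/// Write proxy: carries the arena next to the borrowed data, which a
/// plain `&mut` reference to a field could not do.
pub struct PersonMut<'msg> {
    name: &'msg mut String,
    arena: &'msg Arena,
}

impl Person {
    pub fn new() -> Self {
        Person { name: String::new(), arena: Arena { bytes_used: Cell::new(0) } }
    }
    pub fn as_view(&self) -> PersonView<'_> {
        PersonView { name: &self.name }
    }
    pub fn as_mut(&mut self) -> PersonMut<'_> {
        PersonMut { name: &mut self.name, arena: &self.arena }
    }
    pub fn arena_bytes_used(&self) -> usize {
        self.arena.bytes_used.get()
    }
}

impl<'msg> PersonView<'msg> {
    pub fn name(&self) -> &'msg str {
        self.name
    }
}

impl<'msg> PersonMut<'msg> {
    /// The setter needs the arena; the proxy supplies it implicitly.
    pub fn set_name(&mut self, value: &str) {
        self.arena.record_alloc(value.len());
        self.name.clear();
        self.name.push_str(value);
    }
    pub fn as_view(&self) -> PersonView<'_> {
        PersonView { name: self.name.as_str() }
    }
}
```

Because callers only ever hold a PersonMut and never a &mut to the underlying storage, they have no handle they could pass to mem::swap() to move arena-owned data between messages.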

Tip: If you're designing a Rust API for arena-allocated data, protobuf's proxy pattern is a well-considered solution to the problem of "references to data that doesn't have a stable memory representation." Study proxied.rs — the comment block is one of the best design rationale documents in the entire repo.

Rust Code Generator

The RustGenerator is implemented as a straightforward CodeGenerator subclass:

class PROTOC_EXPORT RustGenerator final
    : public google::protobuf::compiler::CodeGenerator {
 public:
    bool Generate(const FileDescriptor* file, const std::string& parameter,
                  GeneratorContext* generator_context,
                  std::string* error) const override;

    uint64_t GetSupportedFeatures() const override {
        return FEATURE_PROTO3_OPTIONAL | FEATURE_SUPPORTS_EDITIONS;
    }
    Edition GetMinimumEdition() const override { return Edition::EDITION_PROTO2; }
    Edition GetMaximumEdition() const override { return Edition::EDITION_2024; }
};

The generator supports both proto2/proto3 and the Editions system (up to Edition 2024). Its Generate() method produces Rust source files that work with whichever kernel is linked. The generated code uses the Proxied / MutProxied traits, so field accessors return View and Mut types rather than plain references.
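The edition range declared by GetMinimumEdition() and GetMaximumEdition() amounts to an inclusive interval check. The sketch below models that check in Rust; the GeneratorInfo type and accepts() function are hypothetical names for illustration, and the numeric values mirror the Edition enum in descriptor.proto as far as I know them.

```rust
/// Subset of protobuf editions, ordered by their numeric value.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Edition {
    Proto2 = 998,
    Proto3 = 999,
    Edition2023 = 1000,
    Edition2024 = 1001,
}

/// Hypothetical summary of what a generator declares about itself.
pub struct GeneratorInfo {
    pub minimum: Edition,
    pub maximum: Edition,
}

impl GeneratorInfo {
    /// A file is accepted only if its edition lies in [minimum, maximum].
    pub fn accepts(&self, file_edition: Edition) -> bool {
        self.minimum <= file_edition && file_edition <= self.maximum
    }
}

/// The range the RustGenerator declaration above advertises.
pub fn rust_generator_info() -> GeneratorInfo {
    GeneratorInfo { minimum: Edition::Proto2, maximum: Edition::Edition2024 }
}
```

A generator that advertised a maximum of Edition 2023 would, under this model, reject an Edition 2024 file instead of silently generating code for features it does not understand.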

The Rust generator lives in src/google/protobuf/compiler/rust/ alongside the other language generators — it's a built-in generator, not a plugin. This was a deliberate choice to ensure tight integration with protoc's descriptor system and editions support.

C++ Code Generator: Strategy Pattern for Field Types

The C++ code generator is the most mature and complex generator in the repository. Its central design pattern is a strategy hierarchy for field code generation.

FieldGeneratorBase is the abstract base class that all field type generators inherit from. It provides a rich set of predicates for the field's properties:

class FieldGeneratorBase {
 public:
    bool should_split() const;           // Cold split section?
    bool is_trivial() const;             // int, float, double, enum, bool?
    bool has_trivial_value() const;      // Trivial or raw pointer?
    bool has_trivial_zero_default() const; // memset-zero initializable?
    bool is_message() const;             // Message or group type?
    bool is_weak() const;                // Weak message field?
    bool is_lazy() const;                // Lazy message field?
    bool is_string() const;              // String or bytes?
    // ... virtual methods for codegen
};

classDiagram
    class FieldGeneratorBase {
        <<abstract>>
        +should_split() bool
        +is_trivial() bool
        +is_message() bool
        +is_string() bool
        +GenerateAccessorDeclarations()*
        +GenerateAccessorDefinitions()*
        +GenerateMergingCode()*
        +GenerateSwappingCode()*
    }
    
    class PrimitiveFieldGenerator {
        "int32, int64, float, etc."
    }
    class StringFieldGenerator {
        "string, bytes"
    }
    class MessageFieldGenerator {
        "Nested messages"
    }
    class MapFieldGenerator {
        "map&lt;K, V&gt; fields"
    }
    class EnumFieldGenerator {
        "Enum-typed fields"
    }
    class CordFieldGenerator {
        "Cord-backed strings"
    }
    
    FieldGeneratorBase <|-- PrimitiveFieldGenerator
    FieldGeneratorBase <|-- StringFieldGenerator
    FieldGeneratorBase <|-- MessageFieldGenerator
    FieldGeneratorBase <|-- MapFieldGenerator
    FieldGeneratorBase <|-- EnumFieldGenerator
    FieldGeneratorBase <|-- CordFieldGenerator

Each concrete generator implements virtual methods like GenerateAccessorDeclarations(), GenerateAccessorDefinitions(), GenerateMergingCode(), and GenerateSwappingCode(). The message-level generator orchestrates all field generators, allocating hasbits, managing oneof unions, and producing the complete .pb.h and .pb.cc files.
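The shape of this strategy hierarchy can be sketched compactly. The example below is a Rust analogue, not the C++ generator's actual code: the FieldGenerator trait, the three strategy structs, make_generator(), and the emitted strings are all simplified, hypothetical stand-ins chosen to show how a message-level loop delegates to per-field-type strategies.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldKind {
    Int32,
    String,
    Message,
}

pub struct Field {
    pub name: &'static str,
    pub kind: FieldKind,
}

/// The strategy interface: one method per piece of code to emit.
pub trait FieldGenerator {
    fn accessor_declaration(&self, field: &Field) -> String;
    fn merging_code(&self, field: &Field) -> String;
}

struct PrimitiveGen;
struct StringGen;
struct MessageGen;

impl FieldGenerator for PrimitiveGen {
    fn accessor_declaration(&self, f: &Field) -> String {
        format!("int32_t {}() const;", f.name)
    }
    fn merging_code(&self, f: &Field) -> String {
        format!("{0}_ = from.{0}_;", f.name)
    }
}

impl FieldGenerator for StringGen {
    fn accessor_declaration(&self, f: &Field) -> String {
        format!("const std::string& {}() const;", f.name)
    }
    fn merging_code(&self, f: &Field) -> String {
        format!("set_{0}(from.{0}());", f.name)
    }
}

impl FieldGenerator for MessageGen {
    fn accessor_declaration(&self, f: &Field) -> String {
        format!("const Msg& {}() const;", f.name)
    }
    fn merging_code(&self, f: &Field) -> String {
        format!("mutable_{0}()->MergeFrom(from.{0}());", f.name)
    }
}

/// Factory: the only place that switches on the field type.
pub fn make_generator(kind: FieldKind) -> Box<dyn FieldGenerator> {
    match kind {
        FieldKind::Int32 => Box::new(PrimitiveGen),
        FieldKind::String => Box::new(StringGen),
        FieldKind::Message => Box::new(MessageGen),
    }
}

/// Message-level orchestration: iterate fields and delegate to strategies.
pub fn generate_declarations(fields: &[Field]) -> Vec<String> {
    fields
        .iter()
        .map(|f| make_generator(f.kind).accessor_declaration(f))
        .collect()
}
```

The design point is that the message-level code never branches on field type after the factory call; supporting a new field variant means adding one strategy and one match arm.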

The CppGenerator class itself supports multiple runtime modes through its Runtime enum:

enum class Runtime {
    kGoogle3,           // Internal google3 runtime
    kOpensource,        // Open-source runtime
    kOpensourceGoogle3  // Open-source with google3 paths
};

This allows the same generator to produce code for both Google's internal build system and the open-source release, adjusting #include paths accordingly. The opensource_runtime_ flag and runtime_include_base_ string control this behavior.

HPB: A New C++ API on upb

HPB (Header-based Protobuf) represents a third C++ path — modern C++ ergonomics backed by upb's lightweight runtime rather than the full C++ protobuf library.

The main header reveals a dual-backend design similar to Rust's:

#if HPB_INTERNAL_BACKEND == HPB_INTERNAL_BACKEND_UPB
#include "hpb/backend/upb/upb.h"
#elif HPB_INTERNAL_BACKEND == HPB_INTERNAL_BACKEND_CPP
#include "hpb/backend/cpp/cpp.h"
#else
#error hpb backend unknown
#endif

HPB's API uses arena-based message creation and pointer-based access:

template <typename T>
typename T::Proxy CreateMessage(Arena& arena) {
    return backend::CreateMessage<T>(arena);
}

The Ptr<T> and Proxy types serve similar roles to Rust's View and Mut — they provide safe access to arena-allocated messages without exposing raw pointers. DeepCopy operations are explicit, making ownership transfer visible in the API.

flowchart TD
    subgraph "Traditional C++ Protobuf"
        TC["Full C++ runtime<br/>~MB code size<br/>Global state<br/>Reflection built-in"]
    end
    
    subgraph "HPB"
        HPB_API["Modern C++ API<br/>Small code size<br/>No global state<br/>Opt-in reflection"]
        HPB_API --> UPB_BE["upb backend"]
        HPB_API --> CPP_BE["C++ backend"]
    end
    
    subgraph "Raw upb"
        RAW["C API<br/>Minimal code size<br/>Manual MiniTable management"]
    end

HPB is still evolving, but it represents the protobuf team's vision for what a modern C++ protobuf API should look like: arena-centric, explicit about ownership, and backed by the compact upb runtime rather than the heavyweight C++ library.

Tip: If you're starting a new C++ project that uses protobuf and doesn't need backward compatibility with existing .pb.h APIs, consider evaluating HPB. It's less mature than the traditional API and still evolving, but its upb backend makes it significantly leaner.

Patterns Across Generators

Looking across all the generators, several patterns emerge:

  1. Backend abstraction: Both Rust and HPB support multiple runtime backends behind a unified API. This pattern is likely to spread as the protobuf team moves more languages toward upb.

  2. Proxy types for arena safety: Rust's View/Mut and HPB's Ptr/Proxy independently arrived at similar solutions for arena-owned data access.

  3. Strategy pattern for field types: The C++ generator's field generator hierarchy is the most explicit example, but every generator internally dispatches on field type.

  4. Editions support: Every modern generator implements GetMinimumEdition() / GetMaximumEdition() and uses the FeatureResolver infrastructure from Article 3.

In Article 7, we'll step back from implementation details and look at how protobuf maintains correctness across all these languages through its conformance test suite, failure tracking system, and CI infrastructure.