Navigating the Protobuf Monorepo: Architecture and Directory Guide

Beginner

Prerequisites

  • Basic understanding of what Protocol Buffers are and their purpose as a serialization format

The Protocol Buffers repository is one of the most consequential open-source projects in infrastructure software. It powers serialization across virtually every Google service and a significant fraction of the industry. Yet when you first git clone it, you're greeted with over 70 top-level directories, two distinct runtimes (one in C++, one in C), code generators for 10+ languages, and a build setup that spans both Bazel and CMake. This article is your map.

We'll orient you in the monorepo, explain the two core runtimes and why they both exist, walk through the compilation pipeline at a high level, and show you how each language runtime fits into the picture. By the end, you'll know exactly where to look when you want to understand any subsystem.

The Dual-World Architecture: C++ Protobuf vs μpb

The single most important architectural fact about this repository is that it contains two separate protobuf runtimes: the full C++ implementation and μpb (micro protobuf), a lightweight C implementation.

The C++ runtime lives in src/google/protobuf/ and provides the feature-rich experience most C++ developers know: generated message classes with typed accessors, full reflection, the Arena allocator, and high-performance tail-call table parsing. It's a large, sophisticated library optimized for throughput.

μpb lives in upb/ and takes the opposite approach. As its README explains, upb has "comparable speed to protobuf C++, but is an order of magnitude smaller in code size." It supports optional reflection, has no global state, and is specifically designed for embedding in dynamic language runtimes.

graph TD
    subgraph "C++ Protobuf Runtime"
        CPP[src/google/protobuf/]
        CPPGEN[C++ Generated Code]
    end

    subgraph "Standalone Runtimes"
        JAVA[Java Runtime]
        CSHARP[C# Runtime]
    end

    subgraph "μpb C Runtime"
        UPB[upb/]
        PY[Python C Extension]
        RB[Ruby C Extension]
        PHP[PHP C Extension]
        HPB[hpb/ - C++ API on upb]
    end

    subgraph "Dual-Kernel"
        RUST[Rust Runtime]
        RUST --> CPP
        RUST --> UPB
    end

    CPP --> CPPGEN
    UPB --> PY
    UPB --> RB
    UPB --> PHP
    UPB --> HPB

The division is clear: Python, Ruby, and PHP all use upb as their backend through C extension modules. Java and C# are standalone implementations. Rust uniquely supports both backends through a kernel abstraction layer. And HPB is a newer C++ API built on top of upb, offering a third path for C++ users who want smaller binaries.

Tip: When debugging a Python protobuf issue, don't look in src/google/protobuf/ — the Python runtime is backed by upb, so the relevant C code lives in upb/ and python/.

Directory Map: Finding Your Way

Here's an annotated map of the most important top-level directories:

| Directory | Purpose |
| --- | --- |
| src/google/protobuf/ | C++ protobuf runtime: messages, descriptors, reflection, arena, wire format |
| src/google/protobuf/compiler/ | The protoc compiler: CLI, parser, and all built-in code generators |
| upb/ | μpb C runtime: messages, MiniTable schema, arena, wire encoding/decoding |
| hpb/ | Header-based Protobuf, a modern C++ API built on upb |
| python/ | Python C extension wrapping upb |
| ruby/ | Ruby C extension wrapping upb |
| php/ | PHP C extension wrapping upb |
| java/ | Java protobuf runtime (standalone, does not use upb) |
| csharp/ | C# protobuf runtime (standalone) |
| rust/ | Rust protobuf runtime with dual-kernel support |
| objectivec/ | Objective-C runtime |
| conformance/ | Cross-language conformance test suite |
| editions/ | Protobuf Editions system (replacing proto2/proto3 syntax) |
| bazel/ | Bazel build rules for protobuf |
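
Both runtime rows in the table mention wire encoding, and whichever runtime produces a message, the bytes on the wire follow the same format. As a rough illustration of what that shared format looks like, here is a minimal standalone sketch of varint field encoding. The function names `AppendVarint` and `EncodeVarintField` are made up for this example and do not appear in either runtime.

```cpp
#include <cstdint>
#include <vector>

// Append `value` as a base-128 varint: 7 payload bits per byte,
// least-significant group first, high bit set on all bytes but the last.
void AppendVarint(std::uint64_t value, std::vector<std::uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(value));
}

// Encode one varint-typed field (wire type 0). The tag is
// (field_number << 3) | wire_type, and is itself written as a varint.
std::vector<std::uint8_t> EncodeVarintField(std::uint32_t field_number,
                                            std::uint64_t value) {
  constexpr std::uint64_t kWireTypeVarint = 0;
  std::vector<std::uint8_t> out;
  AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | kWireTypeVarint,
               out);
  AppendVarint(value, out);
  return out;
}
```

Calling `EncodeVarintField(1, 150)` yields the three bytes `08 96 01`. The production encoders in src/google/protobuf/ and upb/ are considerably more elaborate, but they target this same byte layout.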

The compiler's built-in code generators are nested under src/google/protobuf/compiler/, with each language getting its own subdirectory: cpp/, java/, python/, rust/, php/, ruby/, csharp/, objectivec/, and kotlin/.

The version.json file at the root tracks per-language versions for the release train. Each language can have its own version number (e.g., C++ 7.35-dev, Java 4.35-dev, Rust 4.35-dev), coordinated through a single protoc_version field. This is how the monorepo manages independent release cadences.

The Compilation Pipeline at 30,000 Feet

Every protobuf user starts with a .proto file and ends with generated code in their target language. The protoc compiler orchestrates that transformation. Here's the high-level flow:

flowchart LR
    A[".proto file"] --> B["Tokenizer<br/>(lexer)"]
    B --> C["Parser<br/>(AST builder)"]
    C --> D["FileDescriptorProto"]
    D --> E["DescriptorPool<br/>(validation + linking)"]
    E --> F["FileDescriptor<br/>(immutable schema)"]
    F --> G["CodeGenerator<br/>(language-specific)"]
    G --> H["Generated Source Files"]
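
To make the first stage of the flow above more concrete, here is a toy lexer for a small subset of .proto syntax. It is only a sketch: `TokenizeProto` is a hypothetical name, tokens are returned as plain strings, and the real tokenizer in protoc also handles floats, escape sequences, block comments, and error reporting.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy lexer: splits input into identifiers/integers, quoted strings,
// and single-character symbols. Whitespace and // comments are skipped.
std::vector<std::string> TokenizeProto(const std::string& src) {
  std::vector<std::string> tokens;
  std::size_t i = 0;
  while (i < src.size()) {
    const unsigned char c = static_cast<unsigned char>(src[i]);
    if (std::isspace(c)) {
      ++i;
      continue;
    }
    // Skip a line comment up to (not including) the newline.
    if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
      while (i < src.size() && src[i] != '\n') ++i;
      continue;
    }
    const std::size_t start = i;
    if (std::isalnum(c) || c == '_') {
      // Identifier, keyword, or integer literal.
      while (i < src.size() &&
             (std::isalnum(static_cast<unsigned char>(src[i])) ||
              src[i] == '_')) {
        ++i;
      }
    } else if (c == '"') {
      // String literal; escape sequences are not handled in this sketch.
      ++i;
      while (i < src.size() && src[i] != '"') ++i;
      if (i < src.size()) ++i;  // consume the closing quote
    } else {
      ++i;  // single-character symbol such as { } = ;
    }
    tokens.push_back(src.substr(start, i - start));
  }
  return tokens;
}
```

Feeding it `message Foo { int32 id = 1; }` produces the nine tokens `message`, `Foo`, `{`, `int32`, `id`, `=`, `1`, `;`, `}`, which is the kind of stream the parser stage consumes.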

The entry point is src/google/protobuf/compiler/main.cc. The ProtobufMain() function registers all built-in code generators and calls cli.Run():

// Proto2 C++
cpp::CppGenerator cpp_generator;
cli.RegisterGenerator("--cpp_out", "--cpp_opt", &cpp_generator,
                      "Generate C++ header and source.");

Each language gets a similar registration block — Java, Kotlin, Python, PHP, Ruby (including a new RBS type definition generator), C#, Objective-C, and Rust. After registration, cli.Run(argc, argv) handles argument parsing, .proto file loading, descriptor building, and code generation dispatch.
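
The registration pattern can be pictured as a table from output flag to generator object. The sketch below is a simplified stand-in, not the real CommandLineInterface: `ToyGenerator`, `ToyCli`, and the string-based `Generate` signature are all assumptions made for illustration, and the real interface operates on descriptors rather than file names.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a code generator interface.
class ToyGenerator {
 public:
  virtual ~ToyGenerator() = default;
  virtual std::string Generate(const std::string& proto_file) const = 0;
};

class ToyCppGenerator : public ToyGenerator {
 public:
  std::string Generate(const std::string& proto_file) const override {
    return proto_file + " -> .pb.h/.pb.cc";
  }
};

class ToyJavaGenerator : public ToyGenerator {
 public:
  std::string Generate(const std::string& proto_file) const override {
    return proto_file + " -> .java";
  }
};

// Hypothetical miniature of a flag-to-generator registry.
class ToyCli {
 public:
  void RegisterGenerator(const std::string& flag, const ToyGenerator* gen) {
    generators_[flag] = gen;
  }

  // Treats arguments shaped like --flag=dir as generator requests and
  // runs the matching generator. Unknown flags are ignored in this sketch.
  std::vector<std::string> Run(const std::vector<std::string>& args,
                               const std::string& proto_file) const {
    std::vector<std::string> outputs;
    for (const std::string& arg : args) {
      const std::string flag = arg.substr(0, arg.find('='));
      const auto it = generators_.find(flag);
      if (it != generators_.end()) {
        outputs.push_back(it->second->Generate(proto_file));
      }
    }
    return outputs;
  }

 private:
  std::map<std::string, const ToyGenerator*> generators_;
};
```

The point of the design is that adding a language means adding one registration, while the argument-handling loop stays untouched.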

The pipeline has two extension points. The first is the CodeGenerator abstract interface in code_generator.h, which all built-in generators implement. The second is the plugin system described in plugin.h, where external generators communicate with protoc via CodeGeneratorRequest/CodeGeneratorResponse protobufs over stdin/stdout. The plugin mechanism is what enables third-party languages (Go, Dart, Swift, etc.) to have their own protoc plugins without modifying the core compiler.
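
The shape of the plugin exchange can be sketched with plain structs. The field names below (`file_to_generate`, `parameter`, `error`, `file`, `name`, `content`) mirror a few fields of the real request and response messages, but the `Toy*` types and `HandleRequest` are hypothetical. A real plugin would parse a serialized CodeGeneratorRequest from stdin and write a serialized CodeGeneratorResponse to stdout instead of passing structs around.

```cpp
#include <string>
#include <vector>

// Plain-struct mirror of part of the request message.
struct ToyRequest {
  std::vector<std::string> file_to_generate;  // .proto files to generate for
  std::string parameter;                      // generator option string
};

struct ToyOutputFile {
  std::string name;
  std::string content;
};

// Plain-struct mirror of part of the response message.
struct ToyResponse {
  std::string error;  // non-empty means generation failed
  std::vector<ToyOutputFile> file;
};

// Hypothetical plugin body: emits one text file per input .proto file.
ToyResponse HandleRequest(const ToyRequest& request) {
  ToyResponse response;
  if (request.file_to_generate.empty()) {
    response.error = "no input files";
    return response;
  }
  for (const std::string& proto : request.file_to_generate) {
    response.file.push_back(
        {proto + ".txt", "generated from " + proto + " with parameter '" +
                             request.parameter + "'\n"});
  }
  return response;
}
```

Because the contract is just "request in, response out", the plugin can be written in any language that can read and write protobuf messages.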

Language Runtime Organization

Not all language runtimes are created equal. They fall into three distinct architectural categories, with HPB sitting alongside them as a newer C++ API layered on upb:

graph TD
    subgraph "C Extension Wrappers (upb backend)"
        PY["python/<br/>message.c, descriptor_pool.c"]
        RB["ruby/<br/>ext/google/protobuf_c/"]
        PHP_RT["php/<br/>ext/google/protobuf/"]
    end

    subgraph "Standalone Implementations"
        JAVA["java/<br/>Pure Java runtime"]
        CS["csharp/<br/>Pure C# runtime"]
        OBJC["objectivec/<br/>Objective-C runtime"]
    end

    subgraph "Dual-Kernel"
        RUST_RT["rust/<br/>cpp_kernel/ + upb_kernel/"]
    end

    subgraph "New C++ on upb"
        HPB_RT["hpb/<br/>Modern C++ API"]
    end

C Extension wrappers (Python, Ruby, PHP): These languages have thin C extension modules that delegate to upb for all heavy lifting. Looking at python/message.c, you can see it directly includes upb headers like upb/message/message.h, upb/wire/decode.h, and upb/reflection/message.h. The Python objects are thin wrappers around upb_Message pointers.

Standalone implementations (Java, C#, Objective-C): These maintain their own complete runtime implementations. Java's runtime in java/, for example, includes its own wire format codec, message builders, and reflection system, all implemented in pure Java.

Rust's dual-kernel approach is the most architecturally interesting. The rust/cpp_kernel/ and rust/upb_kernel/ directories provide alternative backends, allowing Rust protobuf to work with either the C++ runtime or upb underneath. As we'll explore in Article 6, this is accomplished through a proxy-based API design that abstracts over the differences.
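
The general idea of one public API over two interchangeable backends can be illustrated in C++ with a template parameter. This is only an analogy under stated assumptions: `ToyMapKernel`, `ToyVectorKernel`, and `ToyMessage` are invented names, and the sketch does not reproduce the proxy types or the actual kernel selection mechanism used by the Rust runtime.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backend A: fields stored in an ordered map.
struct ToyMapKernel {
  std::map<int, std::int64_t> fields;
  void Set(int number, std::int64_t value) { fields[number] = value; }
  std::int64_t Get(int number) const {
    const auto it = fields.find(number);
    return it == fields.end() ? 0 : it->second;
  }
  static const char* Name() { return "map"; }
};

// Hypothetical backend B: fields stored in a flat vector.
struct ToyVectorKernel {
  std::vector<std::pair<int, std::int64_t>> fields;
  void Set(int number, std::int64_t value) {
    for (auto& f : fields) {
      if (f.first == number) {
        f.second = value;
        return;
      }
    }
    fields.emplace_back(number, value);
  }
  std::int64_t Get(int number) const {
    for (const auto& f : fields) {
      if (f.first == number) return f.second;
    }
    return 0;  // unset fields read as the default value
  }
  static const char* Name() { return "vector"; }
};

// User-facing type: the same accessors regardless of the backend chosen.
template <class Kernel>
class ToyMessage {
 public:
  void set_id(std::int64_t value) { kernel_.Set(1, value); }
  std::int64_t id() const { return kernel_.Get(1); }
  static std::string kernel_name() { return Kernel::Name(); }

 private:
  Kernel kernel_;
};
```

Code written against `ToyMessage` compiles unchanged with either kernel, which is the property a dual-kernel design aims for.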

HPB (hpb/) is a newer entry — a modern C++ API layer built on top of upb. It offers a cleaner C++ interface without the weight of the full C++ protobuf runtime, while still getting upb's small code size.

Build System and Dependencies

The protobuf repository uses Bazel as its primary build system, with CMake as a secondary option. The MODULE.bazel file defines the module as protobuf version 35.0-dev and requires Bazel 8.0.0 or later.

Key dependencies include:

| Dependency | Version | Purpose |
| --- | --- | --- |
| abseil-cpp | 20250512.1 | Core C++ utilities (strings, containers, synchronization) |
| zlib | 1.3.1 | Compression support |
| jsoncpp | 1.9.6 | JSON parsing |
| rules_java | 8.6.1 | Java build rules |
| rules_python | 1.6.0 | Python build rules |
| rules_rust | 0.63.0 | Rust build rules |

flowchart TD
    PB["protobuf module"]
    PB --> ABSEIL["abseil-cpp"]
    PB --> ZLIB["zlib"]
    PB --> JSON["jsoncpp"]
    PB --> RJ["rules_java"]
    PB --> RP["rules_python"]
    PB --> RR["rules_rust"]
    PB --> RRB["rules_ruby"]
    PB --> CC["rules_cc"]
    PB --> SK["bazel_skylib"]

The CI infrastructure uses GitHub Actions with per-language workflow files. You'll find test_cpp.yml, test_java.yml, test_python.yml, test_rust.yml, and many more in .github/workflows/. The test_runner.yml orchestrates these, running on pushes to main, pull requests, and hourly scheduled runs.

Tip: If you're building protobuf locally, Bazel is strongly preferred. The CMakeLists.txt exists for downstream consumers who need CMake integration, but the canonical build uses Bazel targets throughout.

Orienting Yourself for What's Next

You now have a mental model of the protobuf monorepo: two runtimes (C++ and upb), a compiler with pluggable code generators, language runtimes in three architectural styles, and a Bazel-first build system. Every subsequent article in this series will build on this map.

In Article 2, we'll zoom into the protoc compiler itself — tracing the full path from a .proto file through lexing, parsing, descriptor validation, and code generation dispatch. We'll see how CommandLineInterface::Run() orchestrates a surprisingly complex 4000-line compilation pipeline.