Navigating the Protobuf Monorepo: Architecture and Directory Guide
Prerequisites
- Basic understanding of what Protocol Buffers are and their purpose as a serialization format
The Protocol Buffers repository is one of the most consequential open-source projects in infrastructure software. It powers serialization across virtually every Google service and a significant fraction of the industry. Yet when you first git clone it, you're greeted with over 70 top-level directories, two distinct runtimes (one in C++, one in C), code generators for 10+ languages, and a build setup that spans both Bazel and CMake. This article is your map.
We'll orient you in the monorepo, explain the two core runtimes and why they both exist, walk through the compilation pipeline at a high level, and show you how each language runtime fits into the picture. By the end, you'll know exactly where to look when you want to understand any subsystem.
The Dual-World Architecture: C++ Protobuf vs μpb
The single most important architectural fact about this repository is that it contains two separate protobuf runtimes: the full C++ implementation and μpb (micro protobuf), a lightweight C implementation.
The C++ runtime lives in src/google/protobuf/ and provides the feature-rich experience most C++ developers know: generated message classes with typed accessors, full reflection, the Arena allocator, and high-performance tail-call table parsing. It's a large, sophisticated library optimized for throughput.
μpb lives in upb/ and takes the opposite approach. As its README explains, upb has "comparable speed to protobuf C++, but is an order of magnitude smaller in code size." It supports optional reflection, has no global state, and is specifically designed for embedding in dynamic language runtimes.
graph TD
subgraph "C++ Protobuf Runtime"
CPP[src/google/protobuf/]
CPPGEN[C++ Generated Code]
end
subgraph "μpb C Runtime"
UPB[upb/]
PY[Python C Extension]
RB[Ruby C Extension]
PHP[PHP C Extension]
HPB[hpb/ - C++ API on upb]
end
subgraph "Standalone Runtimes"
JAVA[Java Runtime]
CSHARP[C# Runtime]
end
subgraph "Dual-Kernel"
RUST[Rust Runtime]
RUST --> CPP
RUST --> UPB
end
CPP --> CPPGEN
UPB --> PY
UPB --> RB
UPB --> PHP
UPB --> HPB
The division is clear: Python, Ruby, and PHP all use upb as their backend through C extension modules. Java and C# are standalone implementations. Rust uniquely supports both backends through a kernel abstraction layer. And HPB is a newer C++ API built on top of upb, offering a third path for C++ users who want smaller binaries.
Tip: When debugging a Python protobuf issue, don't look in src/google/protobuf/. The Python runtime is backed by upb, so the relevant C code lives in upb/ and python/.
Directory Map: Finding Your Way
Here's an annotated map of the most important top-level directories:
| Directory | Purpose |
|---|---|
| src/google/protobuf/ | C++ protobuf runtime: messages, descriptors, reflection, arena, wire format |
| src/google/protobuf/compiler/ | The protoc compiler: CLI, parser, and all built-in code generators |
| upb/ | μpb C runtime: messages, MiniTable schema, arena, wire encoding/decoding |
| hpb/ | Header-based Protobuf — a modern C++ API built on upb |
| python/ | Python C extension wrapping upb |
| ruby/ | Ruby C extension wrapping upb |
| php/ | PHP C extension wrapping upb |
| java/ | Java protobuf runtime (standalone, does not use upb) |
| csharp/ | C# protobuf runtime (standalone) |
| rust/ | Rust protobuf runtime with dual-kernel support |
| objectivec/ | Objective-C runtime |
| conformance/ | Cross-language conformance test suite |
| editions/ | Protobuf Editions system (replacing proto2/proto3 syntax) |
| bazel/ | Bazel build rules for protobuf |
The built-in code generators are nested under src/google/protobuf/compiler/, with each language getting its own subdirectory: cpp/, java/, python/, rust/, php/, ruby/, csharp/, objectivec/, and kotlin/.
The version.json file at the root tracks per-language versions for the release train. Each language can have its own version number (e.g., C++ 7.35-dev, Java 4.35-dev, Rust 4.35-dev), coordinated through a single protoc_version field. This is how the monorepo manages independent release cadences.
The Compilation Pipeline at 30,000 Feet
Every protobuf user starts with a .proto file and ends with generated code in their target language. The protoc compiler orchestrates that transformation. Here's the high-level flow:
flowchart LR
A[".proto file"] --> B["Tokenizer<br/>(lexer)"]
B --> C["Parser<br/>(AST builder)"]
C --> D["FileDescriptorProto"]
D --> E["DescriptorPool<br/>(validation + linking)"]
E --> F["FileDescriptor<br/>(immutable schema)"]
F --> G["CodeGenerator<br/>(language-specific)"]
G --> H["Generated Source Files"]
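To make the first stage of this flow concrete, here is a minimal sketch of a lexer for .proto-like text. It is a toy written for this article, not protoc's actual tokenizer: the names TokenKind, Token, and Tokenize are hypothetical, and comments, string literals, and floating-point numbers are deliberately left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories for this sketch. A production tokenizer
// would also distinguish strings, floats, and comments.
enum class TokenKind { kIdentifier, kInteger, kSymbol };

struct Token {
  TokenKind kind;
  std::string text;
};

// Splits .proto-like text into identifiers, integers, and single-character
// symbols. Whitespace is skipped; nothing else is interpreted.
std::vector<Token> Tokenize(const std::string& input) {
  std::vector<Token> tokens;
  std::size_t i = 0;
  while (i < input.size()) {
    const unsigned char c = static_cast<unsigned char>(input[i]);
    if (std::isspace(c)) {
      ++i;
    } else if (std::isalpha(c) || c == '_') {
      // Identifier or keyword: a letter or underscore, then letters,
      // digits, or underscores.
      const std::size_t start = i;
      while (i < input.size() &&
             (std::isalnum(static_cast<unsigned char>(input[i])) ||
              input[i] == '_')) {
        ++i;
      }
      tokens.push_back({TokenKind::kIdentifier, input.substr(start, i - start)});
    } else if (std::isdigit(c)) {
      // Decimal integer literal: a run of digits.
      const std::size_t start = i;
      while (i < input.size() &&
             std::isdigit(static_cast<unsigned char>(input[i]))) {
        ++i;
      }
      tokens.push_back({TokenKind::kInteger, input.substr(start, i - start)});
    } else {
      // Everything else is emitted as a one-character symbol.
      tokens.push_back({TokenKind::kSymbol, std::string(1, input[i])});
      ++i;
    }
  }
  return tokens;
}
```

Calling Tokenize("message Foo { int32 id = 1; }") yields nine tokens: the identifiers message, Foo, int32, and id, the integer 1, and the symbols {, =, ;, and }. A parser would consume this stream to build the FileDescriptorProto shown in the next stage.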
The entry point is src/google/protobuf/compiler/main.cc. The ProtobufMain() function registers all built-in code generators and calls cli.Run():
// Proto2 C++
cpp::CppGenerator cpp_generator;
cli.RegisterGenerator("--cpp_out", "--cpp_opt", &cpp_generator,
"Generate C++ header and source.");
Each language gets a similar registration block — Java, Kotlin, Python, PHP, Ruby (including a new RBS type definition generator), C#, Objective-C, and Rust. After registration, cli.Run(argc, argv) handles argument parsing, .proto file loading, descriptor building, and code generation dispatch.
The pipeline has two extension points. The first is the CodeGenerator abstract interface in code_generator.h, which all built-in generators implement. The second is the plugin system described in plugin.h, where external generators communicate with protoc via CodeGeneratorRequest/CodeGeneratorResponse protobufs over stdin/stdout. The plugin mechanism is what enables third-party languages (Go, Dart, Swift, etc.) to have their own protoc plugins without modifying the core compiler.
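The two extension points can be modeled with a small, self-contained sketch. Everything here is a simplified stand-in: ToyGenerator, ToyCppGenerator, ToyCli, and PluginNameForFlag are hypothetical names, and the real CodeGenerator interface works on descriptors rather than bare file names. The one detail taken from protoc itself is the plugin naming convention, where an output flag of the form --NAME_out with no built-in generator is served by an executable named protoc-gen-NAME.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the abstract interface. The real CodeGenerator in
// code_generator.h works on descriptors, not bare file names.
class ToyGenerator {
 public:
  virtual ~ToyGenerator() = default;
  virtual std::string Generate(const std::string& proto_file) const = 0;
};

// A hypothetical built-in generator.
class ToyCppGenerator : public ToyGenerator {
 public:
  std::string Generate(const std::string& proto_file) const override {
    return "// C++ header for " + proto_file;
  }
};

// Simplified model of flag-based registration and lookup.
class ToyCli {
 public:
  void RegisterGenerator(const std::string& flag,
                         const ToyGenerator* generator) {
    assert(generator != nullptr);
    generators_[flag] = generator;
  }

  // Returns the built-in generator for `flag`, or nullptr if none exists.
  const ToyGenerator* FindBuiltin(const std::string& flag) const {
    auto it = generators_.find(flag);
    return it == generators_.end() ? nullptr : it->second;
  }

 private:
  std::map<std::string, const ToyGenerator*> generators_;
};

// Maps "--NAME_out" to the plugin executable name "protoc-gen-NAME".
// Returns an empty string if the flag does not have that shape.
std::string PluginNameForFlag(const std::string& flag) {
  const std::string prefix = "--";
  const std::string suffix = "_out";
  if (flag.size() <= prefix.size() + suffix.size()) return "";
  if (flag.compare(0, prefix.size(), prefix) != 0) return "";
  if (flag.compare(flag.size() - suffix.size(), suffix.size(), suffix) != 0) {
    return "";
  }
  return "protoc-gen-" +
         flag.substr(prefix.size(),
                     flag.size() - prefix.size() - suffix.size());
}
```

In this model, a caller would first try FindBuiltin("--cpp_out") and, when a flag such as --go_out has no built-in generator, fall back to launching the executable named by PluginNameForFlag. The fallback order is an assumption of the sketch, chosen to mirror how the two extension points divide the work.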
Language Runtime Organization
Not all language runtimes are created equal. They fall into three distinct architectural categories, with HPB sitting alongside them as a newer C++ API layered on upb:
graph TD
subgraph "C Extension Wrappers (upb backend)"
PY["python/<br/>message.c, descriptor_pool.c"]
RB["ruby/<br/>ext/google/protobuf_c/"]
PHP_RT["php/<br/>ext/google/protobuf/"]
end
subgraph "Standalone Implementations"
JAVA["java/<br/>Pure Java runtime"]
CS["csharp/<br/>Pure C# runtime"]
OBJC["objectivec/<br/>Objective-C runtime"]
end
subgraph "Dual-Kernel"
RUST_RT["rust/<br/>cpp_kernel/ + upb_kernel/"]
end
subgraph "New C++ on upb"
HPB_RT["hpb/<br/>Modern C++ API"]
end
C Extension wrappers (Python, Ruby, PHP): These languages have thin C extension modules that delegate to upb for all heavy lifting. Looking at python/message.c, you can see it directly includes upb headers like upb/message/message.h, upb/wire/decode.h, and upb/reflection/message.h. The Python objects are thin wrappers around upb_Message pointers.
Standalone implementations (Java, C#): These maintain their own complete runtime implementations. Java's runtime in java/ includes its own wire format codec, message builders, and reflection system, all implemented in pure Java.
Rust's dual-kernel approach is the most architecturally interesting. The rust/cpp_kernel/ and rust/upb_kernel/ directories provide alternative backends, allowing Rust protobuf to work with either the C++ runtime or upb underneath. As we'll explore in Article 6, this is accomplished through a proxy-based API design that abstracts over the differences.
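The idea of one user-facing API over interchangeable backends can be illustrated with a C++ analogy. This is not how the Rust crate is implemented: ToyCppKernel, ToyUpbKernel, and ToyMessage are hypothetical names, and the sketch assumes the backend is fixed at build time, which is modeled here with a template parameter.

```cpp
#include <cstdint>
#include <string>

// Hypothetical backend standing in for a C++-runtime kernel.
struct ToyCppKernel {
  struct Storage {
    std::int32_t id = 0;
  };
  static std::string Name() { return "cpp"; }
  static void SetId(Storage& s, std::int32_t v) { s.id = v; }
  static std::int32_t GetId(const Storage& s) { return s.id; }
};

// Hypothetical backend standing in for a upb kernel. It keeps the field in
// a differently shaped storage to show that layouts may differ per backend.
struct ToyUpbKernel {
  struct Storage {
    std::int32_t fields[1] = {0};
  };
  static std::string Name() { return "upb"; }
  static void SetId(Storage& s, std::int32_t v) { s.fields[0] = v; }
  static std::int32_t GetId(const Storage& s) { return s.fields[0]; }
};

// The user-facing type: callers see set_id()/id() and never touch the
// backend's storage directly.
template <typename Kernel>
class ToyMessage {
 public:
  void set_id(std::int32_t v) { Kernel::SetId(storage_, v); }
  std::int32_t id() const { return Kernel::GetId(storage_); }
  static std::string backend() { return Kernel::Name(); }

 private:
  typename Kernel::Storage storage_;
};
```

Code written against ToyMessage<K> behaves the same for either kernel, so swapping the backend changes the storage and the linked runtime but not the calling code. That is the property the dual-kernel design is after.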
HPB (hpb/) is a newer entry — a modern C++ API layer built on top of upb. It offers a cleaner C++ interface without the weight of the full C++ protobuf runtime, while still getting upb's small code size.
Build System and Dependencies
The protobuf repository uses Bazel as its primary build system, with CMake as a secondary option. The MODULE.bazel file defines the module as protobuf version 35.0-dev and requires Bazel 8.0.0 or later.
Key dependencies include:
| Dependency | Version | Purpose |
|---|---|---|
| abseil-cpp | 20250512.1 | Core C++ utilities (strings, containers, synchronization) |
| zlib | 1.3.1 | Compression support |
| jsoncpp | 1.9.6 | JSON parsing |
| rules_java | 8.6.1 | Java build rules |
| rules_python | 1.6.0 | Python build rules |
| rules_rust | 0.63.0 | Rust build rules |
flowchart TD
PB["protobuf module"]
PB --> ABSEIL["abseil-cpp"]
PB --> ZLIB["zlib"]
PB --> JSON["jsoncpp"]
PB --> RJ["rules_java"]
PB --> RP["rules_python"]
PB --> RR["rules_rust"]
PB --> RRB["rules_ruby"]
PB --> CC["rules_cc"]
PB --> SK["bazel_skylib"]
The CI infrastructure uses GitHub Actions with per-language workflow files. You'll find test_cpp.yml, test_java.yml, test_python.yml, test_rust.yml, and many more in .github/workflows/. The test_runner.yml orchestrates these, running on pushes to main, pull requests, and hourly scheduled runs.
Tip: If you're building protobuf locally, Bazel is strongly preferred. The CMakeLists.txt exists for downstream consumers who need CMake integration, but the canonical build uses Bazel targets throughout.
Orienting Yourself for What's Next
You now have a mental model of the protobuf monorepo: two runtimes (C++ and upb), a compiler with pluggable code generators, language runtimes in three architectural styles, and a Bazel-first build system. Every subsequent article in this series will build on this map.
In Article 2, we'll zoom into the protoc compiler itself — tracing the full path from a .proto file through lexing, parsing, descriptor validation, and code generation dispatch. We'll see how CommandLineInterface::Run() orchestrates a surprisingly complex 4000-line compilation pipeline.