Protocol Buffers Source Code: A Map of the Territory
Prerequisites
- Basic familiarity with what Protocol Buffers are and how .proto files work
- Comfort reading C++ header files
The Protocol Buffers repository is one of the most widely deployed open-source projects in existence. Nearly every major distributed system at Google, and many outside it, serializes data through protobuf. The repository that produces the compiler and runtimes sprawls across ~200k lines of C++, plus substantial codebases in Java, Python, Ruby, PHP, Rust, C#, Objective-C, and Kotlin. If you've ever wanted to understand how protoc turns a .proto file into working code, or how the runtime parses a message in nanoseconds, this series will walk you through every layer.
This first article is your map. We'll establish the mental model, tour the directory structure, understand why there are two C/C++ runtimes, and trace the main entry point. By the end, you'll be able to open any file in the repository and know where it fits.
The Three-Layer Architecture
Every protobuf interaction flows through three conceptual layers. Understanding these layers is the single most important thing for navigating the codebase.
```mermaid
flowchart TB
    subgraph Schema["Schema Layer"]
        A[".proto files"] --> B["Descriptor Graph<br/>(FileDescriptor, Descriptor, FieldDescriptor)"]
    end
    subgraph Codegen["Code Generation Layer"]
        B --> C["protoc + Language Generators"]
        C --> D["Generated Source Files<br/>(.pb.h/.pb.cc, .java, .py, etc.)"]
    end
    subgraph Runtime["Runtime Layer"]
        D --> E["Generated Code + Runtime Library"]
        E --> F["Serialize / Parse / Reflect"]
    end
```
The Schema Layer defines the language-neutral type system. A .proto file is parsed into a FileDescriptorProto, then built into an immutable FileDescriptor within a DescriptorPool. This descriptor graph — rooted in src/google/protobuf/descriptor.h — is the canonical in-memory representation of every message, field, enum, and service.
The Code Generation Layer reads descriptors and emits language-specific source files. The protoc compiler orchestrates this, dispatching to registered CodeGenerator implementations. Each generator knows how to emit idiomatic code for its target language.
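To make the schema-to-source step concrete, here is a minimal sketch of a generator that walks a schema description and emits source text. `ToyField`, `ToyMessage`, and `GenerateStruct` are hypothetical stand-ins invented for this illustration; the real descriptor classes and the `CodeGenerator` interface are far richer.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins for the real descriptor classes.
struct ToyField {
  std::string cpp_type;  // target-language type, e.g. "int32_t"
  std::string name;      // field name from the schema
  int number;            // field number from the schema
};

struct ToyMessage {
  std::string name;
  std::vector<ToyField> fields;
};

// A toy "generator": reads the schema description and emits C++ source text.
std::string GenerateStruct(const ToyMessage& msg) {
  std::string out = "struct " + msg.name + " {\n";
  for (const ToyField& f : msg.fields) {
    out += "  " + f.cpp_type + " " + f.name + ";  // field " +
           std::to_string(f.number) + "\n";
  }
  out += "};\n";
  return out;
}
```

Calling `GenerateStruct` on a `ToyMessage` named `User` with an `id` field and a `name` field returns the text of a two-member struct. The point is only the shape of the data flow: the generator never sees the .proto text, only the already-built schema objects.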
The Runtime Layer is what end users actually link against. It provides base classes (MessageLite, Message), serialization/parsing logic, arena allocation, and reflection. Generated code inherits from these base classes and plugs into the runtime's parsing tables.
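The relationship between a runtime base class and generated code can be sketched as follows. This is a toy model under stated assumptions: `ToyMessageLite` and `Point` are hypothetical names, the class imagines generated code for a message with a single `uint32 x = 1;` field, and it hand-rolls the varint encoding instead of using the runtime's parsing tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a runtime base class; not the real MessageLite.
class ToyMessageLite {
 public:
  virtual ~ToyMessageLite() = default;
  virtual std::string SerializeAsString() const = 0;
  virtual bool ParseFromString(const std::string& data) = 0;
};

// What "generated code" for `message Point { uint32 x = 1; }` might look
// like in this toy model.
class Point final : public ToyMessageLite {
 public:
  uint32_t x() const { return x_; }
  void set_x(uint32_t v) { x_ = v; }

  std::string SerializeAsString() const override {
    std::string out;
    out.push_back(static_cast<char>(0x08));  // tag: field 1, wire type 0
    uint32_t v = x_;
    while (v >= 0x80) {  // emit 7 bits per byte, low bits first
      out.push_back(static_cast<char>((v & 0x7F) | 0x80));
      v >>= 7;
    }
    out.push_back(static_cast<char>(v));
    return out;
  }

  bool ParseFromString(const std::string& data) override {
    if (data.empty() || data[0] != 0x08) return false;
    uint32_t value = 0;
    int shift = 0;
    for (std::size_t i = 1; i < data.size(); ++i) {
      if (shift >= 35) return false;  // more than 5 bytes for a uint32
      const uint8_t byte = static_cast<uint8_t>(data[i]);
      value |= static_cast<uint32_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) {
        if (i + 1 != data.size()) return false;  // unexpected trailing bytes
        x_ = value;
        return true;
      }
      shift += 7;
    }
    return false;  // truncated varint
  }

 private:
  uint32_t x_ = 0;
};
```

Code that only knows about `ToyMessageLite` can serialize or parse a `Point` through the base-class interface, which is the same division of labor the real runtime relies on.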
This separation is what makes protobuf language-neutral: the schema layer is shared, and each language gets its own generator and runtime.
Directory Structure Walkthrough
The repository is large, but its top-level layout follows the three-layer model cleanly.
| Directory | Layer | Purpose |
|---|---|---|
| src/google/protobuf/ | Schema + Runtime | Core C++ runtime, descriptor system, arena, parsing |
| src/google/protobuf/compiler/ | Code Generation | protoc CLI, parser, and all built-in language generators |
| upb/ | Runtime (C) | µpb, the lightweight C runtime used by dynamic languages |
| hpb/ | Runtime (C++) | Ergonomic C++ wrapper over µpb |
| upb_generator/ | Code Generation | Code generator for µpb mini-tables |
| hpb_generator/ | Code Generation | Code generator for HPB C++ wrappers |
| java/, python/, ruby/, php/, rust/, csharp/, objectivec/ | Runtime | Per-language runtime libraries |
| conformance/ | Testing | Cross-language conformance test suite |
| editions/ | Schema | Editions feature definitions and test data |
| docs/ | Documentation | Design docs, upb guides |
| pkg/ | Build/Packaging | Distribution packaging, file list generation |
| .github/workflows/ | CI | 20+ GitHub Actions workflows |
The C++ compiler and runtime live under src/google/protobuf/. Within that, compiler/ holds the front-end (tokenizer, parser, CLI) and subdirectories for each language generator (compiler/cpp/, compiler/java/, compiler/python/, etc.).
Tip: The compiler/ subdirectory contains generators for all built-in languages, not just C++. When you see compiler/rust/generator.h, that's the Rust code generator: still written in C++, still part of the protoc binary.
Dual-Runtime Strategy: C++ vs. µpb
One of the most surprising things about this repository is that it contains two separate C/C++ runtimes. Understanding why is essential.
```mermaid
flowchart LR
    subgraph Full["C++ Runtime (src/google/protobuf/)"]
        direction TB
        ML[MessageLite] --> M[Message]
        M --> R[Reflection]
        M --> AR[Arena Allocation]
        M --> TC[TcParser]
    end
    subgraph Micro["µpb Runtime (upb/)"]
        direction TB
        UM[upb_Message] --> MT[upb_MiniTable]
        UM --> UA[upb_Arena]
        UM --> UD[upb_Decode / upb_Encode]
    end
    Full -.->|"Used by: C++ apps"| U1[C++ Users]
    Micro -.->|"Wrapped by: PHP, Ruby, Python"| U2[Dynamic Languages]
```
The C++ runtime is a full-featured, user-facing library. It provides reflection, lazy fields, dynamic messages, a thread-safe arena, and the high-performance TcParser. It's what C++ applications link against directly.
µpb (micro protobuf, in upb/) is a minimal C kernel. It prioritizes small code size and a stable C ABI over features. As the docs explain in docs/upb/vs-cpp-protos.md:
- C++ protobuf is a user-level library designed for direct use by C++ applications
- µpb is designed primarily to be wrapped by other languages — a C kernel that forms the basis for building language-specific protobuf libraries
PHP, Ruby, and Python all use µpb as their serialization kernel, calling into it through native extensions or FFI bindings. This is why those language directories exist alongside upb/: they contain the glue code that wraps µpb into idiomatic APIs.
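The "C kernel wrapped by a host language" pattern can be illustrated with a small sketch. Everything here is hypothetical: the `toy_msg_*` functions are not µpb's API, and `WrappedMsg` merely plays the role that a PHP, Ruby, or Python binding would play on top of the real kernel.

```cpp
#include <cstdint>
#include <cstdlib>

// A toy C-style kernel: a plain struct plus free functions with C linkage.
// All names are invented for illustration; this is not the upb API.
extern "C" {
struct toy_msg {
  int32_t value;
};

toy_msg* toy_msg_new(void) {
  return static_cast<toy_msg*>(std::calloc(1, sizeof(toy_msg)));
}
void toy_msg_set_value(toy_msg* m, int32_t v) { m->value = v; }
int32_t toy_msg_value(const toy_msg* m) { return m->value; }
void toy_msg_free(toy_msg* m) { std::free(m); }
}

// The idiomatic wrapper a language binding would expose to its users.
// It owns the kernel object and hides the C calls behind methods.
class WrappedMsg {
 public:
  WrappedMsg() : raw_(toy_msg_new()) {}
  ~WrappedMsg() { toy_msg_free(raw_); }
  WrappedMsg(const WrappedMsg&) = delete;
  WrappedMsg& operator=(const WrappedMsg&) = delete;

  bool ok() const { return raw_ != nullptr; }  // false if allocation failed
  void set_value(int32_t v) { toy_msg_set_value(raw_, v); }
  int32_t value() const { return toy_msg_value(raw_); }

 private:
  toy_msg* raw_;
};
```

Because the kernel exposes only C functions and plain data, any language with a C extension mechanism can wrap it the same way, which is the design motivation the µpb docs describe.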
The code size difference is dramatic. For a binary that parses and serializes descriptor.proto, µpb's .text section is 26 KiB versus 983 KiB for the full C++ runtime, a gap of roughly 38x.
Entry Points: protoc and the Plugin System
The protoc binary starts in src/google/protobuf/compiler/main.cc. The ProtobufMain() function is surprisingly readable — it instantiates a CommandLineInterface, registers every built-in generator, and calls cli.Run():
```mermaid
flowchart TD
    A["ProtobufMain()"] --> B["Create CommandLineInterface"]
    B --> C["AllowPlugins('protoc-')"]
    C --> D["Register 11 built-in generators"]
    D --> E["cli.Run(argc, argv)"]
    E --> F{"--X_out flag?"}
    F -->|"Built-in"| G["Invoke registered CodeGenerator"]
    F -->|"Unknown"| H["Find protoc-gen-X plugin binary"]
    H --> I["Pipe CodeGeneratorRequest via stdin"]
    I --> J["Read CodeGeneratorResponse from stdout"]
```
The 11 registered generators are:
| Flag | Generator | Language |
|---|---|---|
| --cpp_out | CppGenerator | C++ |
| --java_out | JavaGenerator | Java |
| --kotlin_out | KotlinGenerator | Kotlin |
| --python_out | Generator | Python |
| --pyi_out | PyiGenerator | Python stubs |
| --php_out | Generator | PHP |
| --ruby_out | Generator | Ruby |
| --rbs_out | RBSGenerator | Ruby type defs |
| --csharp_out | Generator | C# |
| --objc_out | ObjectiveCGenerator | Objective-C |
| --rust_out | RustGenerator | Rust |
For languages not built in — Go, Dart, or any third-party language — protoc uses the plugin protocol. When it sees an unrecognized --foo_out flag, it searches PATH for protoc-gen-foo, serializes a CodeGeneratorRequest to its stdin, and reads a CodeGeneratorResponse from stdout. The PluginMain() function in plugin.h provides a turnkey entry point for writing such plugins in C++.
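The built-in versus plugin decision described above can be modeled in a few lines. This is a sketch, not protoc's actual code: `FlagKind`, `Resolution`, and `ResolveOutputFlag` are hypothetical names, and the map stands in for whatever registry `CommandLineInterface` keeps internally.

```cpp
#include <map>
#include <string>

enum class FlagKind { kNotOutputFlag, kBuiltin, kPlugin };

struct Resolution {
  FlagKind kind;
  std::string target;  // generator name, or plugin binary to find on PATH
};

// `builtins` maps flags such as "--cpp_out" to a generator name.
// `plugin_prefix` plays the role of the argument to AllowPlugins().
Resolution ResolveOutputFlag(const std::string& flag,
                             const std::map<std::string, std::string>& builtins,
                             const std::string& plugin_prefix) {
  const std::string prefix = "--";
  const std::string suffix = "_out";
  const bool is_output_flag =
      flag.size() > prefix.size() + suffix.size() &&
      flag.compare(0, prefix.size(), prefix) == 0 &&
      flag.compare(flag.size() - suffix.size(), suffix.size(), suffix) == 0;
  if (!is_output_flag) return {FlagKind::kNotOutputFlag, ""};

  // A registered generator wins over any plugin of the same name.
  auto it = builtins.find(flag);
  if (it != builtins.end()) return {FlagKind::kBuiltin, it->second};

  // Otherwise derive the plugin binary name: "--foo_out" -> "protoc-gen-foo".
  const std::string name = flag.substr(
      prefix.size(), flag.size() - prefix.size() - suffix.size());
  return {FlagKind::kPlugin, plugin_prefix + "gen-" + name};
}
```

With a registry containing only `--cpp_out` and the prefix `protoc-`, resolving `--go_out` yields the plugin name `protoc-gen-go`, while `--cpp_out` resolves to the registered generator.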
Tip: The plugin protocol means you can write a protobuf code generator in any language, without any C++ dependency. Your binary just needs to read/write protobuf messages on stdio. This is why protoc-gen-go is written in Go.
Navigating the Codebase
A few practical patterns will help you move quickly through the repo.
Version tracking uses version.json, which tracks independent version numbers per language. At the time of this writing, the C++ runtime is at 7.35-dev, Java at 4.35-dev, and protoc itself at 35-dev. These diverge because language runtimes evolve at different paces.
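The version numbers quoted above suggest a simple composition rule: each language keeps its own major version and appends the shared protoc version. The helper below encodes that reading as an assumption; `LanguageVersion` is a hypothetical function and not something version.json or the build scripts define.

```cpp
#include <string>

// Assumption: a language runtime version is "<language major>.<protoc version>".
// For example, major 7 with protoc "35-dev" gives "7.35-dev".
std::string LanguageVersion(int language_major,
                            const std::string& protoc_version) {
  return std::to_string(language_major) + "." + protoc_version;
}
```

Under this rule a new protoc release moves every runtime's minor component together, while a breaking change in one language bumps only that language's major number.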
Build system duality is a key characteristic. Bazel is the canonical build system, defined in MODULE.bazel with all external dependencies (Abseil, rules_cc, zlib, etc.). CMake is supported as a secondary system for ecosystem compatibility. The bridge between them is the gen_file_lists rule in pkg/BUILD.bazel, which generates file lists from Bazel targets that CMake then consumes.
```mermaid
flowchart LR
    A["Bazel BUILD files<br/>(canonical)"] --> B["gen_file_lists rule<br/>pkg/BUILD.bazel"]
    B --> C["src_file_lists.cmake"]
    C --> D["CMakeLists.txt<br/>(secondary)"]
```
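The idea behind the bridge is that one system's source list is rendered as text the other system can include. The sketch below shows that idea only; `RenderCMakeList`, the variable name passed to it, and the `${protobuf_SOURCE_DIR}` path prefix are illustrative assumptions, not a description of what gen_file_lists actually emits.

```cpp
#include <string>
#include <vector>

// Renders a CMake `set(<var> ...)` block from a list of repo-relative paths.
// The output layout is a guess made for illustration.
std::string RenderCMakeList(const std::string& var,
                            const std::vector<std::string>& files) {
  std::string out = "set(" + var + "\n";
  for (const std::string& f : files) {
    out += "  ${protobuf_SOURCE_DIR}/" + f + "\n";
  }
  out += ")\n";
  return out;
}
```

Generating the list from the canonical build graph means the CMake side never has to be edited by hand when a source file is added or removed.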
File naming conventions are consistent:
- _lite suffix means MessageLite-only (no reflection)
- internal/ subdirectories contain implementation details
- port_def.inc / port_undef.inc are macro guard pairs that bracket platform-specific defines
- *.proto files in src/google/protobuf/ are the well-known types and internal schemas
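The macro guard pair is the least obvious of these conventions, so here is a self-contained sketch of the pattern. `TOY_ALWAYS_INLINE` and `AddOne` are hypothetical; in the real headers the define and undefine steps are `#include` lines for port_def.inc and port_undef.inc.

```cpp
// Stand-in for `#include "port_def.inc"`: define helper macros that the
// rest of this header is allowed to use.
#if defined(__GNUC__) || defined(__clang__)
#define TOY_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define TOY_ALWAYS_INLINE inline
#endif

// Header body: code written in terms of the helper macros.
TOY_ALWAYS_INLINE int AddOne(int x) { return x + 1; }

// Stand-in for `#include "port_undef.inc"`: undefine every helper macro so
// none of them leaks into the code that includes this header.
#undef TOY_ALWAYS_INLINE

#ifdef TOY_ALWAYS_INLINE
#error "helper macro leaked past the closing guard"
#endif
```

Bracketing each header this way lets the library use short, convenient macros internally without polluting the macro namespace of user code.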
Conformance tests in conformance/ provide the single source of truth for correctness across all languages. Each language has a failure list file (e.g., failure_list_python.txt) that explicitly documents known divergences — a surprisingly effective contract mechanism.
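One way to read a failure list as a contract is that it must match reality in both directions. The sketch below assumes that interpretation; `FailureListReport` and `CheckAgainstFailureList` are hypothetical names, and the real conformance runner's output and options differ.

```cpp
#include <set>
#include <string>
#include <vector>

struct FailureListReport {
  std::vector<std::string> unexpected_failures;  // failed but not listed
  std::vector<std::string> unexpected_passes;    // listed but did not fail
};

// Compares the tests that actually failed against the documented failure list.
FailureListReport CheckAgainstFailureList(
    const std::set<std::string>& failed,
    const std::set<std::string>& failure_list) {
  FailureListReport report;
  for (const std::string& test : failed) {
    if (failure_list.count(test) == 0) {
      report.unexpected_failures.push_back(test);
    }
  }
  for (const std::string& test : failure_list) {
    if (failed.count(test) == 0) {
      report.unexpected_passes.push_back(test);
    }
  }
  return report;
}
```

Treating an unexpected pass as an error is what keeps the list honest: once a divergence is fixed, its entry has to be removed, so the file cannot silently go stale.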
What's Next
With this map in hand, you're ready to dive deeper. In the next article, we'll follow the full compilation pipeline inside protoc: from the hand-written tokenizer that lexes .proto files, through the recursive-descent parser that builds FileDescriptorProto messages, into the DescriptorPool that resolves cross-file references, and finally out through the CodeGenerator interface that produces language-specific source code. We'll also unpack the beautiful self-referential puzzle of descriptor.proto — the proto file that describes all proto files.