Protocol Buffers Source Code: A Map of the Territory

Intermediate

Prerequisites

  • Basic familiarity with what Protocol Buffers are and how .proto files work
  • Comfort reading C++ header files

The Protocol Buffers repository is one of the most widely used open-source projects. Nearly every major distributed system at Google, and many outside it, serializes data through protobuf. The repository that produces the compiler and runtimes spans roughly 200k lines of C++, plus substantial codebases in Java, Python, Ruby, PHP, Rust, C#, Objective-C, and Kotlin. If you've ever wanted to understand how protoc turns a .proto file into working code, or how the runtime parses a message in nanoseconds, this series will walk you through every layer.

This first article is your map. We'll establish the mental model, tour the directory structure, understand why there are two C/C++ runtimes, and trace the main entry point. By the end, you'll be able to open any file in the repository and know where it fits.

The Three-Layer Architecture

Every protobuf interaction flows through three conceptual layers. Understanding these layers is the single most important thing for navigating the codebase.

flowchart TB
    subgraph Schema["Schema Layer"]
        A[".proto files"] --> B["Descriptor Graph<br/>(FileDescriptor, Descriptor, FieldDescriptor)"]
    end
    subgraph Codegen["Code Generation Layer"]
        B --> C["protoc + Language Generators"]
        C --> D["Generated Source Files<br/>(.pb.h/.pb.cc, .java, .py, etc.)"]
    end
    subgraph Runtime["Runtime Layer"]
        D --> E["Generated Code + Runtime Library"]
        E --> F["Serialize / Parse / Reflect"]
    end

The Schema Layer defines the language-neutral type system. A .proto file is parsed into a FileDescriptorProto, then built into an immutable FileDescriptor within a DescriptorPool. This descriptor graph — rooted in src/google/protobuf/descriptor.h — is the canonical in-memory representation of every message, field, enum, and service.

The Code Generation Layer reads descriptors and emits language-specific source files. The protoc compiler orchestrates this, dispatching to registered CodeGenerator implementations. Each generator knows how to emit idiomatic code for its target language.

The Runtime Layer is what end users actually link against. It provides base classes (MessageLite, Message), serialization/parsing logic, arena allocation, and reflection. Generated code inherits from these base classes and plugs into the runtime's parsing tables.

This separation is what makes protobuf language-neutral: the schema layer is shared, and each language gets its own generator and runtime.
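To make the hand-off between the three layers concrete, here is a toy sketch in Python. Every name in it (ToyField, ToyMessage, ToyMessageBase, generate_python) is hypothetical, and the "number=value" text encoding is invented for illustration; it is not protobuf's descriptor API or wire format. It only shows the shape of the pipeline: a descriptor feeds a generator, and the generated class leans on a shared runtime base class.

```python
from dataclasses import dataclass

# Schema layer: a toy stand-in for the descriptor graph (hypothetical types,
# far simpler than the real FileDescriptor / Descriptor / FieldDescriptor).
@dataclass(frozen=True)
class ToyField:
    name: str
    number: int

@dataclass(frozen=True)
class ToyMessage:
    name: str
    fields: tuple

# Runtime layer: a shared base class that generated code inherits from.
class ToyMessageBase:
    FIELDS = []  # list of (field name, field number), filled in by generated code

    def __init__(self, **values):
        for field_name, _ in self.FIELDS:
            setattr(self, field_name, values.get(field_name, 0))

    def serialize(self):
        # Invented text encoding "number=value;...", NOT the protobuf wire format.
        return ";".join(
            f"{number}={getattr(self, name)}" for name, number in self.FIELDS
        )

# Code generation layer: reads a descriptor and emits source text.
def generate_python(message):
    pairs = ", ".join(f"({f.name!r}, {f.number})" for f in message.fields)
    return f"class {message.name}(ToyMessageBase):\n    FIELDS = [{pairs}]\n"

descriptor = ToyMessage("Point", (ToyField("x", 1), ToyField("y", 2)))
source = generate_python(descriptor)

# exec() stands in for compiling and importing the generated file.
namespace = {"ToyMessageBase": ToyMessageBase}
exec(source, namespace)
Point = namespace["Point"]

encoded = Point(x=3, y=4).serialize()
print(source)
print(encoded)
```

The generated class contains no serialization logic of its own; it only supplies a field table that the base class interprets. That is a loose analogy for how generated code plugs into the runtime's parsing tables.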

Directory Structure Walkthrough

The repository is large, but its top-level layout follows the three-layer model cleanly.

| Directory | Layer | Purpose |
| --- | --- | --- |
| src/google/protobuf/ | Schema + Runtime | Core C++ runtime, descriptor system, arena, parsing |
| src/google/protobuf/compiler/ | Code Generation | protoc CLI, parser, and all built-in language generators |
| upb/ | Runtime (C) | µpb, a lightweight C runtime used by dynamic languages |
| hpb/ | Runtime (C++) | Ergonomic C++ wrapper over µpb |
| upb_generator/ | Code Generation | Code generator for µpb mini-tables |
| hpb_generator/ | Code Generation | Code generator for HPB C++ wrappers |
| java/, python/, ruby/, php/, rust/, csharp/, objectivec/ | Runtime | Per-language runtime libraries |
| conformance/ | Testing | Cross-language conformance test suite |
| editions/ | Schema | Editions feature definitions and test data |
| docs/ | Documentation | Design docs, upb guides |
| pkg/ | Build/Packaging | Distribution packaging, file list generation |
| .github/workflows/ | CI | 20+ GitHub Actions workflows |

The C++ compiler and runtime live under src/google/protobuf/. Within that, compiler/ holds the front-end (tokenizer, parser, CLI) and subdirectories for each language generator (compiler/cpp/, compiler/java/, compiler/python/, etc.).

Tip: The compiler/ subdirectory contains generators for all built-in languages, not just C++. When you see compiler/rust/generator.h, that's the Rust code generator — still written in C++, still part of the protoc binary.
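One way to internalize the directory table is to treat it as a longest-prefix lookup from file path to layer. The sketch below is a reading aid only: the LAYER_BY_PREFIX mapping is transcribed from a subset of the table, and layer_for is a hypothetical helper, not something the repository provides.

```python
# Directory prefixes and layers transcribed from the table above (subset).
LAYER_BY_PREFIX = {
    "src/google/protobuf/": "Schema + Runtime",
    "src/google/protobuf/compiler/": "Code Generation",
    "upb/": "Runtime (C)",
    "hpb/": "Runtime (C++)",
    "upb_generator/": "Code Generation",
    "hpb_generator/": "Code Generation",
    "conformance/": "Testing",
    "editions/": "Schema",
}

def layer_for(path):
    """Return the layer of the longest matching directory prefix, or None."""
    matches = [prefix for prefix in LAYER_BY_PREFIX if path.startswith(prefix)]
    if not matches:
        return None
    # The most specific (longest) prefix wins, so compiler/ beats its parent.
    return LAYER_BY_PREFIX[max(matches, key=len)]

print(layer_for("src/google/protobuf/compiler/rust/generator.h"))
print(layer_for("src/google/protobuf/descriptor.h"))
```

The longest-prefix rule matters because compiler/ is nested inside the runtime directory. A file under src/google/protobuf/compiler/ belongs to the code generation layer even though its path also starts with the runtime prefix.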

Dual-Runtime Strategy: C++ vs. µpb

One of the most surprising things about this repository is that it contains two separate C/C++ runtimes. Understanding why is essential.

flowchart LR
    subgraph Full["C++ Runtime (src/google/protobuf/)"]
        direction TB
        ML[MessageLite] --> M[Message]
        M --> R[Reflection]
        M --> AR[Arena Allocation]
        M --> TC[TcParser]
    end
    subgraph Micro["µpb Runtime (upb/)"]
        direction TB
        UM[upb_Message] --> MT[upb_MiniTable]
        UM --> UA[upb_Arena]
        UM --> UD[upb_Decode / upb_Encode]
    end
    Full -.->|"Used by: C++ apps"| U1[C++ Users]
    Micro -.->|"Wrapped by: PHP, Ruby, Python"| U2[Dynamic Languages]

The C++ runtime is a full-featured, user-facing library. It provides reflection, lazy fields, dynamic messages, a thread-safe arena, and the high-performance TcParser. It's what C++ applications link against directly.

µpb (micro protobuf, in upb/) is a minimal C kernel. It prioritizes small code size and a stable C ABI over features. As the docs explain in docs/upb/vs-cpp-protos.md:

  • C++ protobuf is a user-level library designed for direct use by C++ applications
  • µpb is designed primarily to be wrapped by other languages — a C kernel that forms the basis for building language-specific protobuf libraries

PHP, Ruby, and Python all use µpb as their serialization kernel, calling into it through native bindings (a C extension or an FFI layer, depending on the language). This is why those language directories exist alongside upb/: they contain the glue code that wraps µpb into idiomatic APIs.

The code size difference is dramatic. For a binary that parses and serializes descriptor.proto, µpb's .text section is 26 KiB versus 983 KiB for the full C++ runtime, a difference of roughly 38x.

Entry Points: protoc and the Plugin System

The protoc binary starts in src/google/protobuf/compiler/main.cc. The ProtobufMain() function is surprisingly readable — it instantiates a CommandLineInterface, registers every built-in generator, and calls cli.Run():

flowchart TD
    A["ProtobufMain()"] --> B["Create CommandLineInterface"]
    B --> C["AllowPlugins('protoc-')"]
    C --> D["Register 11 built-in generators"]
    D --> E["cli.Run(argc, argv)"]
    E --> F{"--X_out flag?"}
    F -->|"Built-in"| G["Invoke registered CodeGenerator"]
    F -->|"Unknown"| H["Find protoc-gen-X plugin binary"]
    H --> I["Pipe CodeGeneratorRequest via stdin"]
    I --> J["Read CodeGeneratorResponse from stdout"]

The 11 registered generators are:

| Flag | Generator | Language |
| --- | --- | --- |
| --cpp_out | CppGenerator | C++ |
| --java_out | JavaGenerator | Java |
| --kotlin_out | KotlinGenerator | Kotlin |
| --python_out | python::Generator | Python |
| --pyi_out | PyiGenerator | Python stubs |
| --php_out | php::Generator | PHP |
| --ruby_out | ruby::Generator | Ruby |
| --rbs_out | RBSGenerator | Ruby type defs |
| --csharp_out | csharp::Generator | C# |
| --objc_out | ObjectiveCGenerator | Objective-C |
| --rust_out | RustGenerator | Rust |

For languages that are not built in (Go, Dart, or any third-party language), protoc uses the plugin protocol. When it sees an unrecognized --foo_out flag, it searches PATH for an executable named protoc-gen-foo, writes a serialized CodeGeneratorRequest to that process's stdin, and reads a CodeGeneratorResponse from its stdout. The PluginMain() function in src/google/protobuf/compiler/plugin.h provides a turnkey entry point for writing such plugins in C++.
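The built-in versus plugin decision can be sketched in a few lines of Python. This is an illustrative model of the rule described above, not a port of CommandLineInterface. The resolve_output_flag helper and its return shape are hypothetical, and the real CLI handles more cases than this.

```python
import shutil

# Language names taken from the built-in flags listed above.
BUILTIN_GENERATORS = {
    "cpp", "java", "kotlin", "python", "pyi", "php",
    "ruby", "rbs", "csharp", "objc", "rust",
}

def resolve_output_flag(flag, find_executable=shutil.which):
    """Classify a --NAME_out=DIR flag as built-in or plugin (illustrative only)."""
    name, _, out_dir = flag.partition("=")
    if not (name.startswith("--") and name.endswith("_out")):
        raise ValueError(f"not an output flag: {flag!r}")
    language = name[2:-len("_out")]
    if language in BUILTIN_GENERATORS:
        return ("builtin", language, out_dir)
    # Unknown language: look for a protoc-gen-NAME executable on PATH.
    plugin = f"protoc-gen-{language}"
    path = find_executable(plugin)
    if path is None:
        raise FileNotFoundError(f"{plugin} not found on PATH")
    return ("plugin", path, out_dir)

print(resolve_output_flag("--cpp_out=gen"))
# A fake lookup keeps the example independent of what is installed locally.
print(resolve_output_flag("--go_out=gen", find_executable=lambda n: "/usr/bin/" + n))
```

Passing find_executable as a parameter is a choice made for this sketch so that the lookup can be faked in a test. By default it falls back to shutil.which, which performs the PATH search.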

Tip: The plugin protocol means you can write a protobuf code generator in any language, without any C++ dependency. Your binary just needs to read/write protobuf messages on stdio. This is why protoc-gen-go is written in Go.

A few practical patterns will help you move quickly through the repo.

Version tracking uses version.json, which tracks independent version numbers per language. At the time of this writing, the C++ runtime is at 7.35-dev, Java at 4.35-dev, and protoc itself at 35-dev. These diverge because language runtimes evolve at different paces.
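Reading those three version strings side by side, each runtime appears to carry its own major number in front of a release number it shares with protoc. The sketch below only splits the strings quoted above. The dictionary keys and the split_version helper are hypothetical and do not claim to mirror the actual layout of version.json.

```python
# Version strings quoted in the paragraph above; keys are illustrative only.
versions = {"protoc": "35-dev", "cpp": "7.35-dev", "java": "4.35-dev"}

def split_version(version):
    """Split 'MAJOR.MINOR-suffix' or 'MINOR-suffix' into (major, minor, suffix)."""
    number, _, suffix = version.partition("-")
    parts = number.split(".")
    major = int(parts[0]) if len(parts) == 2 else None
    minor = int(parts[-1])
    return major, minor, suffix

parsed = {name: split_version(value) for name, value in versions.items()}
shared_minors = {minor for _, minor, _ in parsed.values()}

for name, (major, minor, suffix) in parsed.items():
    print(name, major, minor, suffix)
print(shared_minors)
```

With these inputs every entry reports the same release number, and only the language-specific major differs.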

Build system duality means the repository can be built two ways. Bazel is the canonical build system, with all external dependencies (Abseil, rules_cc, zlib, etc.) declared in MODULE.bazel. CMake is supported as a secondary system for ecosystem compatibility. The bridge between them is the gen_file_lists rule in pkg/BUILD.bazel, which generates file lists from Bazel targets that CMake then consumes.

flowchart LR
    A["Bazel BUILD files<br/>(canonical)"] --> B["gen_file_lists rule<br/>pkg/BUILD.bazel"]
    B --> C["src_file_lists.cmake"]
    C --> D["CMakeLists.txt<br/>(secondary)"]

File naming conventions are consistent:

  • _lite suffix means MessageLite-only (no reflection)
  • internal/ subdirectories contain implementation details
  • port_def.inc / port_undef.inc are macro guard pairs that bracket platform-specific defines
  • *.proto files in src/google/protobuf/ are the well-known types and internal schemas

Conformance tests in conformance/ provide the single source of truth for correctness across all languages. Each language has a failure list file (e.g., failure_list_python.txt) that explicitly documents known divergences — a surprisingly effective contract mechanism.
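The contract a failure list expresses can be modeled as a set comparison. The sketch below assumes a one-test-name-per-line file format with "#" comments, and it assumes that a listed test which unexpectedly passes is flagged as a stale entry. Both are assumptions for illustration, and the test names are invented rather than copied from a real failure list.

```python
def parse_failure_list(text):
    """Return the set of test names in a failure list, ignoring '#' comments."""
    names = set()
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            names.add(name)
    return names

def check_run(results, expected_failures):
    """Compare actual results (name -> passed?) with the expected failures."""
    failed = {name for name, ok in results.items() if not ok}
    passed = set(results) - failed
    return {
        "unexpected_failures": sorted(failed - expected_failures),
        "unexpected_passes": sorted(passed & expected_failures),
    }

# Hypothetical entries, not copied from a real failure list.
failure_list = parse_failure_list("""
# Known divergences for an imaginary runtime
Example.JsonInput.FieldNameCase   # tracked elsewhere
Example.ProtobufInput.UnknownEnum
""")

results = {
    "Example.JsonInput.FieldNameCase": False,   # listed and failing: accepted
    "Example.ProtobufInput.UnknownEnum": True,  # listed but passing: stale entry
    "Example.ProtobufInput.Varint": False,      # failing and unlisted: regression
}

report = check_run(results, failure_list)
print(report)
```

Flagging both directions is what would make the list behave as a contract: a new failure cannot slip in silently, and a fixed test has to be removed from the list rather than lingering as an outdated excuse.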

What's Next

With this map in hand, you're ready to dive deeper. In the next article, we'll follow the full compilation pipeline inside protoc: from the hand-written tokenizer that lexes .proto files, through the recursive-descent parser that builds FileDescriptorProto messages, into the DescriptorPool that resolves cross-file references, and finally out through the CodeGenerator interface that produces language-specific source code. We'll also unpack the beautiful self-referential puzzle of descriptor.proto — the proto file that describes all proto files.