Gatsby's Architecture: Navigating a 105-Package Monorepo
Prerequisites
- Basic React knowledge
- Familiarity with npm/yarn package management
- Understanding of monorepo concepts
Gatsby is one of the most ambitious open-source JavaScript projects ever built. Beneath the familiar gatsby build and gatsby develop commands lies a monorepo containing roughly 105 packages, spanning source plugins, transformer plugins, a GraphQL data layer, an XState-driven development server, and a full deployment adapter abstraction. Understanding how these pieces fit together is the key to contributing to—or deeply leveraging—the framework.
This article is the first in a five-part series. We'll start with a bird's-eye view of the repository structure, trace how a CLI command travels from global installation to project-local execution, and introduce the fundamental architectural split between the build pipeline and the develop server that shapes everything that follows.
Monorepo Structure: Lerna + Yarn Workspaces
Gatsby uses Lerna with Yarn Workspaces to manage its packages. The root configuration is minimal but tells you everything about the layout:
{
  "packages": ["packages/*"],
  "npmClient": "yarn",
  "useWorkspaces": true,
  "version": "independent"
}
The "version": "independent" line is significant—each package is versioned and published independently, which is essential when core framework changes shouldn't force version bumps in every ecosystem plugin. Yarn Workspaces (configured in the root package.json) handles symlinking so that packages can require each other without publishing to npm first.
graph TD
subgraph "Core"
gatsby["gatsby"]
cli["gatsby-cli"]
coreutils["gatsby-core-utils"]
pluginutils["gatsby-plugin-utils"]
end
subgraph "Source Plugins"
fs["gatsby-source-filesystem"]
contentful["gatsby-source-contentful"]
drupal["gatsby-source-drupal"]
wordpress["gatsby-source-wordpress"]
end
subgraph "Transformer Plugins"
remark["gatsby-transformer-remark"]
sharp["gatsby-transformer-sharp"]
yaml["gatsby-transformer-yaml"]
end
subgraph "Feature Plugins"
image["gatsby-plugin-image"]
mdx["gatsby-plugin-mdx"]
feed["gatsby-plugin-feed"]
offline["gatsby-plugin-offline"]
end
subgraph "Adapters"
netlify["gatsby-adapter-netlify"]
end
cli --> gatsby
gatsby --> coreutils
gatsby --> pluginutils
fs --> coreutils
remark --> fs
The packages fall into three natural tiers:
| Tier | Examples | Role |
|---|---|---|
| Core | gatsby, gatsby-cli, gatsby-core-utils, gatsby-plugin-utils, gatsby-page-utils | Framework runtime, CLI, shared utilities |
| Ecosystem Plugins | gatsby-source-filesystem, gatsby-transformer-remark, gatsby-plugin-image | Data sourcing, transformation, and features |
| Adapters & Utilities | gatsby-adapter-netlify, gatsby-graphiql-explorer, create-gatsby | Deployment adapters, tooling, scaffolding |
Tip: If you're looking for "where does feature X live?", the naming convention is your friend. gatsby-source-* packages source data, gatsby-transformer-* packages transform nodes, and gatsby-plugin-* packages add build/runtime features. gatsby-adapter-* packages handle deployment platform specifics.
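The naming convention can be expressed as a small lookup. The sketch below is illustrative only: classifyPackage and the PackageRole labels are hypothetical names, not part of Gatsby, and the prefix check is a heuristic (gatsby-plugin-utils, for example, matches the plugin prefix even though it is a core utility).

```typescript
// Hypothetical role labels for the prefixes described in the tip above.
type PackageRole = "source" | "transformer" | "plugin" | "adapter" | "other"

// Prefix-to-role table; the prefixes do not overlap, so order is not significant.
const PREFIX_ROLES: Array<[string, PackageRole]> = [
  ["gatsby-source-", "source"],
  ["gatsby-transformer-", "transformer"],
  ["gatsby-plugin-", "plugin"],
  ["gatsby-adapter-", "adapter"],
]

// Hypothetical helper: returns the role implied by a package's name prefix.
function classifyPackage(packageName: string): PackageRole {
  for (const [prefix, role] of PREFIX_ROLES) {
    if (packageName.indexOf(prefix) === 0) return role
  }
  return "other" // e.g. gatsby, gatsby-cli, create-gatsby
}

console.log(classifyPackage("gatsby-source-filesystem"))
console.log(classifyPackage("gatsby-cli"))
```

Treat the result as a first guess about where to look, then confirm against the package's own README.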
The CLI Delegation Pattern
One of the most elegant patterns in Gatsby's architecture is the two-phase CLI resolution. There are actually two CLIs: a globally installed gatsby-cli package and a project-local gatsby package. When you run gatsby build, the global CLI doesn't execute the build logic itself—it delegates to the project-local installation.
Phase 1: The Global Entry Point
The global entry point lives in packages/gatsby-cli/src/index.ts. Its responsibilities are narrow but critical:
- Node.js version validation (lines 22–38): Ensures the runtime meets the minimum supported version
- Global error handlers: Sets up unhandledRejection and uncaughtException handlers
- CLI creation: Calls createCli(process.argv) at line 76
sequenceDiagram
participant User
participant GlobalCLI as gatsby-cli (global)
participant Yargs
participant LocalGatsby as gatsby (local)
User->>GlobalCLI: gatsby build
GlobalCLI->>GlobalCLI: Check Node.js version
GlobalCLI->>GlobalCLI: Set up error handlers
GlobalCLI->>Yargs: createCli(process.argv)
Yargs->>GlobalCLI: resolveLocalCommand("build")
GlobalCLI->>LocalGatsby: resolveCwd.silent("gatsby/dist/commands/build")
LocalGatsby-->>GlobalCLI: build command handler
GlobalCLI->>LocalGatsby: Execute build(args)
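To make the first Phase 1 responsibility concrete, here is a minimal sketch of a minimum-version check. It is not the code from index.ts: meetsMinimumVersion is a hypothetical helper, the comparison is hand-rolled rather than delegated to a semver library, and the 18.0.0 threshold used below is an assumed value for illustration.

```typescript
// Hypothetical helper: compares dotted version strings numerically, segment by segment.
// Pre-release suffixes such as "-rc.1" are ignored by parseInt in this sketch.
function meetsMinimumVersion(current: string, minimum: string): boolean {
  const parse = (version: string): number[] =>
    version
      .replace(/^v/, "") // process.version-style strings start with "v"
      .split(".")
      .map(part => parseInt(part, 10) || 0)

  const a = parse(current)
  const b = parse(minimum)
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    if (x !== y) return x > y
  }
  return true // equal versions satisfy the minimum
}

// Assumed minimum, for illustration only.
console.log(meetsMinimumVersion("v20.11.1", "18.0.0"))
console.log(meetsMinimumVersion("v16.20.2", "18.0.0"))
```

Running a check like this before any other work lets the CLI fail with a clear message instead of a syntax error from a newer language feature.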
Phase 2: Local Command Resolution
The magic happens in packages/gatsby-cli/src/create-cli.ts. The resolveLocalCommand function uses resolveCwd.silent() to find the project-local gatsby installation:
const cmdPath =
  resolveCwd.silent(`gatsby/dist/commands/${command}`) ||
  // Old location of commands
  resolveCwd.silent(`gatsby/dist/utils/${command}`)
This pattern ensures that a globally installed gatsby-cli@5.x can work correctly with a project using gatsby@4.x in its node_modules. The global CLI is just a thin router; the actual command logic comes from whatever version of gatsby the project depends on.
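The fallback behaviour can be exercised in isolation. In the sketch below the resolver is injected as a parameter so the example runs without the resolve-cwd package; findLocalCommandPath, SilentResolver, and the fake project paths are hypothetical names introduced for illustration, and only the two lookup locations come from the snippet above.

```typescript
// A resolver returns a path, or undefined when the module cannot be found.
// In gatsby-cli this role is played by resolveCwd.silent; here it is injected.
type SilentResolver = (moduleId: string) => string | undefined

// Hypothetical helper: tries the current command location, then the old one.
function findLocalCommandPath(
  command: string,
  resolveSilent: SilentResolver
): string | undefined {
  return (
    resolveSilent(`gatsby/dist/commands/${command}`) ||
    // Old location of commands
    resolveSilent(`gatsby/dist/utils/${command}`)
  )
}

// Fake resolver standing in for a project whose gatsby only has the old layout.
const legacyProject: SilentResolver = moduleId =>
  moduleId === "gatsby/dist/utils/build"
    ? "/site/node_modules/gatsby/dist/utils/build.js"
    : undefined

console.log(findLocalCommandPath("build", legacyProject))
console.log(findLocalCommandPath("develop", legacyProject))
```

When both lookups return undefined, the caller knows there is no usable local gatsby for that command and can report it instead of crashing on a failed require.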
The project-local gatsby package also has its own bin entry (packages/gatsby/cli.js)—a three-line file that simply requires ./dist/bin/gatsby.js. This is the fallback path when npx gatsby build runs directly.
Tip: When debugging CLI issues, always check which gatsby version resolveLocalCommand is resolving to. A mismatch between global CLI expectations and local package exports is a common source of confusing errors.
Build vs Develop: Two Architectures
Here's where Gatsby's design gets genuinely interesting. The gatsby build and gatsby develop commands don't just differ in output—they use fundamentally different architectural patterns.
gatsby build: Sequential Imperative Pipeline
The build command in packages/gatsby/src/commands/build.ts is a straightforward async function that runs a series of steps in order:
flowchart LR
A[bootstrap] --> B[writeOutRequires]
B --> C[Build JS Bundle]
C --> D[Build HTML Renderer]
D --> E[Run Queries]
E --> F[Generate HTML]
F --> G[onPostBuild]
G --> H[Adapter Deploy]
Each step completes before the next begins. There's no concurrency, no event handling, no state machine. It's a classic build pipeline: load config → source data → build schema → create pages → bundle JS → render HTML → deploy.
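The shape of such a pipeline is easy to sketch. The example below is not taken from build.ts: runSequentially and the Step type are hypothetical, and the step bodies are empty placeholders named after the boxes in the flowchart. It only shows the control flow, where each step is awaited before the next one starts.

```typescript
// A named unit of work. The names below mirror the flowchart; the bodies are stand-ins.
type Step = { name: string; run: () => Promise<void> }

// Hypothetical runner: awaits each step in order and records which ones finished.
// A rejected step stops the loop, so later steps never run.
async function runSequentially(steps: Step[]): Promise<string[]> {
  const completed: string[] = []
  for (const step of steps) {
    await step.run()
    completed.push(step.name)
  }
  return completed
}

const noop = async (): Promise<void> => {}
const buildSteps: Step[] = [
  "bootstrap",
  "writeOutRequires",
  "buildJsBundle",
  "buildHtmlRenderer",
  "runQueries",
  "generateHtml",
  "onPostBuild",
  "adapterDeploy",
].map(stepName => ({ name: stepName, run: noop }))

runSequentially(buildSteps).then(order => console.log(order.join(" -> ")))
```

Because the order is fixed in one place, a failure report can name exactly which step stopped the build.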
gatsby develop: XState Reactive State Machine
The develop command couldn't be more different. It has to solve an inherently reactive problem: files change, webhooks arrive, node mutations fire, and the dev server must respond to all of these events, sometimes simultaneously and sometimes while another rebuild is still in progress.
packages/gatsby/src/commands/develop.ts spawns a ControllableScript child process, and inside that child process (packages/gatsby/src/commands/develop-process.ts), an XState state machine is created and interpreted:
const machine = developMachine.withContext({
  program,
  parentSpan,
  app,
  reporter,
  pendingQueryRuns: new Set([`/`]),
  shouldRunInitialTypegen: true,
})
const service = interpret(machine)
service.start()
This isn't a cosmetic choice—it's an architectural necessity. We'll explore the state machine in depth in Part 3.
flowchart TD
subgraph "Parent Process (develop.ts)"
A[ControllableScript] -->|IPC| B[Child Process]
A -->|heartbeat| C[Crash Detection]
end
subgraph "Child Process (develop-process.ts)"
B --> D[XState developMachine]
D --> E[initializing]
E --> F[initializingData]
F --> G[runningQueries]
G --> H[startingDevServers]
H --> I[waiting]
I -->|file change| J[recompiling]
I -->|node mutation| K[recreatingPages]
J --> I
K --> G
end
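The transitions in the diagram can be written down as a plain lookup table. This sketch deliberately does not use XState: it is a hand-rolled reducer, the event names DONE, FILE_CHANGE, and NODE_MUTATION are hypothetical, and it copies only the arrows drawn above rather than the real developMachine definition.

```typescript
type DevState =
  | "initializing"
  | "initializingData"
  | "runningQueries"
  | "startingDevServers"
  | "waiting"
  | "recompiling"
  | "recreatingPages"

// Hypothetical event names; the real machine defines its own.
type DevEvent = "DONE" | "FILE_CHANGE" | "NODE_MUTATION"

// Transition table mirroring the arrows in the diagram.
const transitions: Record<DevState, Partial<Record<DevEvent, DevState>>> = {
  initializing: { DONE: "initializingData" },
  initializingData: { DONE: "runningQueries" },
  runningQueries: { DONE: "startingDevServers" },
  startingDevServers: { DONE: "waiting" },
  waiting: { FILE_CHANGE: "recompiling", NODE_MUTATION: "recreatingPages" },
  recompiling: { DONE: "waiting" },
  recreatingPages: { DONE: "runningQueries" },
}

// In this sketch, an event with no transition from the current state is ignored.
function next(state: DevState, event: DevEvent): DevState {
  return transitions[state][event] ?? state
}

// Start up, reach the waiting state, handle one file change, and settle again.
const events: DevEvent[] = ["DONE", "DONE", "DONE", "DONE", "FILE_CHANGE", "DONE"]
const finalState = events.reduce<DevState>(next, "initializing")
console.log(finalState)
```

Making every legal transition explicit is the point of the state-machine approach: an event that arrives in the wrong state has a defined outcome instead of racing with work already in progress.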
The Heart of Gatsby: packages/gatsby/src/ Layout
The packages/gatsby/src/ directory is where the framework's core logic lives. It's large, but its internal structure follows clear patterns.
| Directory | Purpose |
|---|---|
| bootstrap/ | Initialization pipeline: config loading, plugin loading, theme resolution |
| commands/ | CLI command handlers: build.ts, develop.ts, serve.ts, clean.ts |
| services/ | Discrete build stages as standalone functions: initialize, sourceNodes, buildSchema |
| state-machines/ | XState machines for develop: develop/, data-layer/, query-running/, waiting/ |
| redux/ | Redux store, reducers, action creators, type definitions, persistence |
| schema/ | GraphQL schema construction, inference, extensions, resolvers |
| query/ | Query extraction, compilation, validation, and execution |
| datastore/ | LMDB-backed persistent node storage |
| internal-plugins/ | Plugins bundled with Gatsby that use its own plugin API |
| utils/ | A large collection of utilities: webpack config, API runner, page data, adapters |
The Bootstrap Orchestrator
The bootstrap sequence—the initialization pipeline shared by both build and develop—is defined in packages/gatsby/src/bootstrap/index.ts:
const context = {
  ...bootstrapContext,
  ...(await initialize(bootstrapContext)),
}
await customizeSchema(context)
await sourceNodes(context)
await buildSchema(context)
// ... createPages, extractQueries, etc.
Each function imported from ../services represents a discrete build stage. This service-function pattern is what makes the stages reusable—the same sourceNodes service is used in both the build pipeline and the develop state machine.
flowchart TD
A[initialize] --> B[customizeSchema]
B --> C[sourceNodes]
C --> D[buildSchema]
D --> E[createPages]
E --> F[extractQueries]
F --> G[writeOutRedirects]
G --> H[postBootstrap]
style A fill:#e1f5fe
style C fill:#e8f5e9
style D fill:#e8f5e9
style E fill:#fff3e0
style F fill:#fff3e0
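A rough sketch of the service-function pattern follows. Everything in it is an assumption made for illustration: BuildContext, runServices, and the placeholder bodies are hypothetical, and the real services in packages/gatsby/src/services have their own signatures. The sketch shows only the idea that each stage is a standalone function over a shared context, so different callers can reuse it.

```typescript
// Hypothetical shared context; the real bootstrap context carries different fields.
interface BuildContext {
  siteDirectory: string
  nodes: string[]
  schemaTypes: string[]
}

// A service reads the context and returns the fields it wants to update.
type Service = (ctx: BuildContext) => Promise<Partial<BuildContext>>

// Placeholder bodies: they only record that the stage ran.
const sourceNodes: Service = async ctx => ({
  nodes: ctx.nodes.concat(["File"]),
})
const buildSchema: Service = async ctx => ({
  schemaTypes: ctx.nodes.map(nodeType => nodeType + "Connection"),
})

// Hypothetical runner: awaits each service and merges its result into the context.
async function runServices(
  initial: BuildContext,
  services: Service[]
): Promise<BuildContext> {
  let ctx = initial
  for (const service of services) {
    ctx = { ...ctx, ...(await service(ctx)) }
  }
  return ctx
}

const emptyContext: BuildContext = { siteDirectory: "/site", nodes: [], schemaTypes: [] }

// A bootstrap-style caller runs several stages in order...
runServices(emptyContext, [sourceNodes, buildSchema]).then(ctx => console.log(ctx.schemaTypes))
// ...while a refresh-style caller can reuse a single stage on its own.
runServices(emptyContext, [sourceNodes]).then(ctx => console.log(ctx.nodes))
```

Because a service depends only on the context it is handed, a sequential caller and an event-driven caller can both invoke it without knowing about each other.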
What's Next
In the next article, we'll trace a gatsby build command from start to finish—following the data as it flows through the initialize service (Parcel-based compilation, config loading, theme resolution), through the API runner bridge that connects core to plugins, across four distinct webpack stages, and out to the final HTML files. We'll also dig into how the .cache/ directory enables incremental builds.