Read OSS

Gatsby's Architecture: Navigating a 105-Package Monorepo

Intermediate

Prerequisites

  • Basic React knowledge
  • Familiarity with npm/yarn package management
  • Understanding of monorepo concepts

Gatsby is a large, ambitious open-source JavaScript project. Beneath the familiar gatsby build and gatsby develop commands lies a monorepo containing roughly 105 packages, spanning source plugins, transformer plugins, a GraphQL data layer, an XState-driven development server, and a deployment adapter abstraction. Understanding how these pieces fit together is the key to contributing to the framework or using it in depth.

This article is the first in a five-part series. We'll start with a bird's-eye view of the repository structure, trace how a CLI command travels from global installation to project-local execution, and introduce the fundamental architectural split between the build pipeline and the develop server that shapes everything that follows.

Monorepo Structure: Lerna + Yarn Workspaces

Gatsby uses Lerna with Yarn Workspaces to manage its packages. The root configuration is minimal but tells you everything about the layout:

lerna.json

{
  "packages": ["packages/*"],
  "npmClient": "yarn",
  "useWorkspaces": true,
  "version": "independent"
}

The "version": "independent" line is significant—each package is versioned and published independently, which is essential when core framework changes shouldn't force version bumps in every ecosystem plugin. Yarn Workspaces (configured in the root package.json) handles symlinking so that packages can require each other without publishing to npm first.

graph TD
    subgraph "Core"
        gatsby["gatsby"]
        cli["gatsby-cli"]
        coreutils["gatsby-core-utils"]
        pluginutils["gatsby-plugin-utils"]
    end

    subgraph "Source Plugins"
        fs["gatsby-source-filesystem"]
        contentful["gatsby-source-contentful"]
        drupal["gatsby-source-drupal"]
        wordpress["gatsby-source-wordpress"]
    end

    subgraph "Transformer Plugins"
        remark["gatsby-transformer-remark"]
        sharp["gatsby-transformer-sharp"]
        yaml["gatsby-transformer-yaml"]
    end

    subgraph "Feature Plugins"
        image["gatsby-plugin-image"]
        mdx["gatsby-plugin-mdx"]
        feed["gatsby-plugin-feed"]
        offline["gatsby-plugin-offline"]
    end

    subgraph "Adapters"
        netlify["gatsby-adapter-netlify"]
    end

    cli --> gatsby
    gatsby --> coreutils
    gatsby --> pluginutils
    fs --> coreutils
    remark --> fs

The packages fall into three natural tiers:

| Tier | Examples | Role |
| --- | --- | --- |
| Core | gatsby, gatsby-cli, gatsby-core-utils, gatsby-plugin-utils, gatsby-page-utils | Framework runtime, CLI, shared utilities |
| Ecosystem Plugins | gatsby-source-filesystem, gatsby-transformer-remark, gatsby-plugin-image | Data sourcing, transformation, and features |
| Adapters & Utilities | gatsby-adapter-netlify, gatsby-graphiql-explorer, create-gatsby | Deployment adapters, tooling, scaffolding |

Tip: If you're looking for "where does feature X live?", the naming convention is your friend. gatsby-source-* packages source data, gatsby-transformer-* packages transform nodes, and gatsby-plugin-* packages add build/runtime features. gatsby-adapter-* packages handle deployment platform specifics.
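The naming convention is regular enough to express as a prefix lookup. The following is a minimal sketch; the `classifyPackage` helper and the `PackageRole` type are hypothetical names introduced for illustration and do not exist in the Gatsby repository.

```typescript
// Hypothetical helper: maps a package name to the role implied by its prefix.
type PackageRole = "source" | "transformer" | "adapter" | "plugin" | "other";

// The four prefixes named in the tip above; they do not overlap.
const PREFIX_ROLES: ReadonlyArray<[string, PackageRole]> = [
  ["gatsby-source-", "source"],
  ["gatsby-transformer-", "transformer"],
  ["gatsby-adapter-", "adapter"],
  ["gatsby-plugin-", "plugin"],
];

function classifyPackage(name: string): PackageRole {
  for (const [prefix, role] of PREFIX_ROLES) {
    if (name.startsWith(prefix)) return role;
  }
  return "other"; // core packages, tooling, scaffolding
}

console.log(classifyPackage("gatsby-source-filesystem")); // source
console.log(classifyPackage("gatsby-transformer-remark")); // transformer
console.log(classifyPackage("gatsby-cli")); // other
console.log(classifyPackage("gatsby-plugin-utils")); // plugin
```

The last call shows the limit of the heuristic: gatsby-plugin-utils matches the gatsby-plugin-* prefix even though the table above lists it as a core utility, so treat the prefix as a first guess rather than a guarantee.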

The CLI Delegation Pattern

One of the most elegant patterns in Gatsby's architecture is the two-phase CLI resolution. There are actually two CLIs: a globally installed gatsby-cli package and a project-local gatsby package. When you run gatsby build, the global CLI doesn't execute the build logic itself—it delegates to the project-local installation.

Phase 1: The Global Entry Point

The global entry point lives in packages/gatsby-cli/src/index.ts. Its responsibilities are narrow but critical:

  1. Node.js version validation (lines 22–38): Ensures the runtime meets the minimum version
  2. Global error handlers: Sets up unhandledRejection and uncaughtException handlers
  3. CLI creation: Calls createCli(process.argv) at line 76

sequenceDiagram
    participant User
    participant GlobalCLI as gatsby-cli (global)
    participant Yargs
    participant LocalGatsby as gatsby (local)

    User->>GlobalCLI: gatsby build
    GlobalCLI->>GlobalCLI: Check Node.js version
    GlobalCLI->>GlobalCLI: Set up error handlers
    GlobalCLI->>Yargs: createCli(process.argv)
    Yargs->>GlobalCLI: resolveLocalCommand("build")
    GlobalCLI->>LocalGatsby: resolveCwd.silent("gatsby/dist/commands/build")
    LocalGatsby-->>GlobalCLI: build command handler
    GlobalCLI->>LocalGatsby: Execute build(args)
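The version gate in step 1 can be sketched as a simple comparison of dotted version strings. This is an illustration only: `parseVersion` and `meetsMinimum` are hypothetical names, the comparison is hand-rolled rather than taken from gatsby-cli (which may well rely on a semver library), and the `18.0.0` minimum is a placeholder, not a value taken from the source file.

```typescript
// Parses "v18.12.1" or "18.12.1" into [major, minor, patch]; missing parts become 0.
function parseVersion(version: string): [number, number, number] {
  const parts = version
    .replace(/^v/, "")
    .split(".")
    .map((part) => parseInt(part, 10) || 0);
  return [parts[0] ?? 0, parts[1] ?? 0, parts[2] ?? 0];
}

// Returns true when `current` is greater than or equal to `minimum`.
function meetsMinimum(current: string, minimum: string): boolean {
  const [curMajor, curMinor, curPatch] = parseVersion(current);
  const [minMajor, minMinor, minPatch] = parseVersion(minimum);
  if (curMajor !== minMajor) return curMajor > minMajor;
  if (curMinor !== minMinor) return curMinor > minMinor;
  return curPatch >= minPatch;
}

// Placeholder minimum, assumed for this example only.
const ASSUMED_MINIMUM = "18.0.0";

console.log(meetsMinimum("v18.12.1", ASSUMED_MINIMUM)); // true
console.log(meetsMinimum("16.20.0", ASSUMED_MINIMUM)); // false
```

Running the check before any other work means an unsupported runtime fails with a clear message instead of an obscure syntax or module error further down the delegation chain.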

Phase 2: Local Command Resolution

The magic happens in packages/gatsby-cli/src/create-cli.ts. The resolveLocalCommand function uses resolveCwd.silent() to find the project-local gatsby installation:

const cmdPath =
  resolveCwd.silent(`gatsby/dist/commands/${command}`) ||
  // Old location of commands
  resolveCwd.silent(`gatsby/dist/utils/${command}`)

This pattern ensures that a globally installed gatsby-cli@5.x can work correctly with a project using gatsby@4.x in its node_modules. The global CLI is just a thin router; the actual command logic comes from whatever version of gatsby the project depends on.
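To make the fallback order concrete, here is a small sketch that mirrors the two-location lookup with an injected resolver, so it can run without a real project on disk. The `SilentResolver` type, `resolveLocalCommandPath`, `fakeResolver`, and the example paths are all hypothetical; only the two module ID patterns come from the snippet above.

```typescript
// A resolver returns a path for a module ID, or undefined when nothing is
// found (standing in for the "silent" lookup described above).
type SilentResolver = (moduleId: string) => string | undefined;

function resolveLocalCommandPath(
  command: string,
  resolve: SilentResolver
): string | undefined {
  return (
    resolve(`gatsby/dist/commands/${command}`) ||
    // Old location of commands
    resolve(`gatsby/dist/utils/${command}`)
  );
}

// Hypothetical project layouts, keyed by module ID.
function fakeResolver(installed: Record<string, string>): SilentResolver {
  return (moduleId) => installed[moduleId];
}

const modernProject = fakeResolver({
  "gatsby/dist/commands/build": "/site/node_modules/gatsby/dist/commands/build.js",
});
const legacyProject = fakeResolver({
  "gatsby/dist/utils/build": "/site/node_modules/gatsby/dist/utils/build.js",
});
const noLocalGatsby = fakeResolver({});

console.log(resolveLocalCommandPath("build", modernProject)); // .../dist/commands/build.js
console.log(resolveLocalCommandPath("build", legacyProject)); // .../dist/utils/build.js
console.log(resolveLocalCommandPath("build", noLocalGatsby)); // undefined
```

An undefined result is the signal that no project-local gatsby is installed, which is the case the CLI has to report to the user rather than attempt to run.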

The project-local gatsby package also has its own bin entry (packages/gatsby/cli.js)—a three-line file that simply requires ./dist/bin/gatsby.js. This is the fallback path when npx gatsby build runs directly.

Tip: When debugging CLI issues, always check which gatsby version resolveLocalCommand is resolving to. A mismatch between global CLI expectations and local package exports is a common source of confusing errors.

Build vs Develop: Two Architectures

Here's where Gatsby's design gets genuinely interesting. The gatsby build and gatsby develop commands don't just differ in output—they use fundamentally different architectural patterns.

gatsby build: Sequential Imperative Pipeline

The build command in packages/gatsby/src/commands/build.ts is a straightforward async function that runs a series of steps in order:

flowchart LR
    A[bootstrap] --> B[writeOutRequires]
    B --> C[Build JS Bundle]
    C --> D[Build HTML Renderer]
    D --> E[Run Queries]
    E --> F[Generate HTML]
    F --> G[onPostBuild]
    G --> H[Adapter Deploy]

Each step completes before the next begins. At the orchestration level there's no event handling and no state machine, even if an individual step parallelizes its own work internally. It's a classic build pipeline: load config → source data → build schema → create pages → bundle JS → render HTML → deploy.
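The shape of such a pipeline is easy to sketch: an async function that awaits each step before starting the next. The code below is a simplified illustration, not the contents of build.ts; `Step`, `runPipeline`, and `fakeStep` are hypothetical names, and the fake steps only record when they start and finish.

```typescript
type Step = { name: string; run: () => Promise<void> };

// Runs each step to completion before starting the next one.
async function runPipeline(steps: Step[]): Promise<string[]> {
  const completed: string[] = [];
  for (const step of steps) {
    await step.run();
    completed.push(step.name);
  }
  return completed;
}

// Hypothetical stand-in for a real stage: logs its start and end around
// one asynchronous yield.
const fakeStep = (name: string, log: string[]): Step => ({
  name,
  run: async () => {
    log.push(`start ${name}`);
    await Promise.resolve(); // stands in for real async work
    log.push(`end ${name}`);
  },
});

const log: string[] = [];
runPipeline([
  fakeStep("bootstrap", log),
  fakeStep("buildJs", log),
  fakeStep("buildHtml", log),
]).then((completed) => {
  console.log(completed.join(" -> ")); // bootstrap -> buildJs -> buildHtml
  console.log(log.join(", ")); // each "end" precedes the next "start"
});
```

Because every step is awaited inside the loop, the log never interleaves; swapping the loop for `Promise.all` would start all three steps before any of them finished.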

gatsby develop: XState Reactive State Machine

The develop command couldn't be more different. It needs to handle an inherently reactive problem: files change, webhooks arrive, node mutations fire, and the dev server must respond to all of these events, sometimes simultaneously, sometimes while another rebuild is still in progress.

packages/gatsby/src/commands/develop.ts spawns a ControllableScript child process, and inside that child process (packages/gatsby/src/commands/develop-process.ts), an XState state machine is created and interpreted:

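import { interpret } from "xstate"

// Seed the machine definition with the initial context for this session
const machine = developMachine.withContext({
  program,
  parentSpan,
  app,
  reporter,
  pendingQueryRuns: new Set([`/`]),
  shouldRunInitialTypegen: true,
})

// Turn the machine definition into a running service and start it
const service = interpret(machine)
service.start()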

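This isn't a cosmetic choice; it's an architectural necessity. With events arriving in the middle of a rebuild, explicit states and transitions define which events are handled or deferred at each point. We'll explore the state machine in depth in Part 3.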

flowchart TD
    subgraph "Parent Process (develop.ts)"
        A[ControllableScript] -->|IPC| B[Child Process]
        A -->|heartbeat| C[Crash Detection]
    end

    subgraph "Child Process (develop-process.ts)"
        B --> D[XState developMachine]
        D --> E[initializing]
        E --> F[initializingData]
        F --> G[runningQueries]
        G --> H[startingDevServers]
        H --> I[waiting]
        I -->|file change| J[recompiling]
        I -->|node mutation| K[recreatingPages]
        J --> I
        K --> G
    end
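The transitions in the diagram above can be written down as a plain lookup table. This sketch deliberately avoids XState and takes the diagram's edges literally; the real machine also uses guards, invoked services, and nested machines that are not modeled here. The event names `DONE`, `FILE_CHANGE`, and `NODE_MUTATION` are placeholders chosen for this example.

```typescript
type DevState =
  | "initializing"
  | "initializingData"
  | "runningQueries"
  | "startingDevServers"
  | "waiting"
  | "recompiling"
  | "recreatingPages";

// Placeholder event names, assumed for illustration.
type DevEvent = "DONE" | "FILE_CHANGE" | "NODE_MUTATION";

// One entry per state; each maps the events it handles to a target state.
const transitions: Record<DevState, Partial<Record<DevEvent, DevState>>> = {
  initializing: { DONE: "initializingData" },
  initializingData: { DONE: "runningQueries" },
  runningQueries: { DONE: "startingDevServers" },
  startingDevServers: { DONE: "waiting" },
  waiting: { FILE_CHANGE: "recompiling", NODE_MUTATION: "recreatingPages" },
  recompiling: { DONE: "waiting" },
  recreatingPages: { DONE: "runningQueries" },
};

// Events a state does not handle leave the state unchanged.
function nextState(state: DevState, event: DevEvent): DevState {
  return transitions[state][event] ?? state;
}

const events: DevEvent[] = ["DONE", "DONE", "DONE", "DONE", "FILE_CHANGE", "DONE"];
let current: DevState = "initializing";
for (const event of events) {
  current = nextState(current, event);
}
console.log(current); // waiting
```

The benefit of the table form is that the answer to "what happens if a file changes while queries are running?" is visible at a glance: in this simplified model the event is ignored, whereas a production machine would typically record it and act on it later.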

The Heart of Gatsby: packages/gatsby/src/ Layout

The packages/gatsby/src/ directory is where the framework's core logic lives. It's large, but its internal structure follows clear patterns.

| Directory | Purpose |
| --- | --- |
| bootstrap/ | Initialization pipeline: config loading, plugin loading, theme resolution |
| commands/ | CLI command handlers: build.ts, develop.ts, serve.ts, clean.ts |
| services/ | Discrete build stages as standalone functions: initialize, sourceNodes, buildSchema |
| state-machines/ | XState machines for develop: develop/, data-layer/, query-running/, waiting/ |
| redux/ | Redux store, reducers, action creators, type definitions, persistence |
| schema/ | GraphQL schema construction, inference, extensions, resolvers |
| query/ | Query extraction, compilation, validation, and execution |
| datastore/ | LMDB-backed persistent node storage |
| internal-plugins/ | Plugins bundled with Gatsby that use its own plugin API |
| utils/ | A large collection of utilities: webpack config, API runner, page data, adapters |

The Bootstrap Orchestrator

The bootstrap sequence—the initialization pipeline shared by both build and develop—is defined in packages/gatsby/src/bootstrap/index.ts:

const context = {
  ...bootstrapContext,
  ...(await initialize(bootstrapContext)),
}
await customizeSchema(context)
await sourceNodes(context)
await buildSchema(context)
// ... createPages, extractQueries, etc.

Each function imported from ../services represents a discrete build stage. This service-function pattern is what makes the stages reusable—the same sourceNodes service is used in both the build pipeline and the develop state machine.
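A stripped-down sketch of the service-function pattern shows why it composes well. Everything here is hypothetical: the `BuildContext` shape, the toy `initialize` and `sourceNodes` bodies, and the `runBuild` and `onNodeMutation` callers are stand-ins, and the services are synchronous for brevity whereas the real ones are awaited.

```typescript
// Hypothetical, heavily simplified context; the real one carries far more state.
interface BuildContext {
  program: { directory: string };
  plugins?: string[];
  nodeCount?: number;
}

// A service reads the current context and returns the fields it contributes.
type Service = (context: BuildContext) => Partial<BuildContext>;

const initialize: Service = () => ({ plugins: ["gatsby-source-filesystem"] });

// Toy logic: pretend every plugin sources ten nodes.
const sourceNodes: Service = (context) => ({
  nodeCount: (context.plugins ?? []).length * 10,
});

// Caller 1: a sequential pipeline, in the style of build.
function runBuild(initial: BuildContext): BuildContext {
  let context: BuildContext = { ...initial, ...initialize(initial) };
  context = { ...context, ...sourceNodes(context) };
  return context;
}

// Caller 2: an event handler, in the style of develop, reusing the same service.
function onNodeMutation(context: BuildContext): BuildContext {
  return { ...context, ...sourceNodes(context) };
}

const built = runBuild({ program: { directory: "/site" } });
console.log(built.nodeCount); // 10

const refreshed = onNodeMutation({ ...built, plugins: ["a", "b"] });
console.log(refreshed.nodeCount); // 20
```

Because a service only depends on the context it is handed, the caller decides when and how often it runs, which is what lets one implementation serve both a one-shot pipeline and a long-lived state machine.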

flowchart TD
    A[initialize] --> B[customizeSchema]
    B --> C[sourceNodes]
    C --> D[buildSchema]
    D --> E[createPages]
    E --> F[extractQueries]
    F --> G[writeOutRedirects]
    G --> H[postBootstrap]

    style A fill:#e1f5fe
    style C fill:#e8f5e9
    style D fill:#e8f5e9
    style E fill:#fff3e0
    style F fill:#fff3e0

What's Next

In the next article, we'll trace a gatsby build command from start to finish—following the data as it flows through the initialize service (Parcel-based compilation, config loading, theme resolution), through the API runner bridge that connects core to plugins, across four distinct webpack stages, and out to the final HTML files. We'll also dig into how the .cache/ directory enables incremental builds.