Read OSS

Following a `gatsby build` from CLI to HTML: The Bootstrap Pipeline

Intermediate

Prerequisites

  • Article 1: Architecture and Monorepo Overview
  • Basic Webpack concepts (loaders, plugins)
  • Understanding of build pipelines

Now that we've oriented ourselves in the monorepo, let's trace the complete execution of gatsby build—from the moment the command handler fires to the final HTML files landing in public/. This is a long journey through dozens of files, but understanding it is essential because every feature in Gatsby ultimately expresses itself as a stage in this pipeline.

The Build Command Pipeline

The build command handler in packages/gatsby/src/commands/build.ts is exported as a single async function. Its structure is linear—each stage completes before the next begins.

flowchart TD
    A["bootstrap()"] --> B["onPreBuild hook"]
    B --> C["writeOutRequires()"]
    C --> D["buildProductionBundle()"]
    D --> E["buildRenderer()"]
    E --> F["preparePageTemplateConfigs()"]
    F --> G["Build Rendering Engines"]
    G --> H["calculateDirtyQueries()"]
    H --> I["Run Queries"]
    I --> J["buildHTMLPagesAndDeleteStaleArtifacts()"]
    J --> K["onPostBuild hook"]
    K --> L["adapterManager.adapt()"]
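
To make "linear" concrete, here is a minimal sketch of a runner in which each stage is awaited before the next one starts. The `BuildStage` type, `runPipeline`, and `makeStage` are hypothetical names used only for illustration; the real build.ts calls its stage functions directly rather than looping over a list.

```typescript
// Hypothetical stage shape; not part of Gatsby's API.
type BuildStage = { stageName: string; run: () => Promise<void> };

// Runs stages strictly in order: a stage starts only after the previous one resolves.
async function runPipeline(stages: BuildStage[]): Promise<string[]> {
  const completed: string[] = [];
  for (const stage of stages) {
    await stage.run();
    completed.push(stage.stageName);
  }
  return completed;
}

// Records the order in which stage bodies actually executed.
const executionLog: string[] = [];

// Placeholder stage bodies; the names mirror the first boxes of the flowchart.
const makeStage = (stageName: string): BuildStage => ({
  stageName,
  run: async () => {
    executionLog.push(stageName);
  },
});

const buildStages: BuildStage[] = [
  "bootstrap",
  "onPreBuild",
  "writeOutRequires",
  "buildProductionBundle",
].map(makeStage);
```

Calling `runPipeline(buildStages)` resolves with the stage names in the order they were listed. If any stage rejects, the loop stops and later stages never run, which is the behavior you would expect from a linear build.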

The pipeline can be divided into three macro-phases:

  1. Bootstrap (lines 127–131): Initialize plugins, source data, build schema, create pages, extract queries
  2. Bundle (lines 153–295): Four webpack compilations producing JS bundles and HTML renderers
  3. Render (lines 302–511): Execute queries, generate HTML, deploy via adapters

Let's trace each phase in detail.

The Initialize Service: Config, Themes, and Plugins

The bootstrap() call (at line 127) delegates to the bootstrap orchestrator we saw in Part 1, which itself calls initialize()—the single most complex function in the codebase. Here's what it does, in order:

sequenceDiagram
    participant Build as build.ts
    participant Bootstrap as bootstrap/index.ts
    participant Init as services/initialize.ts
    participant Parcel as compileGatsbyFiles
    participant Config as load-config
    participant Themes as load-themes
    participant Plugins as load-plugins

    Build->>Bootstrap: bootstrap({ program })
    Bootstrap->>Init: initialize(context)
    Init->>Parcel: compileGatsbyFiles(siteDirectory)
    Note over Parcel: Compile gatsby-config.ts, gatsby-node.ts
    Init->>Config: loadConfig({ siteDirectory })
    Config->>Themes: loadThemes(config)
    Note over Themes: Recursive theme resolution
    Themes-->>Config: Merged config
    Config-->>Init: Validated config
    Init->>Plugins: loadPlugins(config, siteDirectory)
    Note over Plugins: Normalize → Validate → Flatten
    Init->>Init: startPluginRunner()
    Init-->>Bootstrap: { store, workerPool }

Parcel Compilation

Before Gatsby can even read your config, it needs to compile TypeScript and ESM files. At lines 170–173 of initialize.ts, compileGatsbyFiles uses Parcel's bundler to transform gatsby-config.ts and gatsby-node.ts into CommonJS that Node can require. This is why you can write your Gatsby config in TypeScript: Parcel handles the transpilation before anything else runs.

Config Loading

The config loader (packages/gatsby/src/bootstrap/load-config/index.ts) does three things:

  1. Reads the compiled gatsby-config file
  2. Processes feature flags via handleFlags
  3. Passes the config through theme resolution via loadThemes

The root config cannot be a function; only theme configs can be functions (receiving theme options as arguments). This distinction is validated at lines 28–36.
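
The root-versus-theme rule can be sketched as a small resolver. This is an illustration under assumptions, not the code in load-config: `resolveConfig`, `SiteConfig`, and the error message are hypothetical.

```typescript
// Hypothetical, trimmed-down config shapes for illustration.
type SiteConfig = { plugins?: unknown[]; siteMetadata?: Record<string, unknown> };
type ThemeOptions = Record<string, unknown>;
type ConfigModule = SiteConfig | ((options: ThemeOptions) => SiteConfig);

// Returns a plain config object, rejecting function exports at the root.
function resolveConfig(
  mod: ConfigModule,
  opts: { isTheme: boolean; themeOptions?: ThemeOptions }
): SiteConfig {
  if (typeof mod === "function") {
    if (!opts.isTheme) {
      // Assumed wording; the real validation message may differ.
      throw new Error("The root gatsby-config must export an object, not a function");
    }
    // Theme configs may be functions that receive the theme's options.
    return mod(opts.themeOptions ?? {});
  }
  return mod;
}
```

A theme can therefore compute its config from the options the site passes in, while the site's own config stays a static object.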

Theme Resolution

Theme resolution in packages/gatsby/src/bootstrap/load-themes/index.ts is recursive. A theme is just a Gatsby plugin that has its own gatsby-config. Themes can depend on other themes, and each theme's config is merged with its parent. The resolution algorithm walks the dependency tree depth-first, collects all theme configs, and merges them bottom-up.
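
A minimal sketch of that walk follows, using an in-memory registry in place of package resolution on disk. The registry, the theme names, and the merge rule (plugin arrays concatenated, later siteMetadata keys overriding earlier ones) are assumptions made for illustration; Gatsby's actual merge logic is more involved.

```typescript
type ThemeConfig = { plugins: string[]; siteMetadata: Record<string, string> };

// Hypothetical registry: entries listed here are themes, anything else is a plain plugin.
const themeRegistry: Record<string, ThemeConfig> = {
  "theme-base": { plugins: ["plugin-a"], siteMetadata: { title: "Base", lang: "en" } },
  "theme-blog": { plugins: ["theme-base", "plugin-b"], siteMetadata: { title: "Blog" } },
};

// Depth-first: a theme's own dependencies are collected before the theme itself.
function collectThemes(themeName: string, seen: Set<string>): string[] {
  if (seen.has(themeName) || !(themeName in themeRegistry)) return [];
  seen.add(themeName);
  const ordered: string[] = [];
  for (const dep of themeRegistry[themeName].plugins) {
    ordered.push(...collectThemes(dep, seen));
  }
  ordered.push(themeName);
  return ordered;
}

// Bottom-up merge: deepest theme first, the site's own config last so it wins.
function mergeWithThemes(site: ThemeConfig): ThemeConfig {
  const seen = new Set<string>();
  const order: string[] = [];
  for (const dep of site.plugins) {
    order.push(...collectThemes(dep, seen));
  }
  const merged: ThemeConfig = { plugins: [], siteMetadata: {} };
  for (const config of [...order.map(n => themeRegistry[n]), site]) {
    merged.plugins.push(...config.plugins);
    Object.assign(merged.siteMetadata, config.siteMetadata);
  }
  return merged;
}
```

The `seen` set keeps a theme that is reachable through two parents from being merged twice.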

Plugin Loading

The plugin loader (packages/gatsby/src/bootstrap/load-plugins/index.ts) takes the merged config and:

  1. Normalizes string plugin references into { resolve, options } objects
  2. Validates plugin options against schemas
  3. Loads internal plugins (bundled with Gatsby)
  4. Flattens the plugin tree into a single array
  5. Collates which plugins implement which APIs
  6. Validates exports against known API lists (api-node-docs, api-browser-docs, api-ssr-docs)

The end result is a flat array dispatched to Redux as SET_SITE_FLATTENED_PLUGINS (lines 79–82). This array is the canonical registry of every plugin and which APIs each one implements.
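
Two of these steps, normalization and collation, are easy to sketch in isolation. The types and function names below are hypothetical, and the `nodeAPIs` lists in the usage are made up; the real loader discovers them by inspecting each plugin's exports.

```typescript
type PluginRef = string | { resolve: string; options?: Record<string, unknown> };
type NormalizedPlugin = { resolve: string; options: Record<string, unknown> };

// Step 1: turn string references into { resolve, options } objects.
function normalizePlugin(ref: PluginRef): NormalizedPlugin {
  return typeof ref === "string"
    ? { resolve: ref, options: {} }
    : { resolve: ref.resolve, options: ref.options ?? {} };
}

// A flattened entry also records which node APIs the plugin implements (assumed shape).
type FlattenedPlugin = NormalizedPlugin & { nodeAPIs: string[] };

// Step 5: build an index from API name to the plugins implementing it, in plugin order.
function collateByApi(plugins: FlattenedPlugin[]): Map<string, string[]> {
  const byApi = new Map<string, string[]>();
  for (const plugin of plugins) {
    for (const api of plugin.nodeAPIs) {
      const implementers = byApi.get(api) ?? [];
      implementers.push(plugin.resolve);
      byApi.set(api, implementers);
    }
  }
  return byApi;
}
```

An index like this is what lets a later lookup such as "which plugins implement sourceNodes?" be answered without rescanning every plugin.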

The API Runner: Bridging Core and Plugins

Every time Gatsby needs to invoke a plugin hook—sourceNodes, createPages, onCreateWebpackConfig, etc.—it goes through the API runner in packages/gatsby/src/utils/api-runner-node.js.

The API runner is the central nervous system of Gatsby's plugin architecture. For each API call, it:

  1. Looks up which plugins implement the requested API from the flattened plugins list
  2. Constructs a rich context object for each plugin containing:
    • Bound action creators (createNode, createPage, createRedirect, etc.)
    • Data access functions (getNode, getNodes, getNodesByType)
    • Utility functions (createNodeId, createContentDigest)
    • Schema type builders (buildObjectType, buildUnionType, etc.)
    • A cache instance scoped to the plugin
    • A reporter for structured logging
  3. Calls each plugin's implementation sequentially (not in parallel)

sequenceDiagram
    participant Core as Gatsby Core
    participant Runner as api-runner-node
    participant PluginA as gatsby-source-filesystem
    participant PluginB as gatsby-transformer-remark
    participant Redux as Redux Store

    Core->>Runner: apiRunnerNode("sourceNodes", { parentSpan })
    Runner->>Runner: Look up plugins implementing "sourceNodes"
    Runner->>Runner: Construct context (actions, getNode, cache...)
    Runner->>PluginA: sourceNodes(context)
    PluginA->>Redux: createNode(fileNode)
    PluginA-->>Runner: done
    Runner->>PluginB: sourceNodes(context)
    PluginB-->>Runner: done
    Runner-->>Core: results
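
The lookup, context construction, and sequential invocation can be sketched as follows. This is a simplified stand-in, not the contents of api-runner-node.js: `SitePlugin`, `PluginContext`, and `runApi` are hypothetical, and the context here carries a single action instead of the full set listed above.

```typescript
type CreatedNode = { id: string; owner: string };
type PluginContext = { pluginName: string; createNode: (node: { id: string }) => void };
type SitePlugin = {
  pluginName: string;
  apis: Record<string, (ctx: PluginContext) => void | Promise<void>>;
};

// Invokes `api` on every plugin that implements it, one plugin at a time.
async function runApi(
  api: string,
  plugins: SitePlugin[],
  nodes: CreatedNode[]
): Promise<string[]> {
  const ran: string[] = [];
  // Step 1: find the implementers, preserving plugin order.
  const implementers = plugins.filter(p => typeof p.apis[api] === "function");
  for (const plugin of implementers) {
    // Step 2: build a context scoped to this plugin.
    const ctx: PluginContext = {
      pluginName: plugin.pluginName,
      createNode: node => {
        nodes.push({ id: node.id, owner: plugin.pluginName });
      },
    };
    // Step 3: await each implementation before moving on to the next plugin.
    await plugin.apis[api](ctx);
    ran.push(plugin.pluginName);
  }
  return ran;
}
```

Because each call is awaited inside the loop, a later plugin always observes the nodes created by earlier ones, even when an earlier implementation is asynchronous.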

The action creators passed to plugins are "double-bound": first to Redux via bindActionCreators, then to the specific plugin and API call via a wrapper that injects metadata like traceId, plugin.name, and deferNodeMutation flags (lines 87–100). This ensures every action dispatched by a plugin carries provenance information.
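
The double binding can be illustrated with two small wrappers. This sketch hand-rolls the first binding instead of importing Redux's bindActionCreators, and the action shape, `bindToDispatch`, and `bindToPlugin` are hypothetical names; only the idea of layering dispatch first and plugin metadata second comes from the description above.

```typescript
type PluginInfo = { name: string };
type ReduxAction = { type: string; payload: unknown; plugin?: PluginInfo; traceId?: string };
type ActionCreator = (
  payload: unknown,
  plugin?: PluginInfo,
  opts?: { traceId?: string }
) => ReduxAction;

// A plain action creator that accepts provenance as extra arguments.
const createNode: ActionCreator = (payload, plugin, opts) => ({
  type: "CREATE_NODE",
  payload,
  plugin,
  traceId: opts?.traceId,
});

// First binding: calling the creator also dispatches the resulting action.
function bindToDispatch(
  creators: Record<string, ActionCreator>,
  dispatch: (action: ReduxAction) => void
): Record<string, ActionCreator> {
  const bound: Record<string, ActionCreator> = {};
  for (const key of Object.keys(creators)) {
    bound[key] = (...args) => {
      const action = creators[key](...args);
      dispatch(action);
      return action;
    };
  }
  return bound;
}

// Second binding: pre-fill the plugin and traceId so plugin code passes only the payload.
function bindToPlugin(
  bound: Record<string, ActionCreator>,
  plugin: PluginInfo,
  traceId: string
): Record<string, (payload: unknown) => ReduxAction> {
  const scoped: Record<string, (payload: unknown) => ReduxAction> = {};
  for (const key of Object.keys(bound)) {
    scoped[key] = payload => bound[key](payload, plugin, { traceId });
  }
  return scoped;
}
```

From the plugin's point of view, `createNode(node)` takes one argument, yet the dispatched action still records which plugin and which API call produced it.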

Tip: The sequential execution of plugins is intentional—it ensures deterministic behavior. Plugin A's sourceNodes always runs before Plugin B's, which matters when plugins depend on nodes created by other plugins.

Four Webpack Stages

Gatsby uses four distinct webpack configurations, each serving a different purpose. The factory function in packages/gatsby/src/utils/webpack.config.js takes a stage parameter and produces the appropriate config:

flowchart TD
    subgraph "Development"
        A["develop"] -->|"Hot reload, CSS injection"| A1["Browser bundle"]
        B["develop-html"] -->|"No HMR, SSR target"| B1["HTML renderer"]
    end

    subgraph "Production"
        C["build-javascript"] -->|"Minified, chunked"| C1["Browser JS/CSS"]
        D["build-html"] -->|"Node target, SSR"| D1["HTML renderer"]
    end

    style A fill:#e8f5e9
    style B fill:#e8f5e9
    style C fill:#e1f5fe
    style D fill:#e1f5fe

| Stage | Target | Purpose |
| --- | --- | --- |
| develop | Browser | Dev server with React Fast Refresh and CSS hot reload |
| develop-html | Node | SSR renderer for dev mode (without HMR plugins) |
| build-javascript | Browser | Production JS and CSS bundles with code splitting |
| build-html | Node | Production HTML renderer for static generation |

The comment at lines 40–44 is refreshingly clear about this:

// Four stages or modes:
//   1) develop: for `gatsby develop` command, hot reload and CSS injection into page
//   2) develop-html: same as develop without react-hmre in the babel config for html renderer
//   3) build-javascript: Build JS and CSS chunks for production
//   4) build-html: build all HTML files

Plugins can modify any of these configs via the onCreateWebpackConfig hook, which receives the stage parameter so plugins can apply stage-specific modifications.
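
As a sketch of a stage-specific modification, the hook below swaps a browser-only module for a null loader in the two Node-targeted stages. The local types are minimal stand-ins for the ones a site would import from the gatsby package, and `browser-only-module` is a hypothetical package name. In a real gatsby-node file the function would be exported.

```typescript
// Minimal stand-in types; a real site would use the types shipped with `gatsby`.
type WebpackStage = "develop" | "develop-html" | "build-javascript" | "build-html";
type WebpackConfigPatch = { module?: { rules: Array<{ test: RegExp; use: unknown }> } };
type HookArgs = {
  stage: WebpackStage;
  loaders: { null: () => unknown };
  actions: { setWebpackConfig: (patch: WebpackConfigPatch) => void };
};

// Applies a config change only when webpack is building the HTML renderer.
const onCreateWebpackConfig = ({ stage, loaders, actions }: HookArgs): void => {
  if (stage === "build-html" || stage === "develop-html") {
    actions.setWebpackConfig({
      module: {
        rules: [
          {
            // Hypothetical package that touches `window` at import time.
            test: /browser-only-module/,
            use: loaders.null(),
          },
        ],
      },
    });
  }
};
```

The browser-targeted stages are left untouched, so the module still ships to the client while the server-side render skips it.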

In the build command, only the production stages are used. At line 160, buildProductionBundle runs the build-javascript stage. Then at line 184, buildRenderer runs the build-html stage.

Query Execution and HTML Generation

After webpack bundles are ready, Gatsby enters the query execution phase. At line 302, calculateDirtyQueries determines which queries need to run (for incremental builds, only changed queries execute).

A critical optimization happens at lines 306–308:

queryIds.pageQueryIds = queryIds.pageQueryIds.filter(
  query => getPageMode(query) === `SSG`
)

Only SSG pages have their queries run at build time. DSG pages defer query execution to the first request, and SSR pages execute queries on every request. This is the rendering mode system at work—we'll cover it in detail in Part 5.
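
The effect of that filter is easiest to see on a small data set. In this sketch the mode is stored directly on each page record and `getPageModeSketch` simply reads it; both are assumptions, since the real getPageMode derives the mode from page and component state.

```typescript
type PageMode = "SSG" | "DSG" | "SSR";
type PageRecord = { path: string; mode: PageMode };
type QueryIds = { staticQueryIds: string[]; pageQueryIds: PageRecord[] };

// Hypothetical stand-in for getPageMode.
function getPageModeSketch(page: PageRecord): PageMode {
  return page.mode;
}

// Keeps only the page queries that must run at build time.
function filterBuildTimeQueries(ids: QueryIds): QueryIds {
  return {
    ...ids,
    pageQueryIds: ids.pageQueryIds.filter(page => getPageModeSketch(page) === "SSG"),
  };
}

// Made-up pages covering all three rendering modes.
const allQueryIds: QueryIds = {
  staticQueryIds: ["sq--header"],
  pageQueryIds: [
    { path: "/", mode: "SSG" },
    { path: "/archive/2019/", mode: "DSG" },
    { path: "/account/", mode: "SSR" },
    { path: "/about/", mode: "SSG" },
  ],
};
```

Static queries are untouched by the filter; only page queries are narrowed, so the DSG and SSR pages above cost nothing at build time.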

For multi-core machines, queries run in a worker pool (line 330). After queries complete, buildHTMLPagesAndDeleteStaleArtifacts at line 507 generates the final HTML files.

Cache Management and Build Persistence

Throughout the build, state is persisted to the .cache/ directory using LMDB. The Redux store in packages/gatsby/src/redux/index.ts defines exactly which state slices are persisted:

const persistedReduxKeys = [
  `nodes`, `typeOwners`, `statefulSourcePlugins`, `status`,
  `components`, `jobsV2`, `staticQueryComponents`,
  `webpackCompilationHash`, `pageDataStats`, `pages`,
  `staticQueriesByTemplate`, `pendingPageDataWrites`,
  `queries`, `html`, `slices`, `slicesByTemplate`,
]

On the next build, these slices are read back from LMDB via readState() at store creation time. This is how Gatsby knows which queries are "dirty" and which HTML files are "stale"—it compares current state against persisted state.
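
The comparison idea can be reduced to a few lines. The sketch below assumes each query result is summarized by a digest string keyed by page path, which is a simplification made for illustration; Gatsby's own bookkeeping in the `queries` and `html` slices is richer than a single digest per page.

```typescript
// Hypothetical shape: page path -> digest of the inputs that produced its query result.
type QueryDigests = Record<string, string>;

// A query is dirty if it is new or its digest changed since the previous build.
function findDirtyQueries(previous: QueryDigests, current: QueryDigests): string[] {
  return Object.keys(current).filter(path => previous[path] !== current[path]);
}

// An artifact is stale if its page existed in the previous build but no longer does.
function findStalePaths(previous: QueryDigests, current: QueryDigests): string[] {
  return Object.keys(previous).filter(path => !(path in current));
}
```

With an empty previous state, as after `gatsby clean`, every query counts as dirty and nothing is stale, which matches the behavior of a full rebuild.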

flowchart LR
    subgraph ".cache/ Directory"
        A["data/datastore (LMDB)"] -->|Nodes| B["Redux State"]
        C["caches-lmdb/"] -->|Plugin caches| D["Per-plugin data"]
        E["redux.state (deprecated)"] -.->|Legacy| B
    end

    subgraph "Build"
        F["Previous Build State"] --> G["Calculate Dirty Queries"]
        G --> H["Only rebuild changed pages"]
    end

    B --> F

The saveState() function at lines 133–146 uses writeToCache with the persisted keys. Note the GATSBY_DISABLE_CACHE_PERSISTENCE escape hatch—this was added to work around Node.js v8.serialize buffer size limits on extremely large sites.
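
A sketch of that shape: pick the persisted slices out of the full state, and skip the write entirely when the escape hatch is set. The function names, the shortened key list, and the injected `env` and `write` parameters are assumptions chosen to keep the example self-contained; only the GATSBY_DISABLE_CACHE_PERSISTENCE variable name comes from the source.

```typescript
// Shortened list for illustration; the full list appears earlier in this section.
const persistedKeys = ["nodes", "pages", "queries", "html"];

type FullState = Record<string, unknown>;

// Copies only the persisted slices that are present in the state.
function pickPersisted(state: FullState): FullState {
  const picked: FullState = {};
  for (const key of persistedKeys) {
    if (key in state) picked[key] = state[key];
  }
  return picked;
}

// Returns true if state was written, false if persistence was disabled.
function saveStateSketch(
  state: FullState,
  env: Record<string, string | undefined>,
  write: (slices: FullState) => void
): boolean {
  if (env.GATSBY_DISABLE_CACHE_PERSISTENCE) {
    return false;
  }
  write(pickPersisted(state));
  return true;
}
```

Slices outside the list, such as transient program options, never reach the cache, so they are always recomputed on the next run.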

Adapter Integration

The final stage of the build pipeline is adapter deployment. At lines 645–648:

if (adapterManager) {
  await adapterManager.storeCache()
  await adapterManager.adapt()
}

The adapter manager (created during initialization at lines 189–192 of initialize.ts) collects a RoutesManifest and FunctionsManifest from the build output and passes them to the platform adapter, which transforms them into the deployment platform's native format. We'll cover adapters in Part 5.
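
To give a feel for that transformation, here is a sketch that turns simplified manifests into a list of routing rules. The route and function shapes are trimmed-down assumptions (the real manifests carry more fields), and both `adaptToPlatform` and the output rule format are invented for this example.

```typescript
// Simplified manifest entries; assumed shapes, not Gatsby's exact types.
type RouteEntry =
  | { type: "static"; path: string; filePath: string }
  | { type: "function"; path: string; functionId: string }
  | { type: "redirect"; path: string; toPath: string; status: number };
type FunctionEntry = { functionId: string; pathToEntryPoint: string };
type AdaptArgs = { routesManifest: RouteEntry[]; functionsManifest: FunctionEntry[] };

// Translates each route into a rule string for a hypothetical hosting platform.
function adaptToPlatform({ routesManifest, functionsManifest }: AdaptArgs): string[] {
  const knownFunctions = new Set(functionsManifest.map(fn => fn.functionId));
  const rules: string[] = [];
  for (const route of routesManifest) {
    switch (route.type) {
      case "static":
        rules.push(`${route.path} -> file ${route.filePath}`);
        break;
      case "function":
        // A function route must point at a function the build actually produced.
        if (!knownFunctions.has(route.functionId)) {
          throw new Error(`Unknown function: ${route.functionId}`);
        }
        rules.push(`${route.path} -> function ${route.functionId}`);
        break;
      case "redirect":
        rules.push(`${route.path} -> ${route.status} ${route.toPath}`);
        break;
    }
  }
  return rules;
}
```

The point of the manifest indirection is that Gatsby core never needs to know a platform's config format; each adapter owns that translation.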

Tip: If a build fails partway through, the .cache/ directory may contain partial state. Running gatsby clean removes it entirely, forcing a full rebuild. This is the nuclear option for debugging cache-related issues.

What's Next

We've seen the build pipeline as a sequential conveyor belt. But gatsby develop faces a much harder problem: it needs to react to changes in real time while maintaining consistency. In the next article, we'll dive into the XState state machines that orchestrate the development server—the most architecturally distinctive feature of the entire codebase.