Fetching, Caching, and the Offline Mirror
Prerequisites
- Articles 1-3
- Basic understanding of HTTP requests and tarballs
- Familiarity with content hashing (SHA-1, SHA-512)
After resolution determines which versions to install, the fetch phase downloads them. Yarn's fetch layer is built around two priorities: never download the same package twice, and verify every byte. This article covers the five fetcher strategies, the cache directory layout, integrity validation, and the RequestManager that handles DNS caching, retries, and offline operation.
The Fetcher Strategy Pattern
Just as resolvers are selected by dependency specifier type, fetchers are selected by the remote.type field set during resolution. The src/fetchers/index.js registry maps five types to their implementations:
```mermaid
classDiagram
  class BaseFetcher {
    +dest: string
    +remote: PackageRemote
    +config: Config
    +hash: string
    +_fetch(): FetchedOverride
    +fetch(defaultManifest): FetchedMetadata
    +setupMirrorFromCache(): void
  }
  class TarballFetcher {
    +_fetch(): downloads and extracts .tgz
    +setupMirrorFromCache(): copies to offline mirror
  }
  class GitFetcher {
    +_fetch(): clones git repo
    +fetchFromLocal(): tries cached tarball
    +fetchFromExternal(): git clone
  }
  class CopyFetcher {
    +_fetch(): copies from filesystem
  }
  class WorkspaceFetcher {
    +_fetch(): symlinks workspace
  }
  BaseFetcher <|-- TarballFetcher
  BaseFetcher <|-- GitFetcher
  BaseFetcher <|-- CopyFetcher
  BaseFetcher <|-- WorkspaceFetcher
```
| Remote Type | Fetcher | Source |
|---|---|---|
| `tarball` | TarballFetcher | Registry tarballs (most common) |
| `git` | GitFetcher | Git repositories |
| `copy` | CopyFetcher | Local filesystem (`file:` dependencies) |
| `workspace` | WorkspaceFetcher | Workspace packages |
| `link` | (mock) | Symlinked dependencies (no real fetch) |
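The dispatch from remote type to fetcher class can be sketched as follows. This is an illustrative simplification, not Yarn's source: the class bodies are stubs, and the real `src/fetchers/index.js` simply exports the map while callers do the lookup.

```javascript
// Simplified sketch of fetcher selection by remote.type.
// Class names mirror the diagram above; the _fetch() bodies are stubs.
class BaseFetcher {
  constructor(dest, remote) {
    this.dest = dest;     // where the package will be extracted
    this.remote = remote; // {type, reference, ...} set during resolution
  }
  async _fetch() {
    throw new Error('subclass must implement _fetch()');
  }
}

class TarballFetcher extends BaseFetcher {
  async _fetch() { return {source: 'tarball', url: this.remote.reference}; }
}
class GitFetcher extends BaseFetcher {
  async _fetch() { return {source: 'git', url: this.remote.reference}; }
}
class CopyFetcher extends BaseFetcher {
  async _fetch() { return {source: 'copy', path: this.remote.reference}; }
}
class WorkspaceFetcher extends BaseFetcher {
  async _fetch() { return {source: 'workspace', path: this.remote.reference}; }
}

// The registry: remote.type -> fetcher class.
const fetchers = {
  tarball: TarballFetcher,
  git: GitFetcher,
  copy: CopyFetcher,
  workspace: WorkspaceFetcher,
};

function makeFetcher(dest, remote) {
  const Fetcher = fetchers[remote.type];
  if (!Fetcher) throw new Error('unknown remote type: ' + remote.type);
  return new Fetcher(dest, remote);
}
```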
The BaseFetcher provides the shared fetch() method that all subclasses use. It follows a consistent sequence:
1. Lock the destination directory (via `fs.lockQueue`) to prevent concurrent writes.
2. Create the destination directory.
3. Call `_fetch()` (subclass-specific) to download and extract the package.
4. Read the manifest from the extracted contents.
5. Set up bin links within the cache entry (lines 67-92).
6. Write `.yarn-metadata.json` alongside the package (line 94), containing the manifest, remote info, and hash.
The metadata file is crucial—it's what enables cache validation on subsequent installs without re-downloading anything.
Fetch Orchestration and Cache Lookup
The package-fetcher.js module orchestrates the entire fetch phase. The main entry point, fetch(), processes all resolved manifests in parallel (up to networkConcurrency):
```mermaid
flowchart TD
  A["fetch(manifests, config)"] --> B["For each manifest"]
  B --> C["fetchOne(ref, config)"]
  C --> D["generateModuleCachePath(ref)"]
  D --> E{"isValidModuleDest(dest)?"}
  E -->|yes| F["fetchCache(dest, fetcher, config, remote)"]
  E -->|no| G["Unlink dest, call fetcher.fetch()"]
  F --> H{"Integrity match?"}
  H -->|yes| I["Return cached package ✓"]
  H -->|no| J["Throw SecurityError ✗"]
  G --> K["Return fresh package"]
```
The fetchCache() function performs integrity validation on cached packages. It reads the stored .yarn-metadata.json, extracts the cached integrity hash, and compares it against the expected integrity from the lockfile using SSRI (Standard Subresource Integrity):
```js
if (remote.integrity) {
  if (!cacheIntegrity || !ssri.parse(cacheIntegrity).match(remote.integrity)) {
    throw new SecurityError(/* ... */);
  }
}
```
If the integrity check fails, Yarn throws a SecurityError—not a soft warning, but a hard failure. This is a deliberate security-first design: a corrupted cache entry is treated as a potential supply chain attack.
After fetching, the fetch() function (line 109) updates each manifest's _reference and _remote with the fresh hash, and handles checksum updates when --update-checksums is passed (lines 146-157).
Tip: If you see `fetchBadIntegrityCache` or `fetchBadHashCache` errors, try `yarn cache clean` to clear the cache. If the error persists, the registry itself may be serving corrupted tarballs—use `--update-checksums` to re-verify.
Cache Layout and Fallback Chain
The cache directory is versioned to avoid backward-incompatibility issues. In Config.init(), the cache folder is resolved through a fallback chain:
1. `--cache-folder` CLI flag (absolute override)
2. `cache-folder` from `.yarnrc`
3. `--preferred-cache-folder` or the `preferred-cache-folder` RC option
4. Platform-specific defaults via `getPreferredCacheDirectories()`:
   - XDG cache dir (Linux: `~/.cache/yarn`)
   - `/tmp/.yarn-cache-<uid>` (per-user fallback)
   - `/tmp/.yarn-cache` (shared fallback)
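The fallback chain amounts to "first defined option wins." A minimal sketch, with a hypothetical helper name and option shape (not `Config.init()` verbatim):

```javascript
// First-defined-wins resolution of the cache folder.
function resolveCacheFolder(opts) {
  const candidates = [
    opts.cliCacheFolder,       // --cache-folder flag (absolute override)
    opts.rcCacheFolder,        // cache-folder from .yarnrc
    opts.preferredCacheFolder, // preferred-cache-folder flag or RC option
    ...opts.platformDefaults,  // XDG dir, then the /tmp fallbacks
  ];
  return candidates.find(c => c != null);
}

const folder = resolveCacheFolder({
  cliCacheFolder: undefined,
  rcCacheFolder: undefined,
  preferredCacheFolder: undefined,
  platformDefaults: ['/home/me/.cache/yarn', '/tmp/.yarn-cache-1000', '/tmp/.yarn-cache'],
});
```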
The CACHE_VERSION constant (currently 6) is appended as a subdirectory: the actual cache lives at <cache-root>/v6/. Bumping this number invalidates all cached packages—a blunt but effective cache-busting mechanism.
```mermaid
flowchart TD
  A["Cache Root<br/>(~/.cache/yarn)"] --> B["v6/"]
  B --> C["npm-lodash-4.17.21-<hash>/"]
  C --> D[".yarn-metadata.json"]
  C --> E[".yarn-tarball.tgz"]
  C --> F["package.json"]
  C --> G["lodash.js"]
  C --> H[".bin/"]
  B --> I["npm-express-4.18.2-<hash>/"]
  B --> J[".tmp/"]
```
Each cached package directory is named with a deterministic hash derived from the package name, version, and remote reference. Inside, .yarn-metadata.json stores the manifest and remote information, while .yarn-tarball.tgz is the original tarball (used for the offline mirror).
The Request Manager
The RequestManager is Yarn's HTTP client, handling all network communication with registries. It's initialized early in Config's constructor and configured during init().
Several features make it more than a simple HTTP wrapper:
DNS caching: At the module level (line 22), dnscache is initialized to cache DNS lookups for 300 seconds with up to 10 entries. This prevents redundant DNS resolution when downloading hundreds of packages from the same registry.
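The effect of `dnscache` can be sketched as a TTL-bounded memo table (illustrative only; the real package patches Node's `dns` module and handles async lookups):

```javascript
// TTL-bounded DNS cache sketch: repeated lookups for the same host
// within the TTL are served from memory.
function makeDnsCache(lookup, ttlMs, maxEntries) {
  const cache = new Map(); // hostname -> {ip, expires}
  return hostname => {
    const hit = cache.get(hostname);
    if (hit && hit.expires > Date.now()) return hit.ip; // cache hit
    const ip = lookup(hostname); // miss: perform a real lookup
    if (cache.size >= maxEntries) {
      cache.delete(cache.keys().next().value); // evict the oldest entry
    }
    cache.set(hostname, {ip, expires: Date.now() + ttlMs});
    return ip;
  };
}

let lookups = 0;
// ttl=300s and up to 10 entries, matching the values cited above.
const resolveHost = makeDnsCache(host => { lookups++; return '203.0.113.7'; },
                                 300000, 10);
resolveHost('registry.yarnpkg.com');
resolveHost('registry.yarnpkg.com'); // second call served from cache
```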
Concurrency control: The manager uses its own queue system with NETWORK_CONCURRENCY (default: 8) as the limit. This prevents overwhelming registries and avoids file descriptor exhaustion.
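A minimal sketch of that kind of concurrency queue: at most `limit` tasks run at once and the rest wait their turn. Yarn's queue is more elaborate, but the backpressure idea is the same.

```javascript
// Promise-based concurrency limiter: wrap each task in limit(task)
// and at most `limit` wrapped tasks execute simultaneously.
function makeLimiter(limit) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active++;
    const {task, resolve, reject} = waiting.shift();
    task().then(resolve, reject).finally(() => {
      active--;
      next(); // a slot freed up: start the next waiting task
    });
  };
  return task => new Promise((resolve, reject) => {
    waiting.push({task, resolve, reject});
    next();
  });
}
```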
Success host tracking: A module-level successHosts map (line 27) remembers which hosts have responded successfully. Combined with prefer-offline mode, this enables Yarn to skip network requests for hosts that previously worked—falling back to cache when possible.
Retry with backoff: Failed requests are retried up to maxRetryAttempts (default: 5) times. The request is re-queued with incremented retryAttempts.
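The retry loop can be sketched as follows; `requestWithRetry` is a hypothetical helper, and the linear backoff schedule is an assumption for illustration (the RequestManager re-queues the request object instead of looping):

```javascript
// Retry a failing request up to maxRetryAttempts times, waiting a
// little longer after each failure before trying again.
async function requestWithRetry(doRequest, maxRetryAttempts = 5, baseDelayMs = 100) {
  let retryAttempts = 0;
  for (;;) {
    try {
      return await doRequest();
    } catch (err) {
      retryAttempts++;
      if (retryAttempts >= maxRetryAttempts) throw err; // give up
      // back off: delay grows with each failed attempt
      await new Promise(r => setTimeout(r, baseDelayMs * retryAttempts));
    }
  }
}
```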
HAR capture: When --har is passed, RequestCaptureHar records all HTTP traffic to a HAR file for debugging network issues.
OTP handling: For npm publish operations, the manager intercepts 401 responses with one-time-pass requirements and prompts for OTP codes.
```mermaid
sequenceDiagram
  participant F as Fetcher
  participant RM as RequestManager
  participant DNS as DNS Cache
  participant R as Registry
  F->>RM: request(url, options)
  RM->>DNS: Resolve hostname
  DNS-->>RM: Cached IP
  RM->>R: HTTP GET
  R-->>RM: 200 + tarball
  RM->>RM: Track successHost
  RM-->>F: Response body
  Note over RM: On failure: retry up to 5x
```
The TarballFetcher is the most frequently used fetcher. It downloads .tgz files from registries, streams them through gunzip-maybe and tar-fs, and verifies the hash. Its setupMirrorFromCache() method (line 46) copies cached tarballs to the offline mirror directory when configured.
The GitFetcher has a more complex flow. It first tries local paths (mirror, cache), then falls back to a full git clone. If the repository has a prepare script, it actually runs yarn install inside the cloned repo (line 155), then packs the result—a mini-install within the install.
Tip: The offline mirror feature (`yarn-offline-mirror` in `.yarnrc`) copies every downloaded tarball to a directory you control. Commit this directory to version control for fully reproducible, zero-network installs. The `TarballFetcher.setupMirrorFromCache()` method handles the copying.
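Enabling the mirror typically looks like this in `.yarnrc` (the directory name is arbitrary; `yarn-offline-mirror-pruning` additionally removes tarballs for packages no longer in the lockfile):

```
# .yarnrc
yarn-offline-mirror "./npm-packages-offline-cache"
yarn-offline-mirror-pruning true
```

After the next `yarn install`, every fetched tarball lands in that directory, and installs on machines without registry access can resolve entirely from it.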
What's Next
With packages resolved and fetched into the cache, the next challenge is the most algorithmically complex: flattening the dependency tree into a node_modules directory. In Article 5, we'll dissect the hoisting algorithm, the taint system that prevents version conflicts, workspace nohoist, bin linking, and Plug'n'Play as an alternative that eliminates node_modules entirely.