From JVM Launch to Cluster Join: The Elasticsearch Startup Sequence
Prerequisites
- Article 1: Architecture Overview
- Java module system (JPMS) basics
- Dependency injection concepts
Starting an Elasticsearch node is deceptively complex. What appears to be a simple java -jar invocation triggers a precisely choreographed startup sequence that must initialize logging before anything else, lock memory and install security sandboxes before any threads spawn, wire together dozens of interdependent services in the correct order, and — critically — start accepting HTTP traffic only after the node is fully ready. One misstep in this ordering and the node either crashes or accepts requests it can't properly handle.
This article traces the complete path from Elasticsearch.main() through the final HttpServerTransport.start().
Three-Phase Bootstrap in Elasticsearch.java
The entry point is Elasticsearch.main(), which dispatches to exactly three phases:
```java
public static void main(final String[] args) {
    Bootstrap bootstrap = initPhase1();
    assert bootstrap != null;
    try {
        initPhase2(bootstrap);
        initPhase3(bootstrap);
    } catch (NodeValidationException e) {
        bootstrap.exitWithNodeValidationException(e);
    } catch (Throwable t) {
        bootstrap.exitWithUnknownException(t);
    }
}
```
```mermaid
sequenceDiagram
    participant JVM
    participant Phase1
    participant Phase2
    participant Phase3
    JVM->>Phase1: initPhase1()
    Note over Phase1: Static init, read CLI args,<br/>configure logging LAST
    Phase1-->>JVM: Bootstrap object
    JVM->>Phase2: initPhase2(bootstrap)
    Note over Phase2: Log system info, PID file,<br/>native access, JarHell,<br/>plugin loading, entitlements
    JVM->>Phase3: initPhase3(bootstrap)
    Note over Phase3: Construct Node,<br/>start Node,<br/>signal readiness
```
Phase 1: Static Init and Logging
initPhase1() does the absolute minimum: initializes security properties, reads ServerArgs from stdin (the CLI launcher process pipes them in), creates a basic Environment, and configures logging. The source code contains an emphatic comment:
```java
// DO NOT MOVE THIS
// Logging must remain the last step of phase 1.
```
This constraint exists because any initialization step that needs logging must happen in Phase 2, after logging is configured. Phase 1 writes exceptions directly to stderr because the logging framework isn't ready yet.
Phase 2: Security and Native Initialization
initPhase2() is the most complex bootstrap phase. It handles:
- System info logging — JVM version, OS, build hash
- PID file creation with a shutdown hook for cleanup
- Uncaught exception handler registration
- Native controller spawning (for ML or other native processes)
- Native access initialization — memory locking (mlockall), system call filters, coredump filter configuration
- JarHell check — scans for duplicate classes on the classpath
- Plugin loading — loads module and plugin bundles, creates JPMS module layers
- Entitlement bootstrap — the new security system replacing SecurityManager, with a self-test that verifies process creation is properly blocked

Tip: The entitlement system (EntitlementBootstrap) is Elasticsearch's replacement for the deprecated Java SecurityManager. It uses bytecode instrumentation to intercept sensitive operations. After bootstrapping, a self-test at line 274 attempts to start a process and verifies that it's properly denied.
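The self-test idea can be sketched in miniature. This is an illustration of the pattern, not the real `EntitlementBootstrap` API: a checker consults a policy before a sensitive operation, and the bootstrap verifies the sandbox is live by attempting a forbidden operation and confirming it is denied. All names here are hypothetical.

```java
import java.util.Set;

// Toy illustration of an entitlement self-test (names are illustrative,
// not Elasticsearch's actual classes): a checker consults a policy before
// a sensitive operation, and bootstrap verifies denial actually happens.
public class EntitlementSelfTest {
    // Entitlements granted in this toy policy; process spawning is absent.
    private static final Set<String> GRANTED = Set.of("read_files");

    static void checkEntitlement(String entitlement) {
        if (!GRANTED.contains(entitlement)) {
            throw new SecurityException("denied: " + entitlement);
        }
    }

    /** Returns true if process creation was properly blocked. */
    static boolean selfTestProcessCreationBlocked() {
        try {
            checkEntitlement("spawn_process"); // deliberately not granted
            return false; // operation was allowed -> sandbox is NOT active
        } catch (SecurityException expected) {
            return true;  // denial observed -> sandbox is active
        }
    }

    public static void main(String[] args) {
        if (!selfTestProcessCreationBlocked()) {
            throw new IllegalStateException("entitlement self-test failed");
        }
        System.out.println("self-test passed: process creation is denied");
    }
}
```

The key design point survives the simplification: the test must fail loudly at startup if the forbidden operation succeeds, because a silently inactive sandbox is worse than a crash.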
Phase 3: Node Construction and Startup
initPhase3() first verifies Lucene version compatibility, then constructs the Node, starts it, and signals readiness to the parent CLI process. The readiness signal is the last thing that happens — another "DO NOT MOVE THIS" comment guards this ordering:
```java
// DO NOT MOVE THIS
// Signaling readiness to accept requests must remain the last step
```
NodeConstruction: The 1,888-Line Orchestration
The Node constructor delegates the heavy lifting to NodeConstruction.prepareConstruction(). This method is the largest single orchestration point in the codebase — wiring together dozens of services in strict dependency order.
```mermaid
flowchart TD
    START[prepareConstruction] --> ENV[createEnvironment<br/>Plugin loading, settings merge]
    ENV --> TEL[TelemetryProvider]
    TEL --> TP[ThreadPool creation]
    TP --> SM[SettingsModule validation]
    SM --> PR[ProjectResolver<br/>single vs multi-project]
    PR --> SEARCH[SearchModule]
    SEARCH --> REG[Client & Registries<br/>NamedWriteableRegistry, XContentRegistry]
    REG --> SCRIPT[ScriptService]
    SCRIPT --> ANALYSIS[AnalysisRegistry]
    ANALYSIS --> CONSTRUCT[construct<br/>ClusterService, IngestService,<br/>IndicesService, TransportService,<br/>ActionModule, and more]
    CONSTRUCT --> GUICE[Guice binding & Injector creation]
```
The construction order is dictated by dependencies. For example, ThreadPool must exist before SettingsModule (because settings validation uses thread context), and SearchModule must exist before ActionModule (because action registration depends on search capabilities).
Let's look at the top of prepareConstruction:
```java
static NodeConstruction prepareConstruction(
    Environment initialEnvironment,
    PluginsLoader pluginsLoader,
    NodeServiceProvider serviceProvider,
    boolean forbidPrivateIndexSettings
) {
    List<Closeable> closeables = new ArrayList<>();
    try {
        NodeConstruction constructor = new NodeConstruction(closeables);
        Settings settings = constructor.createEnvironment(initialEnvironment, serviceProvider, pluginsLoader);
        // ...
```
The closeables list is a cleanup mechanism — if construction fails partway through, all already-created resources are properly closed. This is critical for a system that opens file handles, thread pools, and network connections during initialization.
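The close-on-failure pattern can be sketched in isolation (service names here are illustrative, and the reverse-order close stands in for what Elasticsearch does with its `IOUtils` helpers): every resource registers itself the moment it is created, and if a later step throws, everything registered so far is closed in reverse creation order.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of cleanup-on-failure during construction: resources register as
// they are created; a failure closes them in reverse order and rethrows.
public class CleanupOnFailure {
    static final List<String> closed = new ArrayList<>(); // records close order

    static void construct(boolean failMidway) {
        List<Closeable> closeables = new ArrayList<>();
        try {
            closeables.add(() -> closed.add("threadPool"));
            closeables.add(() -> closed.add("transport"));
            if (failMidway) {
                throw new IllegalStateException("simulated construction failure");
            }
            closeables.add(() -> closed.add("httpServer")); // never reached on failure
        } catch (RuntimeException e) {
            // Close in reverse creation order so dependents close before
            // the things they depend on.
            Collections.reverse(closeables);
            for (Closeable c : closeables) {
                try {
                    c.close();
                } catch (IOException ignored) {
                    // suppress close failures; the original exception wins
                }
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            construct(true);
        } catch (IllegalStateException expected) {
            // construction failed as simulated
        }
        System.out.println(closed); // [transport, threadPool]
    }
}
```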
The Environment and Plugin Loading Phase
createEnvironment() creates the PluginsService, merges plugin-provided settings with user settings, and builds the final Environment. This is where the PluginsService — the runtime container for all plugins — comes to life.
The ProjectResolver initialization at line 325 is noteworthy — it's part of the emerging multi-project architecture for serverless deployments. In the default single-project mode, it resolves to ProjectResolverFactory.DEFAULT.
Plugin Loading via Module Layers
Elasticsearch uses the Java Platform Module System (JPMS) to isolate plugins. Each plugin gets its own module layer and class loader, created during Phase 2 by PluginsLoader. This provides several guarantees:
- Plugins cannot access each other's internals
- Different plugins can depend on different versions of the same library
- Plugin code can be selectively granted entitlements (permissions)
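The underlying JPMS mechanics can be shown generically. This is not `PluginsLoader`'s actual code, just the standard-library calls such a loader builds on: resolve the plugin's modules against a parent layer's configuration, then define a child layer with its own class loader.

```java
import java.lang.module.Configuration;
import java.lang.module.ModuleFinder;
import java.util.Set;

// Generic JPMS mechanics behind per-plugin layers (not PluginsLoader's
// actual code): resolve modules against a parent layer, then define a
// child layer backed by a single fresh class loader.
public class PluginLayerSketch {
    static ModuleLayer defineChildLayer(ModuleFinder pluginJars, Set<String> rootModules) {
        ModuleLayer parent = ModuleLayer.boot();
        Configuration cf = parent.configuration()
            .resolve(pluginJars, ModuleFinder.of(), rootModules);
        // One loader per layer keeps a plugin's classes invisible to siblings.
        return parent.defineModulesWithOneLoader(cf, ClassLoader.getSystemClassLoader());
    }

    public static void main(String[] args) {
        // With an empty finder and no root modules this produces an empty
        // child layer, but the parent/child layer relationship is real.
        ModuleLayer child = defineChildLayer(ModuleFinder.of(), Set.of());
        System.out.println(child.parents().contains(ModuleLayer.boot())); // true
    }
}
```

In the real system, `pluginJars` would be a `ModuleFinder.of(pluginDir)` over the plugin's jar files, and each plugin gets its own child layer hanging off the server layer rather than the boot layer.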
```mermaid
graph TD
    BOOT[Boot Module Layer<br/>JDK modules] --> SERVER[Server Module Layer<br/>Elasticsearch core]
    SERVER --> MOD1[Module Layer: lang-painless]
    SERVER --> MOD2[Module Layer: repository-s3]
    SERVER --> MOD3[Module Layer: x-pack-security]
    SERVER --> MODN[Module Layer: ...]
```
The Plugin.PluginServices interface serves as a dependency injection bridge — plugins receive a PluginServices instance during createComponents() that gives them access to Client, ClusterService, ThreadPool, and other core services without requiring Guice bindings.
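A toy version of that bridge pattern looks like this. The names below are illustrative, not the real `Plugin.PluginServices` interface: the node bundles the core services into one object and hands it to each plugin, so plugins never need their own Guice bindings.

```java
import java.util.List;

// Toy version of the PluginServices bridge (all names illustrative, not
// the real Elasticsearch interfaces): the node constructs one bundle of
// core services and passes it to each plugin's createComponents().
public class PluginServicesSketch {
    // Stand-ins for Client, ClusterService, ThreadPool, etc.
    record Services(String clusterName, Runnable threadPoolHandle) { }

    interface Plugin {
        List<Object> createComponents(Services services);
    }

    static class MyPlugin implements Plugin {
        @Override
        public List<Object> createComponents(Services services) {
            // The plugin pulls exactly what it needs out of the bundle.
            return List.of("component-for-" + services.clusterName());
        }
    }

    public static void main(String[] args) {
        Services services = new Services("my-cluster", () -> { });
        System.out.println(new MyPlugin().createComponents(services)); // [component-for-my-cluster]
    }
}
```

The design choice mirrors the source's point: adding a new service to the bundle is one record/interface change, not a new injection binding per plugin.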
Node.start(): Service Startup Ordering
After construction, Node.start() starts services in a carefully specified order. This sequence can be grouped into five tiers:
```mermaid
sequenceDiagram
    participant Node
    participant Tier1 as Tier 1: Foundations
    participant Tier2 as Tier 2: Cluster
    participant Tier3 as Tier 3: Coordination
    participant Tier4 as Tier 4: Join
    participant Tier5 as Tier 5: HTTP (LAST)
    Node->>Tier1: Plugin lifecycle, IndicesService,<br/>SnapshotsService, SearchService,<br/>FsHealthService, NodeMetrics
    Node->>Tier2: ClusterService, NodeConnectionsService,<br/>GatewayService
    Node->>Tier3: TransportService.start(),<br/>Coordinator.start(),<br/>ClusterService.start()
    Node->>Tier4: coordinator.startInitialJoin()<br/>Wait for master (with timeout)
    Node->>Tier5: HttpServerTransport.start()<br/>ReadinessService.start()
```
The first tier starts foundational services — IndicesService (index management), SnapshotsService, SearchService, and monitoring components. These don't depend on cluster state.
The second tier starts cluster-aware services. NodeConnectionsService manages persistent connections to other nodes. GatewayService handles metadata recovery from disk.
The third tier is where the node joins the cluster. TransportService.start() binds the transport port and makes the node reachable. Then Coordinator.start() and ClusterService.start() bring up the consensus layer. The coordinator begins its initial join process.
At lines 378-423, there's a blocking wait with timeout for the initial cluster state. If the node can't find a master within the configured timeout, it logs a warning with a troubleshooting reference link.
Finally, at line 428 — after a prominent DO NOT ADD NEW START CALLS BELOW HERE comment — HttpServerTransport.start() opens the HTTP port. This deliberate ordering ensures that no HTTP request can arrive before the node is fully operational.
Tip: If you're debugging startup issues, the tier ordering tells you which services might not be initialized yet. A failure in the transport tier means HTTP never started — the node won't even be reachable for debugging via REST APIs.
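The tiered ordering boils down to a simple invariant, sketched here with illustrative service names (not the real `Node.start()` code): services start strictly in list order, and the HTTP transport is deliberately last so no request can arrive before everything it depends on is running.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the tiered start ordering (service names illustrative):
// start strictly in list order, HTTP transport last.
public class StartOrderSketch {
    static final List<String> started = new ArrayList<>(); // records start order

    record Service(String name) {
        void start() { started.add(name); }
    }

    public static void main(String[] args) {
        List<Service> startOrder = List.of(
            new Service("indicesService"),     // tier 1: foundations
            new Service("clusterService"),     // tier 2: cluster-aware
            new Service("transportService"),   // tier 3: coordination
            new Service("coordinatorJoin"),    // tier 4: join the cluster
            new Service("httpServerTransport") // tier 5: DO NOT ADD BELOW
        );
        startOrder.forEach(Service::start);
        System.out.println(started.get(started.size() - 1)); // httpServerTransport
    }
}
```

Encoding the order as data rather than scattered `start()` calls is one way to make the "nothing after HTTP" rule mechanically checkable; Elasticsearch enforces it by convention with the comment instead.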
ThreadPool Design and Named Pools
The ThreadPool is created early in NodeConstruction because nearly every other service depends on it. Elasticsearch defines approximately 20 named pools, each tuned for a specific workload:
| Pool Name | Type | Purpose |
|---|---|---|
| GENERIC | Scaling | Catch-all for recovery, misc tasks. Very high max size. |
| CLUSTER_COORDINATION | Fixed(1) | Single-threaded to avoid contention on Coordinator#mutex |
| SEARCH | Fixed(cpus) | Query phase execution |
| WRITE | Fixed(cpus) | Indexing operations |
| GET | Fixed(cpus) | Real-time get operations |
| MANAGEMENT | Scaling(small) | Stats collection, admin tasks |
| SNAPSHOT | Scaling | Snapshot/restore operations |
| FLUSH | Scaling | Lucene flush and translog operations |
The CLUSTER_COORDINATION pool's default size of 1 is a deliberate design choice — the Coordinator class uses a single mutex to protect all coordination state, and running coordination work on multiple threads would create contention without improving throughput. We'll explore this mutex pattern in detail in the next article.
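The property a fixed(1) pool buys can be demonstrated with a plain single-threaded executor (a generic JDK sketch, not the Coordinator's code): because tasks never run concurrently, even an unsynchronized read-modify-write on shared state is safe.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Why a fixed(1) pool serializes coordination work: a single-threaded
// executor runs submitted tasks one at a time, in submission order, so
// shared state is only ever touched by one thread.
public class SingleThreadCoordination {
    public static int runCoordinationTasks(int tasks) throws InterruptedException {
        ExecutorService coordination = Executors.newSingleThreadExecutor();
        AtomicInteger state = new AtomicInteger(); // stands in for coordination state
        for (int i = 0; i < tasks; i++) {
            // A read-then-write like this would race on a multi-threaded
            // pool; here it is safe because tasks never overlap.
            coordination.execute(() -> state.set(state.get() + 1));
        }
        coordination.shutdown();
        coordination.awaitTermination(10, TimeUnit.SECONDS);
        return state.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runCoordinationTasks(1000)); // 1000
    }
}
```

The Coordinator still holds its mutex (work can be submitted from other pools), but the single-threaded pool ensures the mutex is almost never contended, which is the throughput argument the paragraph above makes.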
Where to Go Next
Now that we understand how a single node boots up and starts its services, the next article, Part 3: Cluster Coordination, explores how multiple nodes discover each other, elect a master, and maintain a consistent view of cluster state through Elasticsearch's Raft-like consensus protocol.