Cluster Coordination: Master Election, State Management, and the Raft-Like Consensus
Prerequisites
- Article 2: Bootstrap and Startup
- Distributed systems fundamentals (consensus, leader election)
- Understanding of Raft protocol basics
In a distributed system, the hardest problem isn't storing data — it's ensuring every node agrees on what the data looks like. Elasticsearch solves this with a coordination layer built around two key ideas: an immutable ClusterState that represents the agreed-upon truth, and a Raft-inspired consensus protocol that ensures only one master can update that state at a time. This article dissects both mechanisms, from the ClusterState data structure through the Coordinator's three operating modes and the election protocol that prevents split-brain scenarios.
ClusterState: The Immutable Single Source of Truth
The ClusterState is the central data structure of the coordination layer. As its extensive Javadoc explains, it is "held in memory on all nodes in the cluster with updates coordinated by the elected master."
```mermaid
classDiagram
    class ClusterState {
        +long version
        +String stateUUID
        +ClusterName clusterName
        +Metadata metadata
        +DiscoveryNodes nodes
        +GlobalRoutingTable routingTable
        +ClusterBlocks blocks
        +CompatibilityVersions compatibilityVersions
        +Map~String,Custom~ customs
    }
    class Metadata {
        +Map indices
        +Map templates
        +Settings persistentSettings
        +CoordinationMetadata coordination
    }
    class DiscoveryNodes {
        +Map~String,DiscoveryNode~ nodes
        +String masterNodeId
        +String localNodeId
    }
    class GlobalRoutingTable {
        +Map~ProjectId,RoutingTable~ routingTables
    }
    ClusterState --> Metadata
    ClusterState --> DiscoveryNodes
    ClusterState --> GlobalRoutingTable
    ClusterState --> ClusterBlocks
```
The distinction between persistent and ephemeral state is subtle but critical. As stated in the source at ClusterState.java#L82-L84:
> "The Metadata portion is written to disk on each update so it persists across full-cluster restarts. The rest of this data is maintained only in-memory and resets back to its initial state on a full-cluster restart, but it is held on all nodes so it persists across master elections."
This means Metadata (index definitions, templates, settings) survives a full cluster restart. But DiscoveryNodes (which nodes are in the cluster) and RoutingTable (which shards are on which nodes) are rebuilt from scratch when the cluster reforms — they persist across rolling restarts because at least one node always remembers them, but not across a full stop.
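The split can be sketched in a few lines. This is a minimal illustrative model, not the real classes: the record names mirror the article, but the fields and the recovery helper are invented for demonstration.

```java
// Hypothetical sketch: only the Metadata portion is persisted to disk;
// DiscoveryNodes and the routing table live only in memory and are rebuilt
// when the cluster re-forms after a full stop.
import java.util.List;
import java.util.Map;

class StateRestartSketch {
    // Simplified stand-ins for the real ClusterState components.
    record Metadata(Map<String, Integer> indices) {}            // persisted
    record ClusterState(Metadata metadata, List<String> nodes,  // in-memory only
                        Map<String, String> routing) {}         // in-memory only

    // After a full-cluster restart, Metadata is reloaded from disk while the
    // node list and routing table start empty and are rebuilt from scratch.
    static ClusterState recoverAfterFullRestart(Metadata persisted) {
        return new ClusterState(persisted, List.of(), Map.of());
    }
}
```

During a rolling restart this helper would never run: the surviving nodes keep the in-memory portions alive, which is exactly why those parts persist across master elections but not a full stop.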
Diff-Based Publication
Cluster state updates are published using diffs rather than full snapshots. As described in the Javadoc at ClusterState.java#L96-L99, "Publication usually works by sending a diff, computed via the Diffable interface, rather than the full state, although it will fall back to sending the full state if the receiving node is new." This is essential for large clusters where the full state can be many megabytes — sending only changed parts keeps publication fast.
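The diff-or-full-state decision can be modeled as follows. This is a toy sketch under invented names; the real mechanism is the Diffable interface and is considerably more involved (per-component diffs, versioned serialization).

```java
// Hypothetical sketch of diff-based publication: send only changed keys when
// the receiver is on the expected previous version, otherwise fall back to
// shipping the full state (e.g. to a newly joined node).
import java.util.HashMap;
import java.util.Map;

class DiffPublication {
    // A diff carries the changed entries plus the version it applies on top of.
    record StateDiff(long fromVersion, Map<String, String> changedKeys) {}

    static Map<String, String> apply(Map<String, String> local, long localVersion,
                                     StateDiff diff, Map<String, String> fullState) {
        if (diff.fromVersion() == localVersion) {
            Map<String, String> next = new HashMap<>(local);
            next.putAll(diff.changedKeys());   // fast path: apply only the delta
            return next;
        }
        return new HashMap<>(fullState);       // fallback: take the full state
    }
}
```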
The Coordinator: Three Modes and State Transitions
The Coordinator is the heart of Elasticsearch's consensus implementation. It operates in one of three modes, defined at Coordinator.java#L1841-L1845:
```java
public enum Mode {
    CANDIDATE,
    LEADER,
    FOLLOWER
}
```
```mermaid
stateDiagram-v2
    [*] --> CANDIDATE: Node starts
    CANDIDATE --> LEADER: Wins election<br/>(receives quorum of votes)
    CANDIDATE --> FOLLOWER: Accepts leader's<br/>cluster state
    LEADER --> CANDIDATE: Loses followers / health check fails
    FOLLOWER --> CANDIDATE: Leader failure detected /<br/>term change
```
The Mutex Pattern
All coordination state is protected by a single mutex at Coordinator.java#L159:
```java
final Object mutex = new Object();
```
Every state transition method asserts that this mutex is held. This design choice — a single lock for all coordination — simplifies reasoning about concurrency at the cost of potential contention. But because all coordination work runs on the single-threaded CLUSTER_COORDINATION pool (as we saw in Part 2), this mutex rarely contends in practice.
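The "assert the mutex is held" pattern looks roughly like this. A minimal sketch, assuming invented names; the real Coordinator's transition methods follow the same shape but do far more work under the lock.

```java
// Sketch of the single-mutex pattern: transitions assert the caller holds
// the coordination mutex, so a forgotten synchronized block fails fast
// (when assertions are enabled) instead of silently racing.
class MutexPattern {
    final Object mutex = new Object();
    private String mode = "CANDIDATE";

    void becomeLeader() {
        assert Thread.holdsLock(mutex) : "coordination mutex must be held";
        mode = "LEADER";
    }

    String transitionUnderLock() {
        synchronized (mutex) {   // all coordination work takes this one lock
            becomeLeader();
            return mode;
        }
    }
}
```

Because the assertion uses `Thread.holdsLock`, it costs nothing in production (assertions disabled) while catching lock-discipline bugs in tests.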
State Transitions
The three transition methods at Coordinator.java#L979-L1080 handle mode changes:
becomeCandidate() activates PeerFinder to discover other nodes, starts the ClusterFormationFailureHelper to report problems, deactivates the leader and follower checkers, and clears any ongoing publication. If transitioning from LEADER, it cleans up the MasterService.
becomeLeader() sets mode to LEADER, deactivates PeerFinder (the leader doesn't search for peers — they come to it), closes any prevoting round, starts the LeaderHeartbeatService, and updates the FollowersChecker with the new term.
becomeFollower() stores the known leader reference, starts the LeaderChecker to monitor the leader's health, and deactivates election scheduling.
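A toy model captures which components each transition activates. The component names come from the article; the set-based bookkeeping below is purely illustrative, not the real implementation.

```java
// Toy model of the three transitions: each mode change flips which helper
// services are active (PeerFinder, LeaderChecker, FollowersChecker).
import java.util.HashSet;
import java.util.Set;

class CoordinatorSketch {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }
    Mode mode = Mode.CANDIDATE;
    final Set<String> active = new HashSet<>(Set.of("PeerFinder"));

    void becomeCandidate() {            // search for peers, stop both checkers
        mode = Mode.CANDIDATE;
        active.add("PeerFinder");
        active.remove("LeaderChecker");
        active.remove("FollowersChecker");
    }

    void becomeLeader() {               // peers come to the leader; check followers
        mode = Mode.LEADER;
        active.remove("PeerFinder");
        active.add("FollowersChecker");
    }

    void becomeFollower() {             // monitor the leader's health
        mode = Mode.FOLLOWER;
        active.remove("PeerFinder");
        active.add("LeaderChecker");
    }
}
```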
Tip: When debugging cluster formation issues, look at the `ClusterFormationFailureHelper` output. It runs only while the node is in `CANDIDATE` mode and produces detailed diagnostics about why the cluster can't form: missing seed hosts, incompatible versions, or insufficient master-eligible nodes.
Election Protocol: Pre-Voting and Discovery
Elasticsearch's election protocol differs from vanilla Raft in one crucial way: it adds a pre-voting phase to prevent unnecessary elections. In standard Raft, a node that can't reach the leader immediately increments its term and starts an election, which can be disruptive if the leader is actually healthy but temporarily unreachable from that one node.
```mermaid
sequenceDiagram
    participant A as Node A (Candidate)
    participant B as Node B
    participant C as Node C
    Note over A: Leader check fails → become CANDIDATE
    A->>B: PreVoteRequest (current term)
    A->>C: PreVoteRequest (current term)
    B-->>A: PreVoteResponse (would vote yes)
    C-->>A: PreVoteResponse (would vote yes)
    Note over A: Pre-vote quorum reached → start real election
    A->>A: Increment term
    A->>B: StartJoinRequest (new term)
    A->>C: StartJoinRequest (new term)
    B-->>A: Join (new term)
    C-->>A: Join (new term)
    Note over A: Quorum of joins → becomeLeader()
```
The PreVoteCollector gathers "would you vote for me?" responses without incrementing the term. Only if a quorum of pre-votes is collected does the node proceed to an actual election with term increment. This prevents a partitioned node from repeatedly bumping the cluster's term and forcing unnecessary re-elections.
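The term-protection property reduces to a small invariant, sketched below with invented helper names (the real logic lives in `PreVoteCollector` and the election scheduler):

```java
// Sketch of the pre-vote decision: the term is bumped only once a strict
// majority of master-eligible nodes says it would vote for this candidate,
// so a partitioned node cannot keep inflating the cluster's term.
class PreVoteSketch {
    static boolean quorumReached(int yesVotes, int masterEligibleNodes) {
        return yesVotes > masterEligibleNodes / 2;   // strict majority
    }

    // Returns the term after a pre-vote round: incremented only on quorum.
    static long maybeStartElection(long currentTerm, int yesVotes, int clusterSize) {
        return quorumReached(yesVotes, clusterSize) ? currentTerm + 1 : currentTerm;
    }
}
```

In a three-node cluster, a node that can reach only itself collects one pre-vote, never reaches the quorum of two, and therefore never disturbs the healthy majority's term.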
The discovery layer consists of several cooperating components:
- `PeerFinder` — discovers nodes using configured seed hosts, resolving addresses and exchanging handshakes
- `LeaderChecker` — followers periodically check if the leader is alive
- `FollowersChecker` — the leader periodically checks if followers are alive
- `LagDetector` — identifies followers that are slow to apply cluster state updates
ClusterService: Master Updates and State Application
The ClusterService is a composition of two components:
```java
public class ClusterService extends AbstractLifecycleComponent {
    private final MasterService masterService;
    private final ClusterApplierService clusterApplierService;
    // ...
}
```
```mermaid
flowchart TD
    TASK[ClusterStateUpdateTask] --> MS[MasterService<br/>Priority-based task queue]
    MS --> EXEC[ClusterStateTaskExecutor<br/>Batch processing]
    EXEC --> NEW[New ClusterState]
    NEW --> PUB[Publication<br/>Coordinator.publish]
    PUB --> CAS[ClusterApplierService<br/>Apply on all nodes]
    CAS --> APPLIERS[ClusterStateAppliers<br/>Notified BEFORE state visible]
    CAS --> LISTENERS[ClusterStateListeners<br/>Notified AFTER state visible]
```
MasterService runs only on the elected master. It maintains a priority-based queue of ClusterStateUpdateTask instances. Tasks that share the same ClusterStateTaskExecutor are batched together — this is how Elasticsearch efficiently processes multiple join requests or shard assignment changes in a single state update.
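The batching idea can be expressed generically. A minimal sketch with invented names: tasks that share an executor are drained as one batch and folded into a single new state, so the cluster publishes one update instead of one per task.

```java
// Sketch of batched task execution: the executor applies an entire batch of
// tasks against the current state, producing exactly one new state (and
// therefore one publication) for the whole batch.
import java.util.List;
import java.util.function.BiFunction;

class BatchingSketch {
    static <S, T> S executeBatch(S currentState, List<T> batch,
                                 BiFunction<S, T, S> executor) {
        S state = currentState;
        for (T task : batch) {
            state = executor.apply(state, task);  // fold each task into the state
        }
        return state;                             // published once for the batch
    }
}
```

This is why ten simultaneous node-join requests cost roughly one state update rather than ten.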
As the ClusterState Javadoc explains at lines 89-91: "Submitted tasks have an associated timeout. Tasks are processed in priority order, so a flood of higher-priority tasks can starve lower-priority ones from running."
ClusterApplierService runs on every node. After the master publishes a new state, the applier notifies two types of callbacks:
- `ClusterStateApplier` — notified before the state is visible via `ClusterService.state()`. Used by components that need to prepare for the new state.
- `ClusterStateListener` — notified after the state is visible. Used for reactive behavior.
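The before/after ordering is the whole point, and a tiny sketch makes it concrete. The class and field names here are invented; only the ordering guarantee comes from the article.

```java
// Sketch of the applier/listener ordering: appliers observe the OLD state
// via state(), then the new state becomes visible, then listeners observe
// the NEW state.
import java.util.ArrayList;
import java.util.List;

class ApplierSketch {
    private String visibleState = "v1";
    final List<String> log = new ArrayList<>();

    String state() { return visibleState; }   // what ClusterService.state() returns

    void applyNewState(String newState) {
        log.add("applier sees " + state());   // appliers: state() is still old
        visibleState = newState;              // the new state becomes visible
        log.add("listener sees " + state());  // listeners: state() is now new
    }
}
```

A component that must, say, open a network channel before other code can observe the new routing table would hook in as an applier; purely reactive code uses a listener.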
Multi-Project Architecture (Emerging)
A significant architectural evolution is underway: the multi-project model for serverless deployments. Throughout the codebase, you'll find @FixForMultiProject annotations tracking migration work. The key concepts:
- `ProjectId` — a unique identifier for a project (tenant)
- `ProjectMetadata` — per-project metadata extracted from the global `Metadata`
- `ProjectResolver` — resolves the current project context for operations
- `GlobalRoutingTable` — wraps per-project routing tables
In `NodeConstruction` at line 325, the `ProjectResolver` is loaded as a pluggable service:

```java
ProjectResolver projectResolver = constructor.pluginsService.loadSingletonServiceProvider(
    ProjectResolverFactory.class,
    () -> ProjectResolverFactory.DEFAULT
).create();
```
In the default single-project mode, this resolves to a simple pass-through. But for serverless deployments, a plugin provides a multi-project resolver that isolates tenants within a shared cluster.
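The contrast between the two modes can be sketched with a toy interface. Everything here is hypothetical, including the interface shape and the `tenant:` context convention; it only illustrates the pass-through-versus-multi-tenant split described above.

```java
// Sketch of a pluggable project resolver: the default resolves every request
// to one fixed project, while a multi-project implementation derives the
// project from request context (here, an invented "tenant:" prefix).
class ResolverSketch {
    interface ProjectResolver {
        String resolveProjectId(String requestContext);
    }

    // Default single-project mode: a simple pass-through to one fixed id.
    static final ProjectResolver DEFAULT = ctx -> "default";

    // A hypothetical multi-project resolver keyed on a tenant marker.
    static final ProjectResolver MULTI = ctx -> ctx.startsWith("tenant:")
            ? ctx.substring("tenant:".length())
            : "default";
}
```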
Tip: When you encounter `@FixForMultiProject` annotations in the code, they mark locations that still assume a single-project world and need updating. This is an active area of development: searching for this annotation shows the scope of the migration.
Where to Go Next
With the coordination layer understood, we can descend to the data plane. The next article, Part 4: Inside the Storage Engine, explores how Elasticsearch actually stores and retrieves data — the IndexShard lifecycle, the Engine abstraction wrapping Lucene, the Translog write-ahead log, and the sequence number system that makes efficient replication possible.