Cluster Coordination: Master Election, State Management, and the Raft-Like Consensus

Advanced

Prerequisites

  • Article 2: Bootstrap and Startup
  • Distributed systems fundamentals (consensus, leader election)
  • Understanding of Raft protocol basics

In a distributed system, the hardest problem isn't storing data — it's ensuring every node agrees on what the data looks like. Elasticsearch solves this with a coordination layer built around two key ideas: an immutable ClusterState that represents the agreed-upon truth, and a Raft-inspired consensus protocol that ensures only one master can update that state at a time. This article dissects both mechanisms, from the ClusterState data structure through the Coordinator's three operating modes and the election protocol that prevents split-brain scenarios.

ClusterState: The Immutable Single Source of Truth

The ClusterState is the central data structure of the coordination layer. As its extensive Javadoc explains, it is "held in memory on all nodes in the cluster with updates coordinated by the elected master."

classDiagram
    class ClusterState {
        +long version
        +String stateUUID
        +ClusterName clusterName
        +Metadata metadata
        +DiscoveryNodes nodes
        +GlobalRoutingTable routingTable
        +ClusterBlocks blocks
        +CompatibilityVersions compatibilityVersions
        +Map~String,Custom~ customs
    }
    class Metadata {
        +Map indices
        +Map templates
        +Settings persistentSettings
        +CoordinationMetadata coordination
    }
    class DiscoveryNodes {
        +Map~String,DiscoveryNode~ nodes
        +String masterNodeId
        +String localNodeId
    }
    class GlobalRoutingTable {
        +Map~ProjectId,RoutingTable~ routingTables
    }
    ClusterState --> Metadata
    ClusterState --> DiscoveryNodes
    ClusterState --> GlobalRoutingTable
    ClusterState --> ClusterBlocks

The distinction between persistent and ephemeral state is subtle but critical. As stated in the source at ClusterState.java#L82-L84:

"The Metadata portion is written to disk on each update so it persists across full-cluster restarts. The rest of this data is maintained only in-memory and resets back to its initial state on a full-cluster restart, but it is held on all nodes so it persists across master elections."

This means Metadata (index definitions, templates, settings) survives a full cluster restart. But DiscoveryNodes (which nodes are in the cluster) and RoutingTable (which shards are on which nodes) are rebuilt from scratch when the cluster reforms — they persist across rolling restarts because at least one node always remembers them, but not across a full stop.
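The persistence split can be illustrated with a toy model (these are not the real Elasticsearch classes, just a sketch of the behavior): only the metadata portion round-trips through disk, while node membership and routing start empty after a full restart.

```java
// Toy model of which parts of the cluster state survive a full-cluster
// restart: Metadata is persisted on each update, DiscoveryNodes and the
// routing table are in-memory only and are rebuilt when the cluster reforms.
import java.util.HashMap;
import java.util.Map;

public class StatePersistenceDemo {
    record ToyState(Map<String, String> metadata,        // persisted to disk
                    Map<String, String> nodes,           // in-memory only
                    Map<String, String> routingTable) {} // in-memory only

    // Simulates writing state on each update: only metadata is stored.
    static Map<String, String> persistToDisk(ToyState state) {
        return new HashMap<>(state.metadata());
    }

    // Simulates a full-cluster restart: metadata is recovered from disk;
    // node membership and shard routing start empty and are rediscovered.
    static ToyState recoverAfterFullRestart(Map<String, String> disk) {
        return new ToyState(new HashMap<>(disk), new HashMap<>(), new HashMap<>());
    }

    public static void main(String[] args) {
        ToyState running = new ToyState(
            Map.of("index.logs", "3 shards"),
            Map.of("node-1", "master"),
            Map.of("logs[0]", "node-1"));
        Map<String, String> disk = persistToDisk(running);
        ToyState recovered = recoverAfterFullRestart(disk);
        System.out.println(recovered.metadata().containsKey("index.logs")); // true
        System.out.println(recovered.nodes().isEmpty());                    // true
    }
}
```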

Diff-Based Publication

Cluster state updates are published using diffs rather than full snapshots. As described in the Javadoc at ClusterState.java#L96-L99, "Publication usually works by sending a diff, computed via the Diffable interface, rather than the full state, although it will fall back to sending the full state if the receiving node is new." This is essential for large clusters where the full state can be many megabytes — sending only changed parts keeps publication fast.
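The decision logic can be sketched as follows (hypothetical types, not the real Diffable interface): a diff is only valid on top of the exact state version it was computed against, so a new node, or one that missed an update, must receive the full state.

```java
// Minimal sketch of the diff-vs-full publication decision.
import java.util.Optional;

public class PublicationDemo {
    enum Payload { DIFF, FULL_STATE }

    // lastAckedVersion is empty for a node that has never received a state.
    static Payload choosePayload(Optional<Long> lastAckedVersion, long diffBaseVersion) {
        // A diff applies only on top of the version it was computed from;
        // anyone else falls back to the full state.
        if (lastAckedVersion.isPresent() && lastAckedVersion.get() == diffBaseVersion) {
            return Payload.DIFF;
        }
        return Payload.FULL_STATE;
    }

    public static void main(String[] args) {
        System.out.println(choosePayload(Optional.of(41L), 41L)); // DIFF
        System.out.println(choosePayload(Optional.empty(), 41L)); // FULL_STATE
    }
}
```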

The Coordinator: Three Modes and State Transitions

The Coordinator is the heart of Elasticsearch's consensus implementation. It operates in one of three modes, defined at Coordinator.java#L1841-L1845:

public enum Mode {
    CANDIDATE,
    LEADER,
    FOLLOWER
}

stateDiagram-v2
    [*] --> CANDIDATE: Node starts
    CANDIDATE --> LEADER: Wins election<br/>(receives quorum of votes)
    CANDIDATE --> FOLLOWER: Accepts leader's<br/>cluster state
    LEADER --> CANDIDATE: Loses followers / health check fails
    FOLLOWER --> CANDIDATE: Leader failure detected /<br/>term change

The Mutex Pattern

All coordination state is protected by a single mutex at Coordinator.java#L159:

final Object mutex = new Object();

Every state transition method asserts that this mutex is held. This design choice — a single lock for all coordination — simplifies reasoning about concurrency at the cost of potential contention. But because all coordination work runs on the single-threaded CLUSTER_COORDINATION pool (as we saw in Part 2), contention on this mutex is rare in practice.
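The pattern looks roughly like this (names are illustrative, not the real Coordinator fields): transition methods assert the lock is held, so with assertions enabled (-ea) any call path that forgets to synchronize fails fast.

```java
// Sketch of the single-mutex pattern used to guard coordination state.
public class MutexPatternDemo {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    final Object mutex = new Object();
    private Mode mode = Mode.CANDIDATE;

    void becomeFollower() {
        // Fails fast under -ea if called without the coordination mutex.
        assert Thread.holdsLock(mutex) : "transitions require the coordination mutex";
        mode = Mode.FOLLOWER;
    }

    Mode transitionSafely() {
        synchronized (mutex) {  // all coordination work funnels through here
            becomeFollower();
            return mode;
        }
    }

    public static void main(String[] args) {
        System.out.println(new MutexPatternDemo().transitionSafely()); // FOLLOWER
    }
}
```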

State Transitions

The three transition methods at Coordinator.java#L979-L1080 handle mode changes:

becomeCandidate() activates PeerFinder to discover other nodes, starts the ClusterFormationFailureHelper to report problems, deactivates the leader and follower checkers, and clears any ongoing publication. If transitioning from LEADER, it cleans up the MasterService.

becomeLeader() sets mode to LEADER, deactivates PeerFinder (the leader doesn't search for peers — they come to it), closes any prevoting round, starts the LeaderHeartbeatService, and updates the FollowersChecker with the new term.

becomeFollower() stores the known leader reference, starts the LeaderChecker to monitor the leader's health, and deactivates election scheduling.

Tip: When debugging cluster formation issues, look at the ClusterFormationFailureHelper output. It runs only while the node is in CANDIDATE mode and produces detailed diagnostics about why the cluster can't form — missing seed hosts, incompatible versions, or insufficient master-eligible nodes.

Election Protocol: Pre-Voting and Discovery

Elasticsearch's election protocol differs from vanilla Raft in one crucial way: it adds a pre-voting phase to prevent unnecessary elections. In standard Raft, a node that can't reach the leader immediately increments its term and starts an election, which can be disruptive if the leader is actually healthy but temporarily unreachable from that one node.

sequenceDiagram
    participant A as Node A (Candidate)
    participant B as Node B
    participant C as Node C

    Note over A: Leader check fails → become CANDIDATE

    A->>B: PreVoteRequest (current term)
    A->>C: PreVoteRequest (current term)
    B-->>A: PreVoteResponse (would vote yes)
    C-->>A: PreVoteResponse (would vote yes)

    Note over A: Pre-vote quorum reached → start real election

    A->>A: Increment term
    A->>B: StartJoinRequest (new term)
    A->>C: StartJoinRequest (new term)
    B-->>A: Join (new term)
    C-->>A: Join (new term)

    Note over A: Quorum of joins → becomeLeader()

The PreVoteCollector gathers "would you vote for me?" responses without incrementing the term. Only if a quorum of pre-votes is collected does the node proceed to an actual election with term increment. This prevents a partitioned node from repeatedly bumping the cluster's term and forcing unnecessary re-elections.
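The quorum arithmetic behind this is simple majority counting; a minimal sketch (simplified, not the real PreVoteCollector logic):

```java
// Sketch of the pre-vote tally: start a real election (and only then
// increment the term) once a majority of master-eligible nodes would vote yes.
public class PreVoteDemo {
    // Quorum is a strict majority of the master-eligible nodes.
    static boolean hasPreVoteQuorum(int yesResponses, int masterEligibleNodes) {
        return yesResponses > masterEligibleNodes / 2;
    }

    public static void main(String[] args) {
        // With 3 master-eligible nodes, two "yes" responses reach quorum;
        // a partitioned node that hears only itself does not, so the
        // cluster's term is never bumped.
        System.out.println(hasPreVoteQuorum(2, 3)); // true  -> start election
        System.out.println(hasPreVoteQuorum(1, 3)); // false -> term untouched
    }
}
```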

The discovery layer consists of several cooperating components:

  • PeerFinder — discovers nodes using configured seed hosts, resolving addresses and exchanging handshakes
  • LeaderChecker — followers periodically check if the leader is alive
  • FollowersChecker — the leader periodically checks if followers are alive
  • LagDetector — identifies followers that are slow to apply cluster state updates

ClusterService: Master Updates and State Application

The ClusterService is a composition of two components:

public class ClusterService extends AbstractLifecycleComponent {
    private final MasterService masterService;
    private final ClusterApplierService clusterApplierService;
    // ...
}

flowchart TD
    TASK[ClusterStateUpdateTask] --> MS[MasterService<br/>Priority-based task queue]
    MS --> EXEC[ClusterStateTaskExecutor<br/>Batch processing]
    EXEC --> NEW[New ClusterState]
    NEW --> PUB[Publication<br/>Coordinator.publish]
    PUB --> CAS[ClusterApplierService<br/>Apply on all nodes]
    CAS --> APPLIERS[ClusterStateAppliers<br/>Notified BEFORE state visible]
    CAS --> LISTENERS[ClusterStateListeners<br/>Notified AFTER state visible]

MasterService runs only on the elected master. It maintains a priority-based queue of ClusterStateUpdateTask instances. Tasks that share the same ClusterStateTaskExecutor are batched together — this is how Elasticsearch efficiently processes multiple join requests or shard assignment changes in a single state update.
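The payoff of batching is that a burst of tasks produces one state update, not one per task. A simplified sketch (hypothetical types, not the real ClusterStateTaskExecutor API):

```java
// Sketch of task batching: all join tasks queued against the same executor
// are drained together and folded into a single new state version.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class BatchingDemo {
    record State(long version, List<String> nodes) {}

    // One pass over the queue, one version bump for the whole batch.
    static State executeBatch(State current, Queue<String> joinTasks) {
        List<String> nodes = new ArrayList<>(current.nodes());
        while (!joinTasks.isEmpty()) nodes.add(joinTasks.poll());
        return new State(current.version() + 1, nodes);
    }

    public static void main(String[] args) {
        Queue<String> joins = new ArrayDeque<>(List.of("node-2", "node-3", "node-4"));
        State next = executeBatch(new State(10, List.of("node-1")), joins);
        System.out.println(next.version());      // 11, not 13
        System.out.println(next.nodes().size()); // 4
    }
}
```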

As the ClusterState Javadoc explains at ClusterState.java#L89-L91: "Submitted tasks have an associated timeout. Tasks are processed in priority order, so a flood of higher-priority tasks can starve lower-priority ones from running."

ClusterApplierService runs on every node. After the master publishes a new state, the applier notifies two types of callbacks:

  1. ClusterStateApplier — notified before the state is visible via ClusterService.state(). Used by components that need to prepare for the new state.
  2. ClusterStateListener — notified after the state is visible. Used for reactive behavior.
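The before/after ordering can be sketched like this (simplified, hypothetical types): appliers observe the old state via state() while they prepare, and only after the visibility flip do listeners see the new one.

```java
// Sketch of the two-phase notification order in a cluster applier:
// appliers run before the new state becomes visible, listeners after.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ApplierServiceDemo {
    private volatile long visibleVersion = 0;  // what state() currently returns
    final List<Consumer<Long>> appliers = new ArrayList<>();
    final List<Consumer<Long>> listeners = new ArrayList<>();

    long state() { return visibleVersion; }

    void applyNewState(long newVersion) {
        for (Consumer<Long> a : appliers) a.accept(newVersion); // state() still old
        visibleVersion = newVersion;                            // flip visibility
        for (Consumer<Long> l : listeners) l.accept(newVersion); // state() now new
    }

    public static void main(String[] args) {
        ApplierServiceDemo svc = new ApplierServiceDemo();
        List<String> log = new ArrayList<>();
        svc.appliers.add(v -> log.add("applier sees state()=" + svc.state()));
        svc.listeners.add(v -> log.add("listener sees state()=" + svc.state()));
        svc.applyNewState(7);
        System.out.println(log); // [applier sees state()=0, listener sees state()=7]
    }
}
```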

Multi-Project Architecture (Emerging)

A significant architectural evolution is underway: the multi-project model for serverless deployments. Throughout the codebase, you'll find @FixForMultiProject annotations tracking migration work. The key concepts:

  • ProjectId — a unique identifier for a project (tenant)
  • ProjectMetadata — per-project metadata extracted from the global Metadata
  • ProjectResolver — resolves the current project context for operations
  • GlobalRoutingTable — wraps per-project routing tables

In NodeConstruction at line 325, the ProjectResolver is loaded as a pluggable service:

ProjectResolver projectResolver = constructor.pluginsService.loadSingletonServiceProvider(
    ProjectResolverFactory.class,
    () -> ProjectResolverFactory.DEFAULT
).create();

In the default single-project mode, this resolves to a simple pass-through. But for serverless deployments, a plugin provides a multi-project resolver that isolates tenants within a shared cluster.
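The fallback pattern can be sketched as follows (interface and names are illustrative, not the real ProjectResolver API): use a plugin-provided resolver if one is registered, otherwise fall back to a single-project pass-through.

```java
// Hypothetical sketch of a pluggable resolver with a single-project default.
import java.util.function.Supplier;

public class ProjectResolverDemo {
    interface Resolver { String currentProjectId(); }

    // Default single-project mode: everything belongs to one implicit project.
    static final Resolver DEFAULT = () -> "default";

    // Mimics a singleton service-provider lookup: prefer the plugin's
    // resolver, fall back to the default when no plugin supplies one.
    static Resolver load(Resolver pluginProvided, Supplier<Resolver> fallback) {
        return pluginProvided != null ? pluginProvided : fallback.get();
    }

    public static void main(String[] args) {
        Resolver single = load(null, () -> DEFAULT);
        System.out.println(single.currentProjectId()); // default
        Resolver multi = load(() -> "tenant-42", () -> DEFAULT);
        System.out.println(multi.currentProjectId()); // tenant-42
    }
}
```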

Tip: When you encounter @FixForMultiProject annotations in the code, they mark locations that still assume a single-project world and need updating. This is an active area of development — searching for this annotation shows the scope of the migration.

Where to Go Next

With the coordination layer understood, we can descend to the data plane. The next article, Part 4: Inside the Storage Engine, explores how Elasticsearch actually stores and retrieves data — the IndexShard lifecycle, the Engine abstraction wrapping Lucene, the Translog write-ahead log, and the sequence number system that makes efficient replication possible.