Inside the Storage Engine: IndexShard, Lucene Integration, and the Translog

Advanced

Prerequisites

  • Article 3: Cluster Coordination (understanding ClusterState and shard routing)
  • Apache Lucene fundamentals (IndexWriter, DirectoryReader, Document/Field model)
  • Write-ahead log concepts

When a document arrives at an Elasticsearch shard, it passes through a carefully layered storage stack designed to balance three competing demands: durability (never lose acknowledged data), performance (sub-second indexing), and searchability (near-real-time search). This article descends into that stack, from the IndexShard class that serves as the single entry point for all shard operations, through the Engine abstraction that wraps Lucene, to the Translog that bridges per-operation durability with Lucene's batch commit model.

IndexShard: The Single Entry Point

The IndexShard class is approximately 5,000 lines — one of the largest classes in the codebase. This isn't accidental; it's a deliberate design choice to route all shard operations through a single point that can enforce lifecycle constraints, concurrency limits, and operation permits.

Shard Lifecycle States

An IndexShard progresses through a strict state machine:

stateDiagram-v2
    [*] --> CREATED: Shard allocated
    CREATED --> RECOVERING: Recovery starts
    RECOVERING --> POST_RECOVERY: Recovery complete
    POST_RECOVERY --> STARTED: Shard activated
    STARTED --> RELOCATED: Shard moving
    RELOCATED --> [*]
    STARTED --> CLOSED: Shard removed
    CLOSED --> [*]

Operations check the current state before proceeding. An index operation against a shard in RECOVERING state will be rejected. A search against a CREATED shard will fail. This state machine is the first line of defense against executing operations at the wrong time.
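The gating described above can be sketched as a check in front of every operation. This is a simplified illustration, not the real IndexShard API — the actual logic lives in methods like ensureWriteAllowed(), and the allowed-state sets are assumptions here:

```java
// Sketch of lifecycle gating, assuming a simplified state enum.
enum ShardState { CREATED, RECOVERING, POST_RECOVERY, STARTED, RELOCATED, CLOSED }

class ShardGate {
    private volatile ShardState state = ShardState.CREATED;

    void moveTo(ShardState next) { state = next; }

    // Writes are only legal once the shard is STARTED (recovery uses
    // its own privileged path to replay operations).
    void ensureWriteAllowed() {
        if (state != ShardState.STARTED) {
            throw new IllegalStateException("write rejected in state " + state);
        }
    }

    // A CREATED or RECOVERING shard has no consistent view to search yet.
    void ensureReadAllowed() {
        if (state != ShardState.STARTED && state != ShardState.POST_RECOVERY) {
            throw new IllegalStateException("read rejected in state " + state);
        }
    }
}
```

A rejected operation surfaces to the caller as an exception rather than silently queueing, which is what makes the state machine an effective first line of defense.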

Primary vs. Replica Behavior

The same IndexShard class handles both primary and replica roles, but the code paths diverge significantly:

  • Primary shards perform version checking, assign sequence numbers, and write to the Translog and Lucene. They then replicate to replicas.
  • Replica shards receive pre-assigned sequence numbers from the primary and apply operations in order. They skip version checking (the primary already did it).

This dual responsibility is managed through operation permits. The shard's IndexShardOperationPermits semaphore lets ordinary operations proceed concurrently, each holding a single permit. Exclusive tasks — a relocation handoff or a primary term bump — acquire all permits at once, which drains in-flight operations and blocks new ones so nothing is lost mid-transition.
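The permit model can be sketched with a plain java.util.concurrent.Semaphore. The names below are illustrative; the real implementation (IndexShardOperationPermits) adds async acquisition, timeouts, and queuing of blocked operations:

```java
import java.util.concurrent.Semaphore;

// Sketch: ordinary operations each take one permit; an exclusive task
// (e.g. relocation handoff) takes ALL permits, which waits for in-flight
// operations to finish and blocks new ones until it completes.
class OperationPermits {
    private static final int TOTAL = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(TOTAL, true);

    // One permit per operation; the returned Runnable releases it.
    Runnable acquire() throws InterruptedException {
        semaphore.acquire();
        return semaphore::release;
    }

    // Acquiring every permit drains the shard of in-flight operations.
    void blockOperations(Runnable exclusiveTask) throws InterruptedException {
        semaphore.acquire(TOTAL);
        try {
            exclusiveTask.run();   // runs with zero operations in flight
        } finally {
            semaphore.release(TOTAL);
        }
    }
}
```

Because the semaphore is fair, a pending blockOperations call also stops new operations from starving it.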

The Engine Abstraction and InternalEngine

The Engine is an abstract class that defines the storage contract. It has three implementations:

classDiagram
    class Engine {
        <<abstract>>
        +index(Index) IndexResult
        +delete(Delete) DeleteResult
        +get(Get) GetResult
        +acquireSearcher(String) Searcher
        +refresh(String) RefreshResult
        +flush() FlushResult
    }
    class InternalEngine {
        -IndexWriter indexWriter
        -Translog translog
        -LiveVersionMap versionMap
        -ExternalReaderManager externalReaderManager
        -ElasticsearchReaderManager internalReaderManager
        -LocalCheckpointTracker localCheckpointTracker
    }
    class ReadOnlyEngine
    class NoOpEngine
    note for ReadOnlyEngine "For searchable snapshots"
    note for NoOpEngine "For closed indices"
    Engine <|-- InternalEngine
    Engine <|-- ReadOnlyEngine
    Engine <|-- NoOpEngine

InternalEngine is the workhorse. Looking at its key fields at InternalEngine.java#L143-L184:

private final Translog translog;                                // per-operation write-ahead log
private final IndexWriter indexWriter;                          // Lucene writer: in-memory buffer + segments
private final ExternalReaderManager externalReaderManager;      // readers for search (NRT refresh)
private final ElasticsearchReaderManager internalReaderManager; // readers for version lookups
private final LiveVersionMap versionMap;                        // recent doc versions, skips refresh on checks
private final LocalCheckpointTracker localCheckpointTracker;    // highest contiguous completed seq#

The Dual Reader Manager Pattern

One of InternalEngine's most interesting design choices is maintaining two separate reader managers:

  • externalReaderManager — used for search operations. Refreshed periodically (default: every 1 second) to provide near-real-time search.
  • internalReaderManager — used for version lookups during indexing. Refreshed more aggressively to ensure version checks see the latest writes.

This separation means search refreshes don't block indexing version checks, and vice versa. The LiveVersionMap acts as a write-through cache for recently indexed document versions, allowing version checks to succeed without waiting for a Lucene refresh at all.
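The version-check shortcut can be sketched as a map-first lookup that only falls back to Lucene on a miss. The names below are illustrative placeholders, not the real LiveVersionMap API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Sketch of the write-through version cache: each successful write records
// its version keyed by doc id, so a version check on a recently written
// document never waits for a Lucene refresh.
class VersionCache {
    private final Map<String, Long> recent = new ConcurrentHashMap<>();

    // Called after each successful index operation.
    void put(String docId, long version) { recent.put(docId, version); }

    // Map first; Lucene (via the internal reader manager) only on a miss.
    long currentVersion(String docId, ToLongFunction<String> luceneLookup) {
        Long cached = recent.get(docId);
        return cached != null ? cached : luceneLookup.applyAsLong(docId);
    }

    // Once a refresh makes these versions visible in Lucene, the map
    // entries are redundant and can be pruned to bound memory.
    void pruneAfterRefresh() { recent.clear(); }
}
```

The pruning step is why the map stays small: it only ever holds versions written since the last internal refresh.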

The Write Path: From Index Operation to Disk

Tracing a document index operation through InternalEngine reveals the careful ordering of durability and indexing steps:

flowchart TD
    OP[Index Operation arrives] --> PERM[Acquire operation permit]
    PERM --> VER{Version check<br/>via LiveVersionMap}
    VER -->|Conflict| REJECT[Reject: VersionConflictException]
    VER -->|OK| SEQNO[Assign sequence number<br/>via LocalCheckpointTracker]
    SEQNO --> LUC[Write to Lucene IndexWriter<br/>in-memory buffer]
    LUC --> TL[Append to Translog<br/>durability guarantee]
    TL --> VMP[Update LiveVersionMap<br/>new version visible for checks]
    VMP --> ACK[Acknowledge to client]

    ACK -.->|Async, every 1s| REFRESH[Refresh: make searchable]
    ACK -.->|Async, on flush| COMMIT[Lucene commit + Translog trim]

The critical insight is the ordering: unlike a classic write-ahead log, the Lucene write happens before the Translog append. If the Lucene write fails — say, on a mapping or analysis error — the operation never reaches the Translog, so recovery can replay every Translog entry without re-validating it. If the process crashes after the Translog append but before the next Lucene commit, the operation is replayed from the Translog during recovery.

The Lucene IndexWriter.addDocument() call only writes to an in-memory buffer. The document isn't searchable until a refresh writes that buffer to a new segment and reopens the DirectoryReader. But it's durable as soon as the Translog append is fsynced.
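In current Elasticsearch versions the engine indexes into Lucene first and appends to the translog second, so a document that Lucene rejects never pollutes the translog. A minimal sketch of that ordering, with all names as illustrative placeholders rather than the InternalEngine API:

```java
// Sketch of the per-operation ordering in the engine's index path:
// Lucene first (a malformed document fails here and never becomes a
// translog entry), then the translog append that makes it durable.
class WriteSketch {
    interface Lucene { void addDocument(String doc) throws Exception; }
    interface Translog { long append(String doc, long seqNo); }

    static long index(String doc, long seqNo, Lucene lucene, Translog translog) throws Exception {
        lucene.addDocument(doc);             // in-memory buffer only; may reject bad input
        return translog.append(doc, seqNo);  // durable once fsynced; replayed on recovery
    }
}
```

With this ordering, recovery can replay the translog blindly: everything in it already passed Lucene's validation once.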

Tip: Understanding this write path explains many Elasticsearch behaviors. The 1-second "near-real-time" delay for search is the refresh interval. The _refresh API forces a reader reopen. The _flush API triggers a Lucene commit and allows Translog trimming. And index.translog.durability: request vs async controls whether each operation waits for an fsync.

Translog: Write-Ahead Log for Durability

The Translog class provides per-operation durability that bridges the gap between Elasticsearch's "acknowledge after every operation" semantics and Lucene's batch-oriented commit model.

flowchart LR
    subgraph Active Generation
        TW[TranslogWriter<br/>translog-N.tlog]
        CP[translog.ckp<br/>Checkpoint file]
    end

    subgraph Older Generations
        TR1[TranslogReader<br/>translog-N-1.tlog]
        CP1[translog-N-1.ckp]
        TR2[TranslogReader<br/>translog-N-2.tlog]
    end

    OP1[Op 1] --> TW
    OP2[Op 2] --> TW
    OP3[Op 3] --> TW
    TW --> CP

    TR1 -.->|Retained until<br/>global checkpoint advances| GC[Trimmed after<br/>Lucene commit]

The Translog uses a generation-based file model. The current generation is an active TranslogWriter. When a generation threshold is reached (size-based, or on primary term change), the current writer is sealed as a read-only TranslogReader and a new writer opens.

Each generation has an associated checkpoint file (.ckp) that records the number of operations, the fsynced offset, and the minimum/maximum sequence numbers. The translog data is fsynced first, then the checkpoint — the checkpoint never describes bytes that aren't yet durable, so recovery can always trust it to identify the last consistent state.
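A sketch of that sync ordering, data before checkpoint (field names mirror the description above, not the exact on-disk format):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the data file is forced to disk before the checkpoint that
// describes it is written, so the checkpoint's offset is always durable.
class CheckpointSketch {
    record Checkpoint(long offset, int numOps, long minSeqNo, long maxSeqNo) {}

    static final List<String> ioLog = new ArrayList<>();  // records I/O order

    static Checkpoint sync(long offset, int numOps, long minSeqNo, long maxSeqNo) {
        ioLog.add("fsync translog-N.tlog up to " + offset);  // data durable first
        Checkpoint cp = new Checkpoint(offset, numOps, minSeqNo, maxSeqNo);
        ioLog.add("fsync translog.ckp");                     // then the checkpoint
        return cp;
    }
}
```

If a crash lands between the two fsyncs, the old checkpoint still points at a fully durable prefix of the file — the trailing unsynced operations are simply ignored on recovery.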

When Generations Get Trimmed

A translog generation can be deleted only after two conditions are met:

  1. All operations in that generation have been committed to a Lucene segment (via IndexWriter.commit())
  2. The global checkpoint has advanced past the generation's maximum sequence number

This retention policy is what enables efficient peer recovery — a recovering replica only needs to replay translog operations after its last known checkpoint, rather than copying entire Lucene segments.
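The two trim conditions reduce to a pair of sequence-number comparisons. A minimal sketch (the names are illustrative, not the actual TranslogDeletionPolicy API):

```java
// Sketch: a generation may be deleted only once (1) its operations are in
// a committed Lucene segment AND (2) the global checkpoint has passed its
// highest sequence number, meaning no replica could still need it.
class TrimSketch {
    static boolean canTrim(long generationMaxSeqNo,
                           long lastCommittedSeqNo,
                           long globalCheckpoint) {
        boolean committedToLucene = generationMaxSeqNo <= lastCommittedSeqNo;
        boolean safeOnAllCopies   = generationMaxSeqNo <= globalCheckpoint;
        return committedToLucene && safeOnAllCopies;
    }
}
```

Either condition alone is insufficient: a committed-but-unreplicated generation may still be needed for peer recovery, and a replicated-but-uncommitted one is still the only durable copy of those operations.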

Sequence Numbers and the Global Checkpoint

The sequence number system is Elasticsearch's mechanism for ordering operations across the cluster and enabling efficient recovery. Every operation — index, delete, or no-op — receives a monotonically increasing sequence number assigned by the primary shard.

sequenceDiagram
    participant P as Primary Shard
    participant R1 as Replica 1
    participant R2 as Replica 2

    Note over P: Local checkpoint: 42
    P->>P: Index doc (seq# 43)
    P->>R1: Replicate (seq# 43)
    P->>R2: Replicate (seq# 43)
    R1-->>P: ACK seq# 43
    Note over P: R1 local checkpoint: 43
    R2-->>P: ACK seq# 43
    Note over P: R2 local checkpoint: 43
    Note over P: Global checkpoint: 42 → 43<br/>(all replicas caught up)
    P->>R1: GlobalCheckpointSync(43)
    P->>R2: GlobalCheckpointSync(43)

The local checkpoint on each shard tracks the highest contiguous sequence number that shard has completed. The global checkpoint is the minimum of all local checkpoints across all copies (primary + replicas) — it represents the highest sequence number that is guaranteed to be durable on every copy.

The LocalCheckpointTracker in InternalEngine manages the local checkpoint. It handles out-of-order completion — operations might complete in different orders than they were assigned, but the local checkpoint only advances when all operations up to that point are complete.
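The advance-only-over-a-contiguous-run rule can be captured in a few lines. This compact sketch shows the rule, not the implementation — the real LocalCheckpointTracker uses windowed bitsets and distinguishes processed from persisted sequence numbers:

```java
import java.util.BitSet;

// Sketch of out-of-order checkpoint tracking: operations complete in any
// order, but the checkpoint only advances across a contiguous run of
// completed sequence numbers.
class CheckpointTracker {
    private final BitSet processed = new BitSet();
    private long checkpoint;  // highest contiguous completed seq#

    CheckpointTracker(long initialCheckpoint) { this.checkpoint = initialCheckpoint; }

    synchronized void markProcessed(long seqNo) {
        if (seqNo <= checkpoint) return;            // already counted
        processed.set((int) seqNo);                 // (int) cast for sketch brevity
        while (processed.get((int) (checkpoint + 1))) {
            checkpoint++;                           // advance over the contiguous run
            processed.clear((int) checkpoint);
        }
    }

    synchronized long getCheckpoint() { return checkpoint; }
}
```

The global checkpoint then falls out as the minimum of these local checkpoints across the primary and every in-sync replica.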

The global checkpoint enables three critical optimizations:

  1. Translog trimming — generations below the global checkpoint can be deleted after the next Lucene commit
  2. Peer recovery — a recovering replica only replays operations after the last known global checkpoint
  3. Soft deletes — Lucene can merge away deleted documents once their sequence numbers are below the global checkpoint

Tip: The sequence number system is why Elasticsearch recovery is typically fast. Rather than copying gigabytes of Lucene segments, a replica that falls behind only needs to replay the translog operations since its last checkpoint. Watch the seq_no and global_checkpoint values in the _recovery API output to understand where a recovery is starting from.

Where to Go Next

We've now explored both the coordination plane (Part 3) and the data plane (this article). In Part 5: The Request Pipeline, we'll connect these layers by tracing the complete lifecycle of a request from HTTP arrival through REST routing, transport dispatch, action execution patterns, and finally the distributed search pipeline.