Inside the Storage Engine: IndexShard, Lucene Integration, and the Translog
Prerequisites
- Article 3: Cluster Coordination (understanding ClusterState and shard routing)
- Apache Lucene fundamentals (IndexWriter, DirectoryReader, Document/Field model)
- Write-ahead log concepts
When a document arrives at an Elasticsearch shard, it passes through a carefully layered storage stack designed to balance three competing demands: durability (never lose acknowledged data), performance (sub-second indexing), and searchability (near-real-time search). This article descends into that stack, from the IndexShard class that serves as the single entry point for all shard operations, through the Engine abstraction that wraps Lucene, to the Translog that bridges per-operation durability with Lucene's batch commit model.
IndexShard: The Single Entry Point
The IndexShard class is approximately 5,000 lines — one of the largest classes in the codebase. This isn't accidental; it's a deliberate design choice to route all shard operations through a single point that can enforce lifecycle constraints, concurrency limits, and operation permits.
Shard Lifecycle States
An IndexShard progresses through a strict state machine:
stateDiagram-v2
[*] --> CREATED: Shard allocated
CREATED --> RECOVERING: Recovery starts
RECOVERING --> POST_RECOVERY: Recovery complete
POST_RECOVERY --> STARTED: Shard activated
STARTED --> RELOCATED: Shard moving
RELOCATED --> [*]
STARTED --> CLOSED: Shard removed
CLOSED --> [*]
Operations check the current state before proceeding. An index operation against a shard in RECOVERING state will be rejected. A search against a CREATED shard will fail. This state machine is the first line of defense against executing operations at the wrong time.
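The gate is easy to picture in code. Here is a minimal, hypothetical sketch; the class and method names (ShardStateGate, verifyWriteAllowed) are illustrative, and the real IndexShard keeps an IndexShardState field that it checks against sets of allowed states:

```java
import java.util.EnumSet;

// Illustrative sketch of the state gate; not the actual IndexShard code.
public class ShardStateGate {
    enum ShardState { CREATED, RECOVERING, POST_RECOVERY, STARTED, RELOCATED, CLOSED }

    // Writes are only legal once the shard is fully started (sketch).
    private static final EnumSet<ShardState> WRITE_ALLOWED =
        EnumSet.of(ShardState.STARTED);
    // Reads are additionally allowed during the post-recovery window (sketch).
    private static final EnumSet<ShardState> READ_ALLOWED =
        EnumSet.of(ShardState.POST_RECOVERY, ShardState.STARTED);

    private volatile ShardState state = ShardState.CREATED;

    void moveTo(ShardState next) {
        this.state = next;
    }

    void verifyWriteAllowed() {
        if (!WRITE_ALLOWED.contains(state)) {
            throw new IllegalStateException("writes not allowed in state " + state);
        }
    }

    void verifyReadAllowed() {
        if (!READ_ALLOWED.contains(state)) {
            throw new IllegalStateException("reads not allowed in state " + state);
        }
    }
}
```

Every operation entry point performs a check like this before touching the engine, which is why a write against a RECOVERING shard fails fast instead of corrupting recovery.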
Primary vs. Replica Behavior
The same IndexShard class handles both primary and replica roles, but the code paths diverge significantly:
- Primary shards perform version checking, assign sequence numbers, and write to the Translog and Lucene. They then replicate to replicas.
- Replica shards receive pre-assigned sequence numbers from the primary and apply operations in order. They skip version checking (the primary already did it).
Both roles are guarded by operation permits. The shard's indexShardOperationPermits semaphore lets many operations, reads and writes alike, run concurrently, each holding one permit for its duration. During relocation handoff, all permits are drained so that no in-flight operation can straddle the transfer of primary responsibility to the new node.
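The permit pattern can be sketched with a counting semaphore: normal operations each take one permit, and the relocation handoff drains them all. This is a simplified sketch, not the actual IndexShardOperationPermits code:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Simplified sketch of the operation-permit pattern (names illustrative).
public class OperationPermits {
    private static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(TOTAL_PERMITS, true);

    // Every indexing/search operation holds one permit while it runs.
    public Releasable acquire() throws InterruptedException {
        semaphore.acquire();
        return () -> semaphore.release();
    }

    // Relocation handoff: drain ALL permits so no operation can be in
    // flight while primary context is transferred, then restore them.
    public void blockOperations(Runnable whileBlocked, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!semaphore.tryAcquire(TOTAL_PERMITS, timeout, unit)) {
            throw new IllegalStateException("timed out waiting for in-flight operations");
        }
        try {
            whileBlocked.run();
        } finally {
            semaphore.release(TOTAL_PERMITS);
        }
    }

    public interface Releasable extends AutoCloseable {
        @Override
        void close();
    }
}
```

The effectively unbounded permit count is the point: under normal operation nothing ever blocks, and the semaphore only bites when a handoff needs exclusive access.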
The Engine Abstraction and InternalEngine
The Engine is an abstract class that defines the storage contract. It has three implementations:
classDiagram
class Engine {
<<abstract>>
+index(Index) IndexResult
+delete(Delete) DeleteResult
+get(Get) GetResult
+acquireSearcher(String) Searcher
+refresh(String) RefreshResult
+flush() FlushResult
}
class InternalEngine {
-IndexWriter indexWriter
-Translog translog
-LiveVersionMap versionMap
-ExternalReaderManager externalReaderManager
-ElasticsearchReaderManager internalReaderManager
-LocalCheckpointTracker localCheckpointTracker
}
class ReadOnlyEngine {
}
class NoOpEngine {
}
note for ReadOnlyEngine "For searchable snapshots"
note for NoOpEngine "For closed indices"
Engine <|-- InternalEngine
Engine <|-- ReadOnlyEngine
Engine <|-- NoOpEngine
InternalEngine is the workhorse. Looking at its key fields at InternalEngine.java#L143-L184:
private final Translog translog;
private final IndexWriter indexWriter;
private final ExternalReaderManager externalReaderManager;
private final ElasticsearchReaderManager internalReaderManager;
private final LiveVersionMap versionMap;
private final LocalCheckpointTracker localCheckpointTracker;
The Dual Reader Manager Pattern
One of InternalEngine's most interesting design choices is maintaining two separate reader managers:
- externalReaderManager: serves search operations. Refreshed periodically (default: every 1 second) to provide near-real-time search.
- internalReaderManager: serves version lookups during indexing. Refreshed more aggressively so that version checks see the latest writes.
This separation means search refreshes don't block indexing version checks, and vice versa. The LiveVersionMap acts as a write-through cache for recently indexed document versions, allowing version checks to succeed without waiting for a Lucene refresh at all.
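A minimal sketch of that fast path, assuming a plain map keyed by document id (the real LiveVersionMap also records delete tombstones and archives entries across refreshes; lookupFromLucene here stands in for a term seek on the internal reader):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the version-check fast path, not the real
// LiveVersionMap implementation.
public class VersionLookup {
    private final Map<String, Long> liveVersionMap = new ConcurrentHashMap<>();

    // Called after a successful write: the new version becomes visible
    // to later version checks without waiting for a Lucene refresh.
    void putUnderLock(String id, long version) {
        liveVersionMap.put(id, version);
    }

    long resolveVersion(String id) {
        Long recent = liveVersionMap.get(id);  // fast path: recent writes
        if (recent != null) {
            return recent;
        }
        return lookupFromLucene(id);           // slow path: internal reader
    }

    long lookupFromLucene(String id) {
        return -1; // stand-in for "not found"; real code seeks the _id term
    }
}
```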
The Write Path: From Index Operation to Disk
Tracing a document index operation through InternalEngine reveals the careful ordering of durability and indexing steps:
flowchart TD
OP[Index Operation arrives] --> PERM[Acquire operation permit]
PERM --> VER{Version check<br/>via LiveVersionMap}
VER -->|Conflict| REJECT[Reject: VersionConflictException]
VER -->|OK| SEQNO[Assign sequence number<br/>via LocalCheckpointTracker]
SEQNO --> LUC[Write to Lucene IndexWriter<br/>in-memory buffer]
LUC --> TL[Append to Translog<br/>durability guarantee]
TL --> VMP[Update LiveVersionMap<br/>new version visible for checks]
VMP --> ACK[Acknowledge to client]
ACK -.->|Async, every 1s| REFRESH[Refresh: make searchable]
ACK -.->|Async, on flush| COMMIT[Lucene commit + Translog trim]
The critical insight is the ordering: the Lucene write happens before the Translog append. If the Lucene write fails (for example, an analysis error), nothing is appended to the Translog, so recovery can replay every Translog operation without re-hitting the failure. If the process crashes between the two steps, the operation exists only in Lucene's in-memory buffer and is lost, but it was never acknowledged, so no durability promise is broken.
The Lucene IndexWriter.addDocument() call only writes to an in-memory buffer. The document isn't searchable until a refresh writes a new segment and reopens the DirectoryReader. But it is durable as soon as the Translog append (and, with request durability, its fsync) completes, because recovery will replay it from the Translog into Lucene.
Tip: Understanding this write path explains many Elasticsearch behaviors. The 1-second "near-real-time" delay for search is the refresh interval. The _refresh API forces a reader reopen. The _flush API triggers a Lucene commit and allows Translog trimming. And index.translog.durability: request vs async controls whether each operation waits for an fsync.
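The durability setting boils down to whether the write path fsyncs the translog before acknowledging. A hedged sketch with illustrative names (with async durability, a background task fsyncs periodically instead; index.translog.sync_interval defaults to 5s):

```java
// Illustrative sketch of the durability trade-off; not the actual
// Translog API. With REQUEST, the ack implies the op survived a crash;
// with ASYNC, ops since the last timed fsync can be lost.
public class DurabilityPolicy {
    enum Durability { REQUEST, ASYNC }

    interface TranslogFile {
        void append(byte[] op);
        void fsync();
    }

    private final TranslogFile translog;
    private final Durability durability;

    DurabilityPolicy(TranslogFile translog, Durability durability) {
        this.translog = translog;
        this.durability = durability;
    }

    // Returns only after the operation is as durable as the policy requires.
    void appendAndMaybeSync(byte[] op) {
        translog.append(op);
        if (durability == Durability.REQUEST) {
            translog.fsync(); // pay an fsync on every acknowledged operation
        }
        // ASYNC: acknowledge now; a background task fsyncs on a timer
    }
}
```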
Translog: Write-Ahead Log for Durability
The Translog class provides per-operation durability that bridges the gap between Elasticsearch's "acknowledge after every operation" semantics and Lucene's batch-oriented commit model.
flowchart LR
subgraph Active Generation
TW[TranslogWriter<br/>translog-N.tlog]
CP[translog.ckp<br/>Checkpoint file]
end
subgraph Older Generations
TR1[TranslogReader<br/>translog-N-1.tlog]
CP1[translog-N-1.ckp]
TR2[TranslogReader<br/>translog-N-2.tlog]
end
OP1[Op 1] --> TW
OP2[Op 2] --> TW
OP3[Op 3] --> TW
TW --> CP
TR1 -.->|Retained until<br/>global checkpoint advances| GC[Trimmed after<br/>Lucene commit]
The Translog uses a generation-based file model. The current generation is an active TranslogWriter. When a generation threshold is reached (size-based, or on primary term change), the current writer is sealed as a read-only TranslogReader and a new writer opens.
Each generation has an associated checkpoint file (.ckp) that records the number of operations, the fsynced offset, and the minimum/maximum sequence numbers. The translog data is fsynced first, then the checkpoint is written and fsynced. The checkpoint therefore never describes data that isn't yet safely on disk, so recovery can always trust it to identify the last consistent state.
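That ordering can be made concrete. A simplified sketch of one sync step (the real TranslogWriter checkpoint carries more fields and a codec header; the field set here is a subset for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch of the data-then-checkpoint sync ordering.
public class TranslogSync {
    public static void syncUpTo(FileChannel tlog, Path ckpFile,
                                long numOps, long offset,
                                long minSeqNo, long maxSeqNo) throws IOException {
        // 1. Make the operation bytes durable first.
        tlog.force(false);
        // 2. Only then write + fsync the checkpoint. It can never claim
        //    more data than step 1 already made durable, so recovery can
        //    trust whatever the checkpoint says.
        ByteBuffer ckp = ByteBuffer.allocate(32);
        ckp.putLong(numOps).putLong(offset).putLong(minSeqNo).putLong(maxSeqNo);
        ckp.flip();
        try (FileChannel c = FileChannel.open(ckpFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (ckp.hasRemaining()) {
                c.write(ckp);
            }
            c.force(false);
        }
    }
}
```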
When Generations Get Trimmed
A translog generation can be deleted only after two conditions are met:
- All operations in that generation have been committed to a Lucene segment (via IndexWriter.commit())
- The global checkpoint has advanced past the generation's maximum sequence number
This retention policy is what enables efficient peer recovery — a recovering replica only needs to replay translog operations after its last known checkpoint, rather than copying entire Lucene segments.
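The two conditions translate into a simple predicate. Names here are illustrative, not the actual Translog API (the real deletion policy reads the minimum referenced generation from the Lucene commit's user data):

```java
// Illustrative sketch of the translog trimming rule.
public class TranslogTrimmer {
    static class Generation {
        final long gen;       // generation number
        final long maxSeqNo;  // highest seq# recorded in this file

        Generation(long gen, long maxSeqNo) {
            this.gen = gen;
            this.maxSeqNo = maxSeqNo;
        }
    }

    // minGenReferencedByLastCommit: everything below it has been
    // committed to Lucene segments (sketch of commit user data).
    static boolean canTrim(Generation g,
                           long minGenReferencedByLastCommit,
                           long globalCheckpoint) {
        boolean committedToLucene = g.gen < minGenReferencedByLastCommit;
        boolean everyCopyHasIt = g.maxSeqNo <= globalCheckpoint;
        return committedToLucene && everyCopyHasIt;
    }
}
```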
Sequence Numbers and the Global Checkpoint
The sequence number system is Elasticsearch's mechanism for ordering operations across the cluster and enabling efficient recovery. Every operation — index, delete, or no-op — receives a monotonically increasing sequence number assigned by the primary shard.
sequenceDiagram
participant P as Primary Shard
participant R1 as Replica 1
participant R2 as Replica 2
Note over P: Local checkpoint: 42
P->>P: Index doc (seq# 43)
P->>R1: Replicate (seq# 43)
P->>R2: Replicate (seq# 43)
R1-->>P: ACK seq# 43
Note over P: R1 local checkpoint: 43
R2-->>P: ACK seq# 43
Note over P: R2 local checkpoint: 43
Note over P: Global checkpoint: 42 → 43<br/>(all replicas caught up)
P->>R1: GlobalCheckpointSync(43)
P->>R2: GlobalCheckpointSync(43)
The local checkpoint on each shard tracks the highest contiguous sequence number that shard has completed. The global checkpoint is the minimum of all local checkpoints across all copies (primary + replicas) — it represents the highest sequence number that is guaranteed to be durable on every copy.
The LocalCheckpointTracker in InternalEngine manages the local checkpoint. It handles out-of-order completion — operations might complete in different orders than they were assigned, but the local checkpoint only advances when all operations up to that point are complete.
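The advancement rule is small enough to sketch. This is an illustrative version using a hash set; the real LocalCheckpointTracker uses fixed-size bit sets per window of sequence numbers, but advances the checkpoint the same way. The primary then takes the minimum of these local checkpoints across in-sync copies to compute the global checkpoint:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of out-of-order checkpoint tracking.
public class CheckpointTracker {
    private final AtomicLong nextSeqNo = new AtomicLong(0);
    private volatile long checkpoint = -1;  // highest contiguous completed seq#
    private final Set<Long> pending = ConcurrentHashMap.newKeySet();

    // The primary assigns sequence numbers monotonically.
    long generateSeqNo() {
        return nextSeqNo.getAndIncrement();
    }

    synchronized void markSeqNoAsProcessed(long seqNo) {
        pending.add(seqNo);
        // The checkpoint only advances across a contiguous run: a gap
        // (an operation still in flight) stops it.
        while (pending.remove(checkpoint + 1)) {
            checkpoint++;
        }
    }

    long getCheckpoint() {
        return checkpoint;
    }
}
```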
The global checkpoint enables three critical optimizations:
- Translog trimming — generations below the global checkpoint can be deleted after the next Lucene commit
- Peer recovery — a recovering replica only replays operations after the last known global checkpoint
- Soft deletes — Lucene can merge away deleted documents once their sequence numbers are below the global checkpoint
Tip: The sequence number system is why Elasticsearch recovery is typically fast. Rather than copying gigabytes of Lucene segments, a replica that falls behind only needs to replay the translog operations since its last checkpoint. Watch the seq_no and global_checkpoint values in the _recovery API output to understand where a recovery is starting from.
Where to Go Next
We've now explored both the coordination plane (Part 3) and the data plane (this article). In Part 5: The Request Pipeline, we'll connect these layers by tracing the complete lifecycle of a request from HTTP arrival through REST routing, transport dispatch, action execution patterns, and finally the distributed search pipeline.