Inside the Agentic Loop: How Gemini CLI Processes a Prompt

Advanced

Prerequisites

  • Article 1: Architecture and Navigation Guide
  • TypeScript async generators and iterators
  • Basic understanding of LLM tool calling and streaming

When you type a prompt into Gemini CLI, it enters a loop that can span dozens of turns — sending requests to the model, receiving streaming responses, dispatching tool calls, feeding results back, and repeating until the task is complete or the model signals it's done. This isn't a simple request-response cycle; it's a three-layer architecture with streaming events, retry logic, compression, loop detection, and hook integration at every stage.

This article traces the complete lifecycle of a user prompt through GeminiClient, Turn, and GeminiChat.

GeminiClient: The Outer Loop Orchestrator

The GeminiClient class is the entry point for all LLM interactions. It receives an AgentLoopContext (as we covered in Article 1) and manages the conversation lifecycle including loop detection, chat compression, tool output masking, and hook firing.

The core method is sendMessageStream() at line 881. It's an AsyncGenerator<ServerGeminiStreamEvent, Turn> — meaning it yields streaming events to callers while internally managing multiple model turns.
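The `AsyncGenerator<Yield, Return>` shape is worth pausing on: the caller receives streamed events through `yield`, while the completed `Turn` arrives as the generator's *return value*. A minimal sketch of that pattern (event and `Turn` shapes are simplified assumptions, not the real types):

```typescript
// Simplified stand-ins for the real event and Turn types.
type StreamEvent = { type: string; value?: unknown };
type Turn = { pendingToolCalls: unknown[] };

// A generator that yields streaming events and returns the finished Turn.
async function* sendMessageStream(): AsyncGenerator<StreamEvent, Turn> {
  yield { type: "content", value: "Hello" };
  yield { type: "finished" };
  return { pendingToolCalls: [] }; // the Turn is the generator's return value
}

// A caller that consumes events AND captures the return value. A plain
// `for await...of` loop would discard the Turn, so we drive next() manually.
async function consume(): Promise<Turn> {
  const stream = sendMessageStream();
  while (true) {
    const { value, done } = await stream.next();
    if (done) return value as Turn; // return value, not a yielded event
    // handle the streamed event in `value` here
  }
}
```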

sequenceDiagram
    participant UI as UI Layer
    participant GC as GeminiClient
    participant T as Turn
    participant Chat as GeminiChat
    participant API as Gemini API
    
    UI->>GC: sendMessageStream(request, signal, prompt_id)
    GC->>GC: fireBeforeAgentHook
    GC->>GC: processTurn()
    GC->>T: new Turn(chat, prompt_id)
    T->>Chat: sendMessageStream()
    Chat->>API: generateContentStream()
    API-->>Chat: streaming chunks
    Chat-->>T: StreamEvent yields
    T-->>GC: ServerGeminiStreamEvent yields
    GC-->>UI: ServerGeminiStreamEvent yields
    GC->>GC: fireAfterAgentHook

Here's what happens in sendMessageStream():

  1. Hook state management — It fires BeforeAgent hooks and checks if execution should be stopped or blocked
  2. Turn processing — It delegates to processTurn(), which creates a Turn and manages model routing
  3. Loop detection — Events from the turn are checked against the LoopDetectionService
  4. Post-turn hooks — AfterAgent hooks can stop execution, clear context, or trigger continuation turns
  5. State cleanup — Hook state is cleaned up in a finally block

The method is recursive — if an AfterAgent hook blocks execution with a reason, sendMessageStream calls itself with the block reason as a new prompt and stopHookActive: true.
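This self-recursion pattern can be sketched with `yield*` delegation, which re-streams the continuation turn's events through the original generator (all names and the hard-coded block reason below are illustrative, not the real implementation):

```typescript
type LoopEvent = { type: "content" | "blocked"; text: string };

async function* sendMessageStream(
  prompt: string,
  stopHookActive = false,
): AsyncGenerator<LoopEvent> {
  yield { type: "content", text: `response to: ${prompt}` };
  // Pretend an AfterAgent hook blocks the first pass with a reason.
  if (!stopHookActive) {
    const reason = "please also summarize";
    yield { type: "blocked", text: reason };
    // Recurse with the block reason as the new prompt; stopHookActive=true
    // prevents the continuation turn from recursing forever.
    yield* sendMessageStream(reason, true);
  }
}
```

The caller sees one uninterrupted event stream; the continuation turn is invisible except for the `blocked` event it was triggered by.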

Turn: The AsyncGenerator Heart

Each call to the model within the agentic loop is wrapped in a Turn instance. A Turn is responsible for converting raw API streaming responses into typed ServerGeminiStreamEvent values.

The run() method at line 253 is itself an AsyncGenerator:

async *run(
    modelConfigKey: ModelConfigKey,
    req: PartListUnion,
    signal: AbortSignal,
    displayContent?: PartListUnion,
    role: LlmRole = LlmRole.MAIN,
): AsyncGenerator<ServerGeminiStreamEvent>

It iterates over the GeminiChat.sendMessageStream() response, pattern-matching on stream event types:

  • retry events → yielded as GeminiEventType.Retry so the UI can discard partial content
  • agent_execution_stopped/blocked → yielded as hook-driven control events
  • chunk events → parsed for thoughts, content text, tool calls, citations, and finish reasons

The Turn accumulates pendingToolCalls — function calls from the model that need to be dispatched via the Scheduler. These are extracted from functionCall parts in the streaming response.
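Extracting those calls amounts to filtering `functionCall` parts out of the streamed content. A sketch under assumed part shapes (the real `Part` type comes from `@google/genai` and carries more fields):

```typescript
// Assumed, reduced shapes for streamed parts and tool-call requests.
interface Part {
  text?: string;
  functionCall?: { name: string; args: Record<string, unknown> };
}
interface ToolCallRequest { name: string; args: Record<string, unknown> }

// Collect every functionCall part into the turn's pending tool calls.
function extractToolCalls(parts: Part[]): ToolCallRequest[] {
  const calls: ToolCallRequest[] = [];
  for (const part of parts) {
    if (part.functionCall) {
      calls.push({ name: part.functionCall.name, args: part.functionCall.args });
    }
  }
  return calls;
}
```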

stateDiagram-v2
    [*] --> Created: new Turn()
    Created --> Running: run() called
    Running --> YieldingContent: Content parts received
    Running --> YieldingThought: Thought parts received
    Running --> CollectingToolCalls: FunctionCall parts received
    YieldingContent --> Running: continue iteration
    YieldingThought --> Running: continue iteration
    CollectingToolCalls --> Running: continue iteration
    Running --> Finished: FinishReason received
    Running --> Cancelled: signal.aborted
    Running --> Error: API error
    Finished --> [*]
    Cancelled --> [*]
    Error --> [*]

Tip: The pendingToolCalls on a Turn are critical — after sendMessageStream yields all events from a turn, the caller (typically the UI or SDK) checks turn.pendingToolCalls to dispatch tool execution through the Scheduler. The agentic loop continues only after tool results are fed back.
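The caller-side loop the tip describes can be sketched as follows (`runTurn` and `executeTool` are hypothetical stand-ins for the Turn/Scheduler machinery):

```typescript
interface ToolCall { name: string }
interface Turn { pendingToolCalls: ToolCall[] }

// Drive the agentic loop: run a turn, and if the model asked for tools,
// execute them and feed the results back as the next request.
async function runAgentLoop(
  runTurn: (input: string) => Promise<Turn>,
  executeTool: (call: ToolCall) => Promise<string>,
): Promise<number> {
  let input = "initial prompt";
  let turns = 0;
  while (true) {
    const turn = await runTurn(input);
    turns++;
    if (turn.pendingToolCalls.length === 0) break; // model is done
    const results = await Promise.all(turn.pendingToolCalls.map(executeTool));
    input = results.join("\n"); // tool results become the next model input
  }
  return turns;
}
```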

The ServerGeminiStreamEvent Union

The ServerGeminiStreamEvent is a discriminated union of 18 event types, defined by the GeminiEventType enum at line 52–71:

| Event Type | Purpose |
| --- | --- |
| Content | Streamed text from the model |
| Thought | Model's thinking/reasoning text |
| ToolCallRequest | Model requests a tool execution |
| ToolCallResponse | Result from an executed tool |
| ToolCallConfirmation | Confirmation details for user approval |
| ChatCompressed | Context was compressed to fit the window |
| Finished | Turn completed with reason and usage metadata |
| Retry | Stream retry in progress; discard partial content |
| Error | An error occurred |
| UserCancelled | User aborted the operation |
| LoopDetected | Infinite loop detected |
| MaxSessionTurns | Session turn limit reached |
| Citation | Citation or grounding metadata from the model |
| ContextWindowWillOverflow | Not enough tokens remaining for the request |
| InvalidStream | Stream response was invalid |
| ModelInfo | Reports which model is being used |
| AgentExecutionStopped | Hook stopped execution |
| AgentExecutionBlocked | Hook blocked execution |

This union is the contract between the backend and any frontend. The CLI's React components, the non-interactive handler, and the SDK all consume this same event stream, pattern-matching on event.type to drive their respective behaviors.
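The consumption pattern looks like this (a reduced subset of the union; the field shapes are assumptions). Because `type` is a literal discriminant, TypeScript narrows each `case` to the matching variant:

```typescript
type ServerGeminiStreamEvent =
  | { type: "content"; value: string }
  | { type: "thought"; value: string }
  | { type: "tool_call_request"; name: string }
  | { type: "error"; message: string };

function describe(event: ServerGeminiStreamEvent): string {
  switch (event.type) {
    case "content":
      return `text: ${event.value}`; // narrowed to the content variant
    case "thought":
      return `thinking: ${event.value}`;
    case "tool_call_request":
      return `tool: ${event.name}`; // `name` only exists on this variant
    case "error":
      return `error: ${event.message}`;
  }
}
```

Adding a new event type makes every non-exhaustive `switch` a compile error, which is what keeps three different frontends in sync with one backend contract.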

GeminiChat: Low-Level Session Management

GeminiChat is the lowest layer — a wrapper around the @google/genai SDK that maintains conversation history and handles mid-stream retries. The file itself is a modified fork of the upstream chats.ts, created to work around a bug where function responses weren't treated as valid responses.

The key innovation is mid-stream retry logic, configured at line 89–93:

const MID_STREAM_RETRY_OPTIONS: MidStreamRetryOptions = {
  maxAttempts: 4, // 1 initial call + 3 retries mid-stream
  initialDelayMs: 1000,
  useExponentialBackoff: true,
};

When a stream fails mid-response (network disconnect, invalid content), GeminiChat yields a RETRY event, discards partial results, and retries with exponential backoff. It also supports model fallback — if the primary model fails repeatedly, it can fall back to an alternative model via the handleFallback utility.
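The retry discipline can be sketched as a generic wrapper (names and the `onRetry` callback are illustrative, not the real GeminiChat API; in the real code the retry notification is yielded as a `Retry` stream event):

```typescript
interface RetryOptions {
  maxAttempts: number;
  initialDelayMs: number;
  useExponentialBackoff: boolean;
}

async function withRetry<T>(
  attempt: () => Promise<T>,
  opts: RetryOptions,
  onRetry: (attemptNo: number, delayMs: number) => void,
): Promise<T> {
  let delay = opts.initialDelayMs;
  for (let i = 1; i <= opts.maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (i === opts.maxAttempts) throw err; // out of attempts: surface the error
      onRetry(i, delay); // e.g. emit RETRY so the UI discards partial content
      await new Promise((r) => setTimeout(r, delay));
      if (opts.useExponentialBackoff) delay *= 2; // 1000ms, 2000ms, 4000ms...
    }
  }
  throw new Error("unreachable");
}
```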

sequenceDiagram
    participant GChat as GeminiChat
    participant CG as ContentGenerator
    participant API as Gemini API
    
    GChat->>CG: generateContentStream(request)
    CG->>API: HTTP stream
    API-->>CG: chunk 1
    CG-->>GChat: yield chunk 1
    API--x CG: connection error
    GChat->>GChat: yield RETRY event
    GChat->>GChat: backoff (1000ms)
    GChat->>CG: generateContentStream(request) [attempt 2]
    CG->>API: HTTP stream
    API-->>CG: full response
    CG-->>GChat: yield chunks

The ContentGenerator interface abstracts the actual API calls. Implementations include the standard GoogleGenAI-backed generator, a LoggingContentGenerator for debug tracing, a RecordingContentGenerator for session replay, and a FakeContentGenerator for testing.
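The decorator relationship among those implementations can be sketched like this (class names mirror the article, but the single-method interface is a deliberate simplification of the real one):

```typescript
interface ContentGenerator {
  generateContent(request: string): Promise<string>;
}

// Test double: returns a canned response without touching the network.
class FakeContentGenerator implements ContentGenerator {
  async generateContent(request: string): Promise<string> {
    return `fake response to: ${request}`;
  }
}

// Decorator: records traffic, then delegates to the wrapped generator.
// Callers only see the ContentGenerator interface, so wrapping is invisible.
class LoggingContentGenerator implements ContentGenerator {
  public log: string[] = [];
  constructor(private readonly inner: ContentGenerator) {}
  async generateContent(request: string): Promise<string> {
    this.log.push(`request: ${request}`);
    const response = await this.inner.generateContent(request);
    this.log.push(`response: ${response}`);
    return response;
  }
}
```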

Chat Compression and Loop Detection

Two safety mechanisms prevent the agentic loop from running off the rails.

Chat Compression

When conversation history approaches the model's context window limit, GeminiClient.processTurn() triggers compression. The ChatCompressionService summarizes earlier turns to free up token budget. The CompressionStatus enum at line 167–185 tracks outcomes:

  • COMPRESSED — summary successfully replaced older history
  • COMPRESSION_FAILED_INFLATED_TOKEN_COUNT — the summary came out longer than the history it replaced
  • CONTENT_TRUNCATED — a previous compression failed, so content was truncated to fit the budget

flowchart TD
    A[processTurn starts] --> B{Context management enabled?}
    B -- Yes --> C[AgentHistoryProvider.manageHistory]
    B -- No --> D[tryCompressChat]
    D --> E{Compression succeeded?}
    E -- Yes --> F[Yield ChatCompressed event]
    E -- No --> G[Track failure for future truncation]
    C --> H[Check remaining token count]
    F --> H
    G --> H
    H --> I{Request fits in window?}
    I -- Yes --> J[Continue with model call]
    I -- No --> K[Yield ContextWindowWillOverflow]

A notable detail: if compression fails once, the client sets hasFailedCompressionAttempt = true and falls back to content truncation on subsequent attempts rather than trying (and failing) to compress again.
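That one-way fallback can be sketched as follows (a toy model with hypothetical shapes; the real logic lives inside GeminiClient and its compression service):

```typescript
type CompressionStatus =
  | "COMPRESSED"
  | "COMPRESSION_FAILED_INFLATED_TOKEN_COUNT"
  | "CONTENT_TRUNCATED";

class HistoryManager {
  private hasFailedCompressionAttempt = false;

  manage(tryCompress: () => CompressionStatus, truncate: () => void): CompressionStatus {
    if (this.hasFailedCompressionAttempt) {
      truncate(); // compression already failed once: go straight to truncation
      return "CONTENT_TRUNCATED";
    }
    const status = tryCompress();
    if (status === "COMPRESSION_FAILED_INFLATED_TOKEN_COUNT") {
      this.hasFailedCompressionAttempt = true; // never retry compression
    }
    return status;
  }
}
```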

Loop Detection

The LoopDetectionService monitors event patterns across turns. At line 688, before each turn starts, it checks for repeating patterns. If a loop count of 1 is detected (an early warning), the client attempts recovery via _recoverFromLoop(). If the count exceeds 1, it yields a LoopDetected event and aborts.

Additionally, within a turn (line 757), each streaming event is fed to the loop detector for real-time monitoring. This catches loops that emerge mid-turn, such as the model repeatedly requesting the same tool call.
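A toy version of that mid-turn check (threshold and shapes are illustrative, not the real LoopDetectionService, which tracks several pattern kinds):

```typescript
// Flags when the same tool call, with identical arguments, repeats
// `threshold` times in a row; any different event resets the counter.
class ToolCallLoopDetector {
  private lastKey = "";
  private repeats = 0;

  constructor(private readonly threshold: number) {}

  // Returns true when the repetition threshold is reached.
  addEvent(name: string, args: string): boolean {
    const key = `${name}:${args}`;
    this.repeats = key === this.lastKey ? this.repeats + 1 : 1;
    this.lastKey = key;
    return this.repeats >= this.threshold;
  }
}
```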

Hook Integration Points

As we saw in sendMessageStream(), hooks intercept the agentic loop at two key points:

BeforeAgent — Fired before the first model call in a prompt sequence. Can:

  • Stop execution entirely (yield AgentExecutionStopped)
  • Block execution with a reason (yield AgentExecutionBlocked)
  • Inject additional context into the request

AfterAgent — Fired after the model responds with no pending tool calls. Can:

  • Stop execution and optionally clear context
  • Block and trigger a continuation turn with a new prompt
  • Let execution proceed normally

The hook state is tracked per prompt_id in a Map to handle the recursive nature of sendMessageStream. The activeCalls counter ensures BeforeAgent fires only once per prompt, even if the method is called recursively for continuation turns.
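The deduplication mechanics can be sketched like this (hypothetical shapes; the real hookStateMap carries more fields than shown here):

```typescript
interface HookState { hasFiredBeforeAgent: boolean; activeCalls: number }

class HookStateTracker {
  private readonly states = new Map<string, HookState>();

  // Called on entry to each (possibly recursive) sendMessageStream call.
  // Returns true only the first time, i.e. when BeforeAgent should fire.
  enter(promptId: string): boolean {
    let state = this.states.get(promptId);
    if (!state) {
      state = { hasFiredBeforeAgent: false, activeCalls: 0 };
      this.states.set(promptId, state);
    }
    state.activeCalls++;
    if (state.hasFiredBeforeAgent) return false; // continuation turn: skip hook
    state.hasFiredBeforeAgent = true;
    return true;
  }

  // Called in the finally block; state is dropped once the outermost
  // (non-recursive) call finishes.
  exit(promptId: string): void {
    const state = this.states.get(promptId);
    if (state && --state.activeCalls === 0) this.states.delete(promptId);
  }
}
```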

Two additional hook points — BeforeModel and AfterModel — operate at the GeminiChat level, intercepting individual API calls. BeforeModel can modify request config or return a synthetic response. AfterModel can transform the response or trigger additional actions. We'll explore the full hook system in Article 5.

Tip: The hookStateMap in GeminiClient (line 143) is key to understanding hook deduplication. If you're debugging why a hook fires only once despite multiple turns, check hasFiredBeforeAgent and activeCalls.

In the next article, we'll follow what happens when the model requests a tool call — exploring the tool system's builder pattern and the Scheduler's event-driven orchestration that validates, confirms, and executes tool invocations.