Connection Lifecycle: The Supervisor, Protocol Fallback, and Edge Discovery

Intermediate

Prerequisites

  • Article 1: Cloudflared Architecture Overview
  • Familiarity with Go select statements and channel-based concurrency

As we saw in Part 1, the tunnel initialization path ends with supervisor.StartTunnelDaemon, which creates a Supervisor and calls Run. That single function call is where cloudflared's reliability story begins. The Supervisor is responsible for maintaining up to four simultaneous connections to the Cloudflare edge, each potentially connecting to a different PoP. If a connection dies, the Supervisor restarts it. If QUIC is broken on the network, it falls back to HTTP/2. If an edge IP becomes unreachable, it rotates to a new one.

This article dissects the Supervisor's event-driven architecture, the protocol selection system, the QUIC health detection heuristic, and the DNS-based edge address discovery mechanism.

High-Availability Connection Model

Cloudflared doesn't maintain a single connection to the edge — it maintains multiple. The default is four, controlled by the --ha-connections flag. Each connection targets a different edge IP address, ideally in different PoPs, providing redundancy against both network failures and edge-side maintenance.

flowchart TD
    SUP[Supervisor] --> C0[Connection 0<br/>Edge PoP: LAX]
    SUP --> C1[Connection 1<br/>Edge PoP: SFO]
    SUP --> C2[Connection 2<br/>Edge PoP: SEA]
    SUP --> C3[Connection 3<br/>Edge PoP: DEN]
    
    style SUP fill:#7c3aed,color:#fff
    style C0 fill:#059669,color:#fff
    style C1 fill:#059669,color:#fff
    style C2 fill:#059669,color:#fff
    style C3 fill:#059669,color:#fff

There's a critical bootstrap constraint: the first connection must succeed before any HA connections are started. The initialize method launches connection 0 and blocks on a connectedSignal. Only after the first connection registers successfully does it start connections 1 through N-1, staggered by a one-second registrationInterval:

// At least one successful connection, so start the rest
for i := 1; i < s.config.HAConnections; i++ {
    s.tunnelsProtocolFallback[i] = &protocolFallback{...}
    go s.startTunnel(ctx, i, s.newConnectedTunnelSignal(i))
    time.Sleep(registrationInterval)
}

This design prevents a cascading failure where all four connections try to register simultaneously and all fail, burning through retry budgets.

The Supervisor Event Loop

The Supervisor.Run method is a classic Go select-based state machine. After initialization, it enters an infinite loop that selects over five event sources, each shown as a transition out of Running in the diagram below:

stateDiagram-v2
    [*] --> Running: initialize succeeds
    Running --> TunnelError: tunnelErrors channel
    Running --> BackoffExpired: backoffTimer fires
    Running --> TunnelConnected: nextConnectedSignal
    Running --> ShuttingDown: gracefulShutdownC closed
    Running --> Done: ctx.Done()
    
    TunnelError --> Running: schedule retry
    TunnelError --> Done: no retries left
    BackoffExpired --> Running: restart waiting tunnels
    TunnelConnected --> Running: clear backoff if all connected
    ShuttingDown --> Running: set shuttingDown flag
    Done --> [*]

The select watches these channels:

  1. ctx.Done(): Context cancellation. Drain remaining tunnel errors and exit.
  2. s.tunnelErrors: A tunnel connection died. Classify the error, schedule a retry with backoff, or reconnect immediately for ReconnectSignal errors.
  3. backoffTimer: The exponential backoff timer expired. Restart all waiting tunnels.
  4. s.nextConnectedSignal: A tunnel successfully connected. If no more tunnels are outstanding, reset the backoff grace period.
  5. gracefulShutdownC: A graceful shutdown was requested. Set the shuttingDown flag, which prevents new reconnection attempts while allowing existing connections to drain gracefully.

Tip: The Supervisor's event loop is a textbook example of how to build a reliable connection manager in Go. Study how it separates the concerns of error classification, backoff, and state tracking into distinct event handlers rather than mixing them into a single imperative loop.

EdgeTunnelServer: Per-Connection Lifecycle

While the Supervisor manages the fleet of connections, EdgeTunnelServer manages a single connection's lifecycle. Its Serve method handles the full attempt cycle for one connection index:

flowchart TD
    Start[Serve called] --> GetAddr[Get edge address for connIndex]
    GetAddr --> Connect[serveTunnel with current protocol]
    Connect --> ErrCheck{Error?}
    ErrCheck -->|None| Return[Return nil]
    ErrCheck -->|Error| Rotate{Should rotate IP?}
    Rotate -->|Yes| NewIP[GetDifferentAddr]
    Rotate -->|No| Backoff[Wait for backoff timer]
    NewIP --> Fallback{Should fallback protocol?}
    Backoff --> Fallback
    Fallback -->|Yes| Switch[selectNextProtocol]
    Fallback -->|No| Return2[Return error]
    Switch --> Return3[Return error with new protocol set]

The flow integrates three recovery strategies:

  • IP rotation via edgeAddrHandler.ShouldGetNewAddress
  • Protocol fallback via selectNextProtocol
  • Exponential backoff via the protocolFallback struct

Each connection maintains its own copy of the current protocol, because individual connections might need to fall back independently — one edge PoP might support QUIC while another doesn't.

Protocol Selection and QUIC Fallback

Cloudflared supports two transport protocols: QUIC (preferred) and HTTP/2 (fallback). The ProtocolSelector interface has three implementations:

flowchart TD
    Flag{--protocol flag?}
    Flag -->|quic or http2| Static[staticProtocolSelector<br/>No fallback]
    Flag -->|auto + token| Default[defaultProtocolSelector<br/>QUIC with HTTP/2 fallback]
    Flag -->|auto + no token| Remote[remoteProtocolSelector<br/>DNS-driven percentages]
    
    PQ{--post-quantum?}
    PQ -->|Yes| StaticQUIC[staticProtocolSelector<br/>Force QUIC]

The NewProtocolSelector factory function decides which implementation to use based on flags. The most interesting is remoteProtocolSelector, which periodically fetches protocol percentages via DNS TXT records and uses FNV32 hashing of the account tag to deterministically assign accounts to protocol buckets. switchThreshold reduces that hash to a stable number from 0 to 99 for each account, which is then compared against the fetched percentage.
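To make the bucketing idea concrete, here is a minimal sketch using the standard library's hash/fnv package. The function names, the choice of the FNV-1a variant, and the direction of the comparison are assumptions made for illustration; the article only states that an FNV32 hash of the account tag is reduced to a number from 0 to 99.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor maps an account tag to a stable number in [0, 100).
// Hypothetical helper: the FNV-1a variant is an assumption.
func bucketFor(accountTag string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(accountTag)) // Write on a hash.Hash never returns an error
	return h.Sum32() % 100
}

// useProtocol reports whether an account falls inside a rollout percentage.
// The "<" comparison is an assumption about how the threshold is applied.
func useProtocol(accountTag string, percentage uint32) bool {
	return bucketFor(accountTag) < percentage
}

func main() {
	for _, tag := range []string{"account-a", "account-b", "account-c"} {
		fmt.Printf("%s -> bucket %d, selected at 50%%: %v\n",
			tag, bucketFor(tag), useProtocol(tag, 50))
	}
}
```

Because the bucket depends only on the account tag, an account keeps the same assignment across restarts and across hosts, and raising the percentage only ever adds accounts to the selected set.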

The QUIC health detection heuristic, isQuicBroken, checks for two specific error patterns:

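// isQuicBroken reports whether the error suggests QUIC cannot work on this network.
func isQuicBroken(cause error) bool {
    // Nothing arrived before the idle timeout: UDP is probably blocked.
    var idleTimeoutError *quic.IdleTimeoutError
    if errors.As(cause, &idleTimeoutError) {
        return true
    }
    // A transport error whose message shows that sending UDP was refused.
    var transportError *quic.TransportError
    if errors.As(cause, &transportError) &&
        strings.Contains(cause.Error(), "operation not permitted") {
        return true
    }
    return false
}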

IdleTimeoutError means QUIC packets aren't getting through at all (likely UDP is blocked). A transport error containing "operation not permitted" typically indicates a firewall is actively rejecting UDP. Both trigger a fallback to HTTP/2.

IP Rotation and Error Classification

The ipAddrFallback struct classifies connection errors into three strategies:

| Error type | Action | Connectivity error? |
| --- | --- | --- |
| nil | Keep current IP | No |
| DupConnRegisterTunnelError, IdleTimeoutError | Rotate IP immediately | No |
| DialError, EdgeQuicDialError | Rotate IP + count retries | Yes |
| Everything else | Keep current IP | No |

The retry counter per connection index tracks how many consecutive dial failures have occurred. After maxRetries failures, the error is classified as a connectivity error with HasReachedMaxRetries() == true, which triggers protocol fallback.

This layered approach means cloudflared tries the gentlest recovery first (new IP), and only escalates to protocol fallback after exhausting IP rotation options.

Edge Address Discovery via DNS

The edgediscovery package resolves Cloudflare edge IP addresses and manages their assignment to connection indices. The Edge struct wraps a region-aware address pool:

sequenceDiagram
    participant Sup as Supervisor
    participant Edge as Edge
    participant Regions as Regions
    participant DNS as DNS Resolver

    Sup->>Edge: ResolveEdge(log, region, ipVersion)
    Edge->>DNS: Resolve SRV/A records
    DNS-->>Edge: IP addresses
    Edge->>Regions: Build region-aware pools

    Sup->>Edge: GetAddr(connIndex=0)
    Edge->>Regions: AddrUsedBy(0) → nil
    Edge->>Regions: GetUnusedAddr(nil, 0)
    Regions-->>Edge: EdgeAddr{UDP, TCP}
    
    Note over Sup: Connection fails
    Sup->>Edge: GetDifferentAddr(0, true)
    Edge->>Regions: GiveBack(oldAddr, true)
    Edge->>Regions: GetUnusedAddr(oldAddr, 0)
    Regions-->>Edge: New EdgeAddr

Key design decisions in edge discovery:

  • Per-connection assignment: Each connection index is bound to a specific address via AddrUsedBy. This prevents two connections from accidentally targeting the same edge address.
  • Give-back with health tracking: When GiveBack is called with hasConnectivityError=true, the address is deprioritized in future allocations.
  • Static edge support: For testing or custom deployments, StaticEdge accepts explicit hostnames instead of resolving via DNS.

The GetAddr method first checks if the connection already has an assigned address (for reconnection stability), then falls back to allocating an unused one. This sticky assignment means a reconnecting tunnel will try the same edge PoP first, maintaining connection affinity.
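The allocation behavior described above can be approximated with a toy pool. This is a sketch under stated assumptions, not the edgediscovery package: the pool type, its string addresses, and the two-pass "healthy first, suspect last" ordering are hypothetical choices standing in for whatever deprioritization the real Regions type performs.

```go
package main

import "fmt"

// pool is a hypothetical, single-region address pool.
type pool struct {
	addrs   []string
	usedBy  map[string]int  // address -> connection index; absent means unused
	suspect map[string]bool // addresses given back with a connectivity error
}

func newPool(addrs ...string) *pool {
	return &pool{addrs: addrs, usedBy: map[string]int{}, suspect: map[string]bool{}}
}

// addrUsedBy returns the address currently bound to connIndex, if any.
func (p *pool) addrUsedBy(connIndex int) (string, bool) {
	for _, a := range p.addrs {
		if idx, ok := p.usedBy[a]; ok && idx == connIndex {
			return a, true
		}
	}
	return "", false
}

// getUnused binds the first free address, preferring ones without a
// recorded connectivity error, and never returning exclude.
func (p *pool) getUnused(connIndex int, exclude string) (string, bool) {
	for _, wantSuspect := range []bool{false, true} {
		for _, a := range p.addrs {
			_, used := p.usedBy[a]
			if used || a == exclude || p.suspect[a] != wantSuspect {
				continue
			}
			p.usedBy[a] = connIndex
			return a, true
		}
	}
	return "", false
}

// getAddr is sticky: a connection that already has an address keeps it.
func (p *pool) getAddr(connIndex int) (string, bool) {
	if a, ok := p.addrUsedBy(connIndex); ok {
		return a, true
	}
	return p.getUnused(connIndex, "")
}

// getDifferentAddr gives the old address back and binds a different one.
func (p *pool) getDifferentAddr(connIndex int, hasConnectivityError bool) (string, bool) {
	old, had := p.addrUsedBy(connIndex)
	if had {
		delete(p.usedBy, old)
		if hasConnectivityError {
			p.suspect[old] = true
		}
	}
	return p.getUnused(connIndex, old)
}

func main() {
	p := newPool("198.51.100.1", "198.51.100.2", "198.51.100.3")
	first, _ := p.getAddr(0)
	again, _ := p.getAddr(0)
	rotated, _ := p.getDifferentAddr(0, true)
	fmt.Println(first, again, rotated)
}
```

The sticky lookup in getAddr is what gives a reconnecting tunnel the same address on its next attempt, while getDifferentAddr is the explicit escape hatch used after a failure.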

What's Next

We've seen how the Supervisor orchestrates connections and how cloudflared recovers from failures. In Part 3, we'll go deeper into the transport layer itself — how QUIC streams map to individual requests, how the control stream handshake works, how HTTP/2 differs as a fallback transport, and how the sophisticated datagram v3 system multiplexes UDP sessions and ICMP packets over QUIC datagrams.