The Supervisor: How cloudflared Manages Edge Connections

Advanced

Prerequisites

  • Article 1: cloudflared Architecture (codebase layout and config assembly)
  • Go concurrency patterns (goroutines, channels, errgroup, context cancellation)
  • Basic understanding of QUIC and HTTP/2 protocols

As we saw in Part 1, once prepareTunnelConfig() assembles the configuration and the Orchestrator is created, the supervisor takes over. It's the component responsible for the most critical operational concern: keeping cloudflared reliably connected to Cloudflare's edge network, even in the face of network instability, protocol incompatibilities, and rolling deployments.

The Supervisor is deceptively simple in structure — it's an event loop coordinating goroutines — but the subtleties of its retry logic, protocol fallback strategy, and edge address management make it one of the most sophisticated parts of the codebase.

Supervisor Structure and Event Loop

The Supervisor struct manages the lifecycle of all edge connections:

type Supervisor struct {
    config                  *TunnelConfig
    orchestrator            *orchestration.Orchestrator
    edgeIPs                 *edgediscovery.Edge
    edgeTunnelServer        TunnelServer
    tunnelErrors            chan tunnelError
    tunnelsConnecting       map[int]chan struct{}
    tunnelsProtocolFallback map[int]*protocolFallback
    nextConnectedIndex      int
    nextConnectedSignal     chan struct{}
    log                     *ConnAwareLogger
    reconnectCh             chan ReconnectSignal
    gracefulShutdownC       <-chan struct{}
}

Two maps are the key to understanding per-connection state: tunnelsConnecting tracks which connections are currently in the process of connecting, and tunnelsProtocolFallback maintains per-connection protocol state and retry backoff. Each connection index (0 through HAConnections-1) has independent fallback state — this means connection 0 could be using QUIC while connection 2 has already fallen back to HTTP/2.

The Run() method is a classic Go select-loop that monitors five channels:

stateDiagram-v2
    [*] --> Initializing: Run() called
    Initializing --> Running: First tunnel connected
    Initializing --> Failed: initialization error
    
    Running --> HandlingError: tunnelError received
    HandlingError --> Running: ReconnectSignal (immediate)
    HandlingError --> WaitingBackoff: Error needs retry
    HandlingError --> Exiting: All tunnels done
    
    WaitingBackoff --> Running: backoff timer fired
    
    Running --> ShuttingDown: gracefulShutdownC closed
    Running --> Exiting: ctx.Done()
    ShuttingDown --> Exiting: All tunnels drained

The event loop handles:

  • ctx.Done(): Context cancellation — drains all active tunnel errors and exits
  • tunnelErrors: Connection terminated — either reconnect immediately (for ReconnectSignal) or queue for backoff retry
  • backoffTimer: Backoff period elapsed — restart all waiting tunnels
  • nextConnectedSignal: A tunnel successfully connected — possibly clear the backoff timer
  • gracefulShutdownC: Enter shutdown mode — stop starting new tunnels, let existing ones drain
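The loop described above can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not cloudflared's actual Run(): the names runLoop, demo, and the events channel are hypothetical, and restarting tunnels is reduced to reporting which indexes would be restarted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// tunnelError is a simplified stand-in for the supervisor's error message.
type tunnelError struct {
	index int
	err   error
}

// runLoop is a hypothetical, stripped-down event loop over five channels.
// Every handled event is reported on events so the caller can observe it.
func runLoop(ctx context.Context, tunnelErrors <-chan tunnelError,
	connected, shutdown <-chan struct{}, backoff time.Duration, events chan<- string) {
	var backoffTimer <-chan time.Time // nil until armed; a nil channel never fires
	var waiting []int                 // connection indexes queued for restart
	shuttingDown := false
	for {
		select {
		case <-ctx.Done():
			events <- "exit"
			return
		case te := <-tunnelErrors:
			events <- fmt.Sprintf("tunnel %d failed: %v", te.index, te.err)
			if shuttingDown {
				continue // do not queue restarts while draining
			}
			waiting = append(waiting, te.index)
			if backoffTimer == nil {
				backoffTimer = time.After(backoff)
			}
		case <-backoffTimer:
			events <- fmt.Sprintf("restarting tunnels %v", waiting)
			waiting, backoffTimer = nil, nil
		case <-connected:
			events <- "tunnel connected"
		case <-shutdown:
			shuttingDown = true
			shutdown = nil // a closed channel is always ready; stop selecting on it
			events <- "shutting down"
		}
	}
}

// demo drives the loop through a connect, a failure, a backoff and an exit.
func demo() []string {
	ctx, cancel := context.WithCancel(context.Background())
	tunnelErrors := make(chan tunnelError)
	connected := make(chan struct{})
	shutdown := make(chan struct{})
	events := make(chan string)
	go runLoop(ctx, tunnelErrors, connected, shutdown, 10*time.Millisecond, events)

	var seen []string
	connected <- struct{}{}
	seen = append(seen, <-events)
	tunnelErrors <- tunnelError{index: 2, err: errors.New("edge closed")}
	seen = append(seen, <-events)
	seen = append(seen, <-events) // blocks until the backoff timer fires
	cancel()
	seen = append(seen, <-events)
	return seen
}

func main() {
	for _, e := range demo() {
		fmt.Println(e)
	}
}
```

Keeping the timer channel nil until a failure arms it is a common Go idiom: the select case simply never fires while nothing is waiting for a retry.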

Connection Initialization: First Tunnel Then HA

The initialize() method reveals an important design decision: the first connection is special.

sequenceDiagram
    participant S as Supervisor
    participant E0 as EdgeTunnel[0]
    participant E1 as EdgeTunnel[1]
    participant E2 as EdgeTunnel[2]
    participant E3 as EdgeTunnel[3]

    S->>E0: startFirstTunnel() (goroutine)
    Note over S: Block waiting for<br/>connectedSignal, tunnelError,<br/>or gracefulShutdown
    E0-->>S: connectedSignal ✓
    S->>E1: startTunnel(1)
    Note over S: Sleep(registrationInterval)
    S->>E2: startTunnel(2)
    Note over S: Sleep(registrationInterval)
    S->>E3: startTunnel(3)
    Note over S: Enter main event loop

The first connection validates the entire configuration: credentials, tunnel ID, ingress rules, and protocol selection. Only after it successfully registers does the supervisor launch remaining HA connections. This prevents a misconfigured tunnel from hammering the edge with N simultaneous failed registration attempts.

Note the time.Sleep(registrationInterval) (1 second) between launching HA connections — this staggers registrations to avoid edge rate limiting.
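A rough sketch of this "first tunnel, then staggered HA" sequence is shown here. The function initialize and its callback parameters are hypothetical simplifications; in particular, the sketch calls startTunnel synchronously so the order is easy to observe, whereas a real supervisor would launch each tunnel in its own goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// initialize is a hypothetical simplification of the startup sequence:
// block on the first tunnel, then stagger the remaining HA connections.
func initialize(haConnections int, interval time.Duration,
	connectFirst func() error, startTunnel func(index int)) error {
	firstResult := make(chan error, 1)
	go func() { firstResult <- connectFirst() }()

	// Nothing else starts until the first tunnel has validated the config.
	if err := <-firstResult; err != nil {
		return fmt.Errorf("first tunnel failed, skipping HA connections: %w", err)
	}
	for i := 1; i < haConnections; i++ {
		startTunnel(i)       // a real supervisor would use a goroutine here
		time.Sleep(interval) // stagger registrations
	}
	return nil
}

// demo returns the HA indexes started for a healthy and a misconfigured tunnel.
func demo() (healthy, broken []int, brokenErr error) {
	_ = initialize(4, time.Millisecond,
		func() error { return nil },
		func(i int) { healthy = append(healthy, i) })
	brokenErr = initialize(4, time.Millisecond,
		func() error { return errors.New("invalid credentials") },
		func(i int) { broken = append(broken, i) })
	return healthy, broken, brokenErr
}

func main() {
	healthy, broken, err := demo()
	fmt.Println("healthy run started:", healthy)
	fmt.Println("broken run started:", broken, "error:", err)
}
```

In the misconfigured case no HA connection is ever started, which is the behavior that spares the edge from repeated failed registrations.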

The startFirstTunnel() method also has special retry logic. It retries on specific error types (duplicate connection, QUIC timeouts, dial errors, control stream errors) but bails on unknown errors. It also specifically retries on Unauthorized errors, hoping for edge propagation lag to resolve — a pragmatic choice for newly created tunnels.

EdgeTunnelServer: The Connection Lifecycle

The EdgeTunnelServer.Serve() method is where individual connections live their lifecycle. It's a rich function that manages address assignment, protocol selection, connection establishment, error handling, and protocol fallback.

flowchart TD
    Start["Serve(connIndex)"] --> GetAddr["edgeAddrs.GetAddr(connIndex)"]
    GetAddr --> ServeTunnel["serveTunnel()"]
    ServeTunnel --> ServeConn["serveConnection()"]
    ServeConn --> Switch{"Protocol?"}
    Switch -->|QUIC| QUIC["serveQUIC()"]
    Switch -->|HTTP/2| H2["serveHTTP2()"]
    
    QUIC --> ErrorCheck{"Error?"}
    H2 --> ErrorCheck
    
    ErrorCheck -->|nil| Done["Return nil"]
    ErrorCheck -->|Error| RotateIP{"Should rotate IP?"}
    RotateIP -->|Yes| NewIP["GetDifferentAddr()"]
    RotateIP -->|No| Backoff["Wait backoff timer"]
    NewIP --> FallbackCheck{"Max retries?"}
    FallbackCheck -->|Yes| FallbackProto["shouldFallbackProtocol = true"]
    FallbackCheck -->|No| Backoff
    FallbackProto --> Backoff
    Backoff --> ProtoFallback{"Should fallback protocol?"}
    ProtoFallback -->|Yes| SelectNext["selectNextProtocol()"]
    ProtoFallback -->|No| Return["Return error"]
    SelectNext --> Return

The serveTunnel() → serveConnection() call chain is where the actual protocol connection happens. In serveConnection(), the switch on protocol type determines the connection path:

  • QUIC: Dials a UDP connection, wraps it in a quic.Connection, creates the appropriate datagram handler (v2 or v3 based on feature flags), and serves via errgroup
  • HTTP/2: Dials a TCP connection with TLS, creates an HTTP2Connection, and serves

Both paths use errgroup to run the connection serve loop alongside a reconnect listener, ensuring clean shutdown when either of the two returns an error.

Protocol Selection and Fallback

The protocol fallback system is one of cloudflared's most important reliability mechanisms. The protocolFallback struct wraps a BackoffHandler with protocol state:

type protocolFallback struct {
    retry.BackoffHandler
    protocol   connection.Protocol
    inFallback bool
}

The selectNextProtocol() function implements the fallback decision:

  1. If max retries are reached or QUIC appears broken (idle timeout, operation not permitted), attempt fallback
  2. The fallback chain is QUIC → HTTP/2 → none (defined in connection/protocol.go#L41-L50)
  3. If already using the fallback protocol, there's no further fallback — stop retrying
  4. An important guard: if any connection has already succeeded with the current protocol (checked via ConnTracker), don't fall back. The protocol works, so the failure is likely a transient edge issue
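The four rules above can be condensed into a small decision function. This is a simplified sketch, not the real selectNextProtocol(): the types Protocol, decision, and fallbackState, and the boolean inputs, are assumptions made for illustration (the real code derives them from the backoff handler, the error, and the connection tracker).

```go
package main

import "fmt"

type Protocol string

const (
	QUIC  Protocol = "quic"
	HTTP2 Protocol = "http2"
)

// fallbackOf encodes the chain QUIC -> HTTP/2 -> none.
func fallbackOf(p Protocol) (Protocol, bool) {
	if p == QUIC {
		return HTTP2, true
	}
	return "", false
}

type decision string

const (
	keepProtocol decision = "keep"
	switched     decision = "switch"
	stopRetrying decision = "stop"
)

// fallbackState is a hypothetical, reduced form of the per-connection state.
type fallbackState struct {
	protocol   Protocol
	inFallback bool
}

// selectNext applies the rules in order and mutates s when it switches.
func selectNext(s *fallbackState, maxRetriesReached, quicBroken, succeededWithCurrent bool) decision {
	quicBroken = quicBroken && s.protocol == QUIC
	if !maxRetriesReached && !quicBroken {
		return keepProtocol // rule 1: no reason to fall back yet
	}
	if succeededWithCurrent {
		return keepProtocol // rule 4: the protocol works elsewhere, likely transient
	}
	next, ok := fallbackOf(s.protocol) // rule 2: follow the chain
	if !ok {
		return stopRetrying // rule 3: already on the last protocol
	}
	s.protocol, s.inFallback = next, true
	return switched
}

func main() {
	s := &fallbackState{protocol: QUIC}
	fmt.Println(selectNext(s, false, true, false), s.protocol)
	fmt.Println(selectNext(s, true, false, false), s.protocol)
}
```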

The initial protocol selection in NewProtocolSelector() uses a hash-based approach: the account tag is hashed with FNV-32a, and the result modulo 100 produces a threshold that's compared against remote protocol percentages fetched via DNS TXT records. This enables Cloudflare to gradually roll out QUIC by increasing the QUIC percentage — accounts hash to a stable position, so they won't flip-flop between protocols.
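To make the hashing step concrete, here is a minimal sketch using Go's standard hash/fnv package. The function names switchThreshold and selectProtocol and the example account tag are illustrative; the percentage would normally come from the DNS TXT record rather than being passed in directly.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// switchThreshold maps an account tag to a stable bucket in [0, 100).
func switchThreshold(accountTag string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(accountTag)) // Write on an fnv hash never fails
	return h.Sum32() % 100
}

// selectProtocol picks QUIC when the rollout percentage exceeds the
// account's bucket, and HTTP/2 otherwise.
func selectProtocol(accountTag string, quicPercentage uint32) string {
	if quicPercentage > switchThreshold(accountTag) {
		return "quic"
	}
	return "http2"
}

func main() {
	tag := "example-account-tag" // hypothetical account tag
	fmt.Println("bucket:", switchThreshold(tag))
	for _, pct := range []uint32{0, 50, 100} {
		fmt.Printf("quic at %d%% -> %s\n", pct, selectProtocol(tag, pct))
	}
}
```

Because the bucket depends only on the account tag, raising the percentage can move an account from HTTP/2 to QUIC but never back again.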

Tip: If QUIC connections keep failing, check if UDP port 7844 is being blocked by a firewall. The isQuicBroken() function specifically detects idle timeout errors and operation not permitted transport errors as indicators of UDP egress being blocked.

Edge Discovery via DNS

Before any connection can be established, cloudflared needs edge addresses. The edgediscovery package handles this through DNS SRV record resolution.

sequenceDiagram
    participant S as Supervisor
    participant E as Edge
    participant DNS as DNS Resolver
    
    S->>E: ResolveEdge(region, ipVersion)
    E->>DNS: SRV lookup for<br/>_argotunnel._tcp
    DNS-->>E: List of edge addresses
    E->>E: Build address pool<br/>with per-connection allocation
    
    S->>E: GetAddr(connIndex=0)
    E-->>S: EdgeAddr{UDP, TCP}
    
    Note over S: Connection fails...
    S->>E: GetDifferentAddr(connIndex=0)
    E->>E: GiveBack old address<br/>Mark connectivity error
    E-->>S: New EdgeAddr

The Edge struct maintains a pool of addresses with per-connection allocation. Key methods:

  • GetAddr(connIndex): Returns the address already assigned to this connection, or assigns a new unused one
  • GetDifferentAddr(connIndex, hasConnectivityError): Gives back the current address and gets a new one — used on connection failures
  • GiveBack(addr, hasConnectivityError): Returns an address to the pool, tagging it with whether there was a connectivity error

The ipAddrFallback handler in supervisor/tunnel.go#L129-L167 determines when to rotate addresses. Duplicate connection errors and QUIC idle timeouts trigger immediate rotation. Dial errors track retries per connection index and report a ConnectivityError with a max-retries flag — when max retries are reached, the next protocol fallback cycle is triggered.
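The per-connection allocation can be illustrated with a toy address pool. This is a sketch under simplifying assumptions: the pool type, string addresses, and first-free allocation order are hypothetical, and locking is omitted even though a pool shared between goroutines would need it.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoAddressesLeft = errors.New("no unused edge addresses left")

// pool is a toy model of per-connection edge address allocation.
type pool struct {
	addrs    []string
	assigned map[int]string  // connIndex -> address currently held
	inUse    map[string]bool // addresses held by some connection
	hadError map[string]bool // last reported connectivity status per address
}

func newPool(addrs ...string) *pool {
	return &pool{addrs: addrs, assigned: map[int]string{},
		inUse: map[string]bool{}, hadError: map[string]bool{}}
}

// GetAddr returns the address already held by connIndex, or assigns a new one.
func (p *pool) GetAddr(connIndex int) (string, error) {
	if a, ok := p.assigned[connIndex]; ok {
		return a, nil
	}
	return p.assign(connIndex, "")
}

// GetDifferentAddr gives back the current address and picks another one.
func (p *pool) GetDifferentAddr(connIndex int, hasConnectivityError bool) (string, error) {
	old, ok := p.assigned[connIndex]
	if ok {
		p.GiveBack(old, hasConnectivityError)
		delete(p.assigned, connIndex)
	}
	return p.assign(connIndex, old)
}

// GiveBack returns addr to the pool and records whether it had an error.
func (p *pool) GiveBack(addr string, hasConnectivityError bool) {
	delete(p.inUse, addr)
	p.hadError[addr] = hasConnectivityError
}

// assign picks the first free address that is not the excluded one.
func (p *pool) assign(connIndex int, exclude string) (string, error) {
	for _, a := range p.addrs {
		if a == exclude || p.inUse[a] {
			continue
		}
		p.inUse[a] = true
		p.assigned[connIndex] = a
		return a, nil
	}
	return "", errNoAddressesLeft
}

func main() {
	p := newPool("198.51.100.1:7844", "198.51.100.2:7844", "198.51.100.3:7844")
	first, _ := p.GetAddr(0)
	second, _ := p.GetAddr(1)
	rotated, _ := p.GetDifferentAddr(0, true)
	fmt.Println(first, second, rotated)
}
```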

Post-Quantum TLS and Feature Flags

cloudflared supports post-quantum key exchange for QUIC connections, with platform-aware curve selection defined in supervisor/pqtunnels.go:

Mode   | Non-FIPS                         | FIPS
Strict | X25519MLKEM768 only              | P256Kyber768Draft00 only
Prefer | X25519MLKEM768 + existing curves | P256Kyber768Draft00 + P256

The curvePreference() function selects curves based on the PQ mode and FIPS compliance. In strict mode, only PQ curves are allowed — this means connections will fail if the edge doesn't support the specific PQ curve. In prefer mode, PQ curves are prepended to the existing curve list, allowing graceful fallback.
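The strict versus prefer behavior can be sketched as follows. Curve names are plain strings here to keep the example independent of any particular Go release's crypto/tls constants, and the pqMode type and the function signature are assumptions rather than the real curvePreference() signature.

```go
package main

import "fmt"

type pqMode int

const (
	pqPrefer pqMode = iota
	pqStrict
)

// curvePreference returns the curve list for a PQ mode, in preference order.
// Simplified sketch: curves are identified by name only.
func curvePreference(mode pqMode, fips bool, current []string) []string {
	pq := "X25519MLKEM768"
	if fips {
		pq = "P256Kyber768Draft00"
	}
	if mode == pqStrict {
		// Only the PQ curve is offered; the handshake fails without edge support.
		return []string{pq}
	}
	// Prefer mode: put the PQ curve first and keep the existing curves as fallback.
	out := []string{pq}
	for _, c := range current {
		if c != pq { // avoid listing the PQ curve twice
			out = append(out, c)
		}
	}
	return out
}

func main() {
	existing := []string{"X25519", "P256"}
	fmt.Println(curvePreference(pqStrict, false, existing))
	fmt.Println(curvePreference(pqPrefer, false, existing))
	fmt.Println(curvePreference(pqPrefer, true, []string{"P256"}))
}
```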

The feature flag system that controls PQ mode and datagram version selection lives in features/selector.go. The FeatureSelector resolves DNS TXT records from cfd-features.argotunnel.com, parses a JSON payload with percentage fields, and uses per-account hashing to determine which features are enabled:

flowchart LR
    DNS["DNS TXT<br/>cfd-features.argotunnel.com"] --> Parse["Parse JSON<br/>{dv3_2: 50}"]
    Parse --> Hash["FNV-32a(accountTag) % 100"]
    Hash --> Compare{"accountHash < percentage?"}
    Compare -->|Yes| Enabled["Feature Enabled<br/>(DatagramV3)"]
    Compare -->|No| Disabled["Feature Disabled<br/>(DatagramV2)"]

The selector refreshes hourly (defaultLookupFreq = time.Hour) in a background goroutine. This means feature flag changes propagate within an hour without requiring tunnel restarts. The hash-based threshold ensures that when Cloudflare bumps a percentage from 30% to 50%, only accounts hashing between 30-49 are newly affected — accounts already enabled stay enabled.
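The rollout property is easy to check with a few lines of Go. In this sketch the JSON key dv3_2 is taken from the diagram above, while the featuresRecord struct and the helpers parseRecord, enabled, and newlyEnabled are illustrative names; account hashes are modeled directly as buckets 0 through 99.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// featuresRecord is a hypothetical model of the TXT record's JSON payload.
type featuresRecord struct {
	DatagramV3Percentage uint32 `json:"dv3_2"`
}

// parseRecord decodes the JSON payload carried in the TXT record.
func parseRecord(txt []byte) (featuresRecord, error) {
	var r featuresRecord
	err := json.Unmarshal(txt, &r)
	return r, err
}

// enabled applies the threshold rule: accountHash < percentage.
func enabled(accountHash, percentage uint32) bool {
	return accountHash < percentage
}

// newlyEnabled lists the hash buckets that flip from disabled to enabled
// when the percentage is raised from oldPct to newPct.
func newlyEnabled(oldPct, newPct uint32) []uint32 {
	var out []uint32
	for h := uint32(0); h < 100; h++ {
		if !enabled(h, oldPct) && enabled(h, newPct) {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	rec, err := parseRecord([]byte(`{"dv3_2": 50}`))
	if err != nil {
		panic(err)
	}
	flipped := newlyEnabled(30, rec.DatagramV3Percentage)
	fmt.Printf("%d buckets newly enabled: %d..%d\n",
		len(flipped), flipped[0], flipped[len(flipped)-1])
}
```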

Tip: You can override feature flags locally using the --features CLI flag. For example, --features datagram-v3-2 forces datagram v3 regardless of the remote percentage.

What's Next

Now that you understand how the Supervisor manages connections and handles fallback, the next article dives into what happens inside those connections. We'll explore the QUIC connection's three-goroutine errgroup architecture, the HTTP/2 connection's handler-based model, and the Cap'n Proto control stream that registers tunnels with the edge.