The Supervisor: How cloudflared Manages Edge Connections
Prerequisites
- Article 1: cloudflared Architecture (codebase layout and config assembly)
- Go concurrency patterns (goroutines, channels, errgroup, context cancellation)
- Basic understanding of QUIC and HTTP/2 protocols
As we saw in Part 1, once prepareTunnelConfig() assembles the configuration and the Orchestrator is created, the supervisor takes over. It's the component responsible for the most critical operational concern: keeping cloudflared reliably connected to Cloudflare's edge network, even in the face of network instability, protocol incompatibilities, and rolling deployments.
The Supervisor is deceptively simple in structure — it's an event loop coordinating goroutines — but the subtleties of its retry logic, protocol fallback strategy, and edge address management make it one of the most sophisticated parts of the codebase.
Supervisor Structure and Event Loop
The Supervisor struct manages the lifecycle of all edge connections:
type Supervisor struct {
    config                  *TunnelConfig
    orchestrator            *orchestration.Orchestrator
    edgeIPs                 *edgediscovery.Edge
    edgeTunnelServer        TunnelServer
    tunnelErrors            chan tunnelError            // reports terminated connections
    tunnelsConnecting       map[int]chan struct{}       // connections currently connecting
    tunnelsProtocolFallback map[int]*protocolFallback   // per-connection protocol state and backoff
    nextConnectedIndex      int
    nextConnectedSignal     chan struct{}               // fires when a tunnel has connected
    log                     *ConnAwareLogger
    reconnectCh             chan ReconnectSignal
    gracefulShutdownC       <-chan struct{}             // closed to enter shutdown mode
}
Two maps are the key to understanding per-connection state: tunnelsConnecting tracks which connections are currently in the process of connecting, and tunnelsProtocolFallback maintains per-connection protocol state and retry backoff. Each connection index (0 through HAConnections-1) has independent fallback state — this means connection 0 could be using QUIC while connection 2 has already fallen back to HTTP/2.
The Run() method is a classic Go select-loop that monitors five channels:
stateDiagram-v2
[*] --> Initializing: Run() called
Initializing --> Running: First tunnel connected
Initializing --> Failed: initialization error
Running --> HandlingError: tunnelError received
HandlingError --> Running: ReconnectSignal (immediate)
HandlingError --> WaitingBackoff: Error needs retry
HandlingError --> Exiting: All tunnels done
WaitingBackoff --> Running: backoff timer fired
Running --> ShuttingDown: gracefulShutdownC closed
Running --> Exiting: ctx.Done()
ShuttingDown --> Exiting: All tunnels drained
The event loop handles:
- ctx.Done(): Context cancellation. Drains all active tunnel errors and exits.
- tunnelErrors: Connection terminated. Either reconnect immediately (for ReconnectSignal) or queue for backoff retry.
- backoffTimer: Backoff period elapsed. Restart all waiting tunnels.
- nextConnectedSignal: A tunnel successfully connected. Possibly clear the backoff timer.
- gracefulShutdownC: Enter shutdown mode. Stop starting new tunnels and let existing ones drain.
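The shape of such a loop can be sketched in a few lines of Go. This is a simplified illustration, not the cloudflared source: `runLoop`, `simulate`, the event strings, and the reduced `tunnelError` type are hypothetical stand-ins for the channels listed above, and the real loop does considerably more work in each case.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// tunnelError is a hypothetical stand-in for the message a finished tunnel reports.
type tunnelError struct {
	index int
	err   error
}

// runLoop sketches a supervisor-style select loop and returns the events it
// handled, in order, so the behaviour can be observed.
func runLoop(ctx context.Context, tunnelErrors <-chan tunnelError, connected, shutdown <-chan struct{}, backoff time.Duration) []string {
	var events []string
	var backoffTimer <-chan time.Time // nil until a retry is queued; a nil channel never fires
	waiting := 0
	for {
		select {
		case <-ctx.Done():
			return append(events, "exit")
		case te := <-tunnelErrors:
			events = append(events, fmt.Sprintf("error:%d", te.index))
			waiting++
			if backoffTimer == nil {
				backoffTimer = time.After(backoff)
			}
		case <-backoffTimer:
			events = append(events, fmt.Sprintf("restart:%d", waiting))
			waiting, backoffTimer = 0, nil
		case <-connected:
			events = append(events, "connected")
		case <-shutdown:
			events = append(events, "shutdown")
			shutdown = nil // a closed channel would otherwise fire on every iteration
		}
	}
}

// simulate drives the loop with one connect, one error, one backoff expiry, and a cancel.
func simulate() []string {
	ctx, cancel := context.WithCancel(context.Background())
	tunnelErrors := make(chan tunnelError)
	connected := make(chan struct{})
	shutdown := make(chan struct{})
	done := make(chan []string)
	go func() { done <- runLoop(ctx, tunnelErrors, connected, shutdown, 5*time.Millisecond) }()

	connected <- struct{}{}
	tunnelErrors <- tunnelError{index: 2}
	time.Sleep(200 * time.Millisecond) // give the backoff timer time to fire
	cancel()
	return <-done
}

func main() {
	fmt.Println(simulate())
}
```

Keeping the backoff timer as a nil channel until a retry is queued is a common idiom: the select case simply never becomes ready, so no separate "timer armed" flag is needed.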
Connection Initialization: First Tunnel Then HA
The initialize() method reveals an important design decision: the first connection is special.
sequenceDiagram
participant S as Supervisor
participant E0 as EdgeTunnel[0]
participant E1 as EdgeTunnel[1]
participant E2 as EdgeTunnel[2]
participant E3 as EdgeTunnel[3]
S->>E0: startFirstTunnel() (goroutine)
Note over S: Block waiting for<br/>connectedSignal, tunnelError,<br/>or gracefulShutdown
E0-->>S: connectedSignal ✓
S->>E1: startTunnel(1)
Note over S: Sleep(registrationInterval)
S->>E2: startTunnel(2)
Note over S: Sleep(registrationInterval)
S->>E3: startTunnel(3)
Note over S: Enter main event loop
The first connection validates the entire configuration: credentials, tunnel ID, ingress rules, and protocol selection. Only after it successfully registers does the supervisor launch remaining HA connections. This prevents a misconfigured tunnel from hammering the edge with N simultaneous failed registration attempts.
Note the time.Sleep(registrationInterval) (1 second) between launching HA connections — this staggers registrations to avoid edge rate limiting.
The startFirstTunnel() method also has special retry logic. It retries on specific error types (duplicate connection, QUIC timeouts, dial errors, control stream errors) but bails on unknown errors. It also specifically retries on Unauthorized errors, hoping for edge propagation lag to resolve — a pragmatic choice for newly created tunnels.
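The "first tunnel, then the rest" ordering can be illustrated with a small sketch. The function names, the callback parameters, and the synchronous `startTunnel` call are assumptions made for brevity; in cloudflared each tunnel runs in its own goroutine and the first-tunnel retry logic described above is more involved.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBadCredentials is a hypothetical error used to simulate a misconfigured tunnel.
var errBadCredentials = errors.New("bad credentials")

// initialize blocks on the first connection and launches the remaining HA
// connections only if that one registered successfully.
func initialize(haConnections int, connectFirst func() error, startTunnel func(index int), interval time.Duration) error {
	if err := connectFirst(); err != nil {
		return fmt.Errorf("first tunnel failed: %w", err)
	}
	for i := 1; i < haConnections; i++ {
		startTunnel(i)
		time.Sleep(interval) // stagger registrations
	}
	return nil
}

// launched runs initialize with four HA connections and reports which
// indexes beyond the first were started.
func launched(firstErr error) ([]int, error) {
	var started []int
	err := initialize(4,
		func() error { return firstErr },
		func(i int) { started = append(started, i) },
		time.Millisecond)
	return started, err
}

func main() {
	ok, _ := launched(nil)
	fmt.Println(ok) // [1 2 3]

	bad, err := launched(errBadCredentials)
	fmt.Println(bad, err) // [] first tunnel failed: bad credentials
}
```

With a failing first connection, no further registration attempts are made, which is the property the design is after.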
EdgeTunnelServer: The Connection Lifecycle
The EdgeTunnelServer.Serve() method is where individual connections live their lifecycle. It is a long function that manages address assignment, protocol selection, connection establishment, error handling, and protocol fallback.
flowchart TD
Start["Serve(connIndex)"] --> GetAddr["edgeAddrs.GetAddr(connIndex)"]
GetAddr --> ServeTunnel["serveTunnel()"]
ServeTunnel --> ServeConn["serveConnection()"]
ServeConn --> Switch{"Protocol?"}
Switch -->|QUIC| QUIC["serveQUIC()"]
Switch -->|HTTP/2| H2["serveHTTP2()"]
QUIC --> ErrorCheck{"Error?"}
H2 --> ErrorCheck
ErrorCheck -->|nil| Done["Return nil"]
ErrorCheck -->|Error| RotateIP{"Should rotate IP?"}
RotateIP -->|Yes| NewIP["GetDifferentAddr()"]
RotateIP -->|No| Backoff["Wait backoff timer"]
NewIP --> FallbackCheck{"Max retries?"}
FallbackCheck -->|Yes| FallbackProto["shouldFallbackProtocol = true"]
FallbackCheck -->|No| Backoff
FallbackProto --> Backoff
Backoff --> ProtoFallback{"Should fallback protocol?"}
ProtoFallback -->|Yes| SelectNext["selectNextProtocol()"]
ProtoFallback -->|No| Return["Return error"]
SelectNext --> Return
The serveTunnel() → serveConnection() call chain is where the actual protocol connection happens. In serveConnection(), the switch on protocol type determines the connection path:
- QUIC: Dials a UDP connection, wraps it in a quic.Connection, creates the appropriate datagram handler (v2 or v3 based on feature flags), and serves via errgroup
- HTTP/2: Dials a TCP connection with TLS, creates an HTTP2Connection, and serves
Both paths use errgroup to run the connection serve loop alongside a reconnect listener, ensuring clean shutdown when either path errors.
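The "serve loop plus reconnect listener" pairing can be sketched as follows. errgroup lives in golang.org/x/sync; to keep this example dependency-free, it approximates the same first-error-cancels-the-rest behaviour with sync.WaitGroup and sync.Once instead. `serveWithReconnect`, `errReconnect`, and `demo` are hypothetical names, not cloudflared's.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

var errReconnect = errors.New("reconnect requested")

// serveWithReconnect runs serve alongside a listener on reconnectCh.
// The first non-nil error cancels the shared context and is returned.
func serveWithReconnect(ctx context.Context, serve func(context.Context) error, reconnectCh <-chan struct{}) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	run := func(f func() error) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := f(); err != nil {
				once.Do(func() { firstErr = err; cancel() })
			}
		}()
	}

	run(func() error {
		defer cancel() // when serving ends for any reason, release the listener
		return serve(ctx)
	})
	run(func() error {
		select {
		case <-reconnectCh:
			return errReconnect
		case <-ctx.Done():
			return nil
		}
	})

	wg.Wait()
	return firstErr
}

// demo returns the result for a connection that is either asked to
// reconnect or ends cleanly on its own.
func demo(reconnect bool) error {
	reconnectCh := make(chan struct{}, 1)
	if reconnect {
		reconnectCh <- struct{}{}
	}
	serve := func(ctx context.Context) error {
		if !reconnect {
			return nil // connection ended cleanly
		}
		<-ctx.Done() // keep serving until cancelled
		return ctx.Err()
	}
	return serveWithReconnect(context.Background(), serve, reconnectCh)
}

func main() {
	fmt.Println(demo(true), demo(false))
}
```

The caller sees `errReconnect` rather than the context error from the serve loop, because only the first error is recorded.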
Protocol Selection and Fallback
The protocol fallback system is one of cloudflared's most important reliability mechanisms. The protocolFallback struct wraps a BackoffHandler with protocol state:
type protocolFallback struct {
    retry.BackoffHandler
    protocol   connection.Protocol
    inFallback bool
}
The selectNextProtocol() function implements the fallback decision:
- If max retries are reached or QUIC appears broken (idle timeout, operation not permitted), attempt fallback
- The fallback chain is QUIC → HTTP/2 → none (defined in connection/protocol.go#L41-L50)
- If already using the fallback protocol, there's no further fallback: stop retrying
- An important guard: if any connection has already succeeded with the current protocol (checked via ConnTracker), don't fall back, because the protocol works and this is likely a transient edge issue
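The rules above can be condensed into a small decision function. The sketch below is a hypothetical reduction of that description, not the real selectNextProtocol(): the `protocol` and `state` types, the boolean inputs, and the placement of the ConnTracker guard inside the function are all assumptions made to keep the example self-contained.

```go
package main

import "fmt"

type protocol string

const (
	quic  protocol = "quic"
	http2 protocol = "http2"
)

// fallback returns the next protocol in the chain QUIC → HTTP/2 → none.
func (p protocol) fallback() (protocol, bool) {
	if p == quic {
		return http2, true
	}
	return "", false
}

// state is a simplified stand-in for per-connection fallback state.
type state struct {
	current    protocol
	inFallback bool
}

// selectNext reports whether the caller should keep retrying, switching
// s.current to the fallback protocol when the rules call for it.
func selectNext(s *state, maxRetriesReached, quicBroken, protocolHasConnected bool) bool {
	if !maxRetriesReached && !quicBroken {
		return true // nothing suggests the protocol itself is the problem
	}
	if protocolHasConnected {
		return true // another connection succeeded with this protocol: likely transient
	}
	if s.inFallback {
		return false // already on the fallback protocol, nothing left to try
	}
	next, ok := s.current.fallback()
	if !ok {
		return false
	}
	s.current, s.inFallback = next, true
	return true
}

func main() {
	s := &state{current: quic}
	fmt.Println(selectNext(s, false, true, false), s.current) // true http2
	fmt.Println(selectNext(s, true, false, false), s.current) // false http2
}
```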
The initial protocol selection in NewProtocolSelector() uses a hash-based approach: the account tag is hashed with FNV-32a, and the result modulo 100 produces a threshold that's compared against remote protocol percentages fetched via DNS TXT records. This enables Cloudflare to gradually roll out QUIC by increasing the QUIC percentage — accounts hash to a stable position, so they won't flip-flop between protocols.
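The stable-position idea is easy to reproduce with the standard library's hash/fnv package. In the sketch below, `bucket` and `selected` are hypothetical helper names, the account tag is a made-up value, and the strict `<` comparison is an assumption about how the threshold is applied.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps an account tag to a stable position in [0, 100).
func bucket(accountTag string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(accountTag)) // Write on a hash.Hash never returns an error
	return h.Sum32() % 100
}

// selected reports whether the account falls under a rollout percentage.
func selected(accountTag string, percentage uint32) bool {
	return bucket(accountTag) < percentage
}

func main() {
	tag := "example-account-tag" // hypothetical account tag
	fmt.Println(bucket(tag), selected(tag, 0), selected(tag, 100))
}
```

Because the bucket depends only on the account tag, the same account lands in the same position on every run and on every host.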
Tip: If QUIC connections keep failing, check if UDP port 7844 is being blocked by a firewall. The isQuicBroken() function specifically detects idle timeout errors and "operation not permitted" transport errors as indicators of UDP egress being blocked.
Edge Discovery via DNS
Before any connection can be established, cloudflared needs edge addresses. The edgediscovery package handles this through DNS SRV record resolution.
sequenceDiagram
participant S as Supervisor
participant E as Edge
participant DNS as DNS Resolver
S->>E: ResolveEdge(region, ipVersion)
E->>DNS: SRV lookup for<br/>_argotunnel._tcp
DNS-->>E: List of edge addresses
E->>E: Build address pool<br/>with per-connection allocation
S->>E: GetAddr(connIndex=0)
E-->>S: EdgeAddr{UDP, TCP}
Note over S: Connection fails...
S->>E: GetDifferentAddr(connIndex=0)
E->>E: GiveBack old address<br/>Mark connectivity error
E-->>S: New EdgeAddr
The Edge struct maintains a pool of addresses with per-connection allocation. Key methods:
- GetAddr(connIndex): Returns the address already assigned to this connection, or assigns a new unused one
- GetDifferentAddr(connIndex, hasConnectivityError): Gives back the current address and gets a new one; used on connection failures
- GiveBack(addr, hasConnectivityError): Returns an address to the pool, tagging it with whether there was a connectivity error
The ipAddrFallback handler in supervisor/tunnel.go#L129-L167 determines when to rotate addresses. Duplicate connection errors and QUIC idle timeouts trigger immediate rotation. Dial errors track retries per connection index and report a ConnectivityError with a max-retries flag — when max retries are reached, the next protocol fallback cycle is triggered.
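A toy version of such a pool makes the three methods concrete. This is a sketch under stated assumptions, not the edgediscovery implementation: addresses are plain strings, the `pool` type and `errNoAddressesLeft` are hypothetical, and preferring addresses without a recorded connectivity error is a guess at a reasonable policy.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoAddressesLeft = errors.New("no unused edge addresses left")

// pool is a simplified address pool with per-connection allocation.
type pool struct {
	addrs  []string        // all known edge addresses, in discovery order
	byConn map[int]string  // connection index -> assigned address
	failed map[string]bool // addresses given back after a connectivity error
}

func newPool(addrs ...string) *pool {
	return &pool{addrs: addrs, byConn: map[int]string{}, failed: map[string]bool{}}
}

// unused picks an unassigned address other than exclude, preferring
// addresses that have no recorded connectivity error.
func (p *pool) unused(exclude string) (string, error) {
	used := make(map[string]bool, len(p.byConn))
	for _, a := range p.byConn {
		used[a] = true
	}
	lastResort := ""
	for _, a := range p.addrs {
		if used[a] || a == exclude {
			continue
		}
		if !p.failed[a] {
			return a, nil
		}
		if lastResort == "" {
			lastResort = a
		}
	}
	if lastResort != "" {
		return lastResort, nil
	}
	return "", errNoAddressesLeft
}

// GetAddr returns the address already assigned to connIndex, or assigns one.
func (p *pool) GetAddr(connIndex int) (string, error) {
	if a, ok := p.byConn[connIndex]; ok {
		return a, nil
	}
	a, err := p.unused("")
	if err != nil {
		return "", err
	}
	p.byConn[connIndex] = a
	return a, nil
}

// GiveBack releases addr and records whether it had a connectivity error.
func (p *pool) GiveBack(addr string, hasConnectivityError bool) {
	for idx, a := range p.byConn {
		if a == addr {
			delete(p.byConn, idx)
		}
	}
	p.failed[addr] = hasConnectivityError
}

// GetDifferentAddr gives back the current address and assigns another one.
func (p *pool) GetDifferentAddr(connIndex int, hasConnectivityError bool) (string, error) {
	old, ok := p.byConn[connIndex]
	if ok {
		p.GiveBack(old, hasConnectivityError)
	}
	a, err := p.unused(old)
	if err != nil {
		return "", err
	}
	p.byConn[connIndex] = a
	return a, nil
}

func main() {
	p := newPool("edge-a", "edge-b", "edge-c")
	first, _ := p.GetAddr(0)
	rotated, _ := p.GetDifferentAddr(0, true)
	fmt.Println(first, rotated) // edge-a edge-b
}
```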
Post-Quantum TLS and Feature Flags
cloudflared supports post-quantum key exchange for QUIC connections, with platform-aware curve selection defined in supervisor/pqtunnels.go:
| Mode | Non-FIPS | FIPS |
|---|---|---|
| Strict | X25519MLKEM768 only | P256Kyber768Draft00 only |
| Prefer | X25519MLKEM768 + existing curves | P256Kyber768Draft00 + P256 |
The curvePreference() function selects curves based on the PQ mode and FIPS compliance. In strict mode, only PQ curves are allowed — this means connections will fail if the edge doesn't support the specific PQ curve. In prefer mode, PQ curves are prepended to the existing curve list, allowing graceful fallback.
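The table can be expressed as a small selection function. The sketch below models curves as plain strings so that it runs without any TLS library support for these key exchanges; the `pqMode` type, its constants, and this signature are hypothetical and are not the real curvePreference().

```go
package main

import "fmt"

type pqMode int

const (
	pqStrict pqMode = iota
	pqPrefer
)

// curvePreference returns the curve list for a PQ mode, following the table:
// strict allows only the PQ curve, prefer puts the PQ curve first.
func curvePreference(mode pqMode, fips bool, existing []string) []string {
	pq := "X25519MLKEM768"
	if fips {
		pq = "P256Kyber768Draft00"
	}
	switch {
	case mode == pqStrict:
		return []string{pq}
	case fips:
		return []string{pq, "P256"}
	default:
		out := []string{pq}
		for _, c := range existing {
			if c != pq { // avoid listing the PQ curve twice
				out = append(out, c)
			}
		}
		return out
	}
}

func main() {
	existing := []string{"X25519", "P256"} // hypothetical default curve list
	fmt.Println(curvePreference(pqStrict, false, existing)) // [X25519MLKEM768]
	fmt.Println(curvePreference(pqPrefer, false, existing)) // [X25519MLKEM768 X25519 P256]
	fmt.Println(curvePreference(pqPrefer, true, existing))  // [P256Kyber768Draft00 P256]
}
```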
The feature flag system that controls PQ mode and datagram version selection lives in features/selector.go. The FeatureSelector resolves DNS TXT records from cfd-features.argotunnel.com, parses a JSON payload with percentage fields, and uses per-account hashing to determine which features are enabled:
flowchart LR
DNS["DNS TXT<br/>cfd-features.argotunnel.com"] --> Parse["Parse JSON<br/>{dv3_2: 50}"]
Parse --> Hash["FNV-32a(accountTag) % 100"]
Hash --> Compare{"accountHash < percentage?"}
Compare -->|Yes| Enabled["Feature Enabled<br/>(DatagramV3)"]
Compare -->|No| Disabled["Feature Disabled<br/>(DatagramV2)"]
The selector refreshes hourly (defaultLookupFreq = time.Hour) in a background goroutine. This means feature flag changes propagate within an hour without requiring tunnel restarts. The hash-based threshold ensures that when Cloudflare bumps a percentage from 30% to 50%, only accounts hashing between 30-49 are newly affected — accounts already enabled stay enabled.
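The 30% to 50% arithmetic can be checked directly. The sketch below parses a payload shaped like the one in the diagram and lists which of the 100 hash buckets flip when the percentage rises. The `featuresRecord` struct and both helper functions are hypothetical, and only the `dv3_2` field shown above is modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// featuresRecord models only the dv3_2 percentage field of the TXT payload.
type featuresRecord struct {
	DatagramV3Percentage int `json:"dv3_2"`
}

// percentage extracts the datagram v3 rollout percentage from a JSON payload.
func percentage(payload []byte) (int, error) {
	var rec featuresRecord
	if err := json.Unmarshal(payload, &rec); err != nil {
		return 0, err
	}
	return rec.DatagramV3Percentage, nil
}

// newlyEnabled lists the hash buckets that become enabled when the
// percentage moves from oldPct to newPct, assuming "bucket < percentage".
func newlyEnabled(oldPct, newPct int) []int {
	var buckets []int
	for b := 0; b < 100; b++ {
		if b >= oldPct && b < newPct {
			buckets = append(buckets, b)
		}
	}
	return buckets
}

func main() {
	pct, err := percentage([]byte(`{"dv3_2": 50}`))
	if err != nil {
		panic(err)
	}
	flipped := newlyEnabled(30, pct)
	fmt.Println(len(flipped), flipped[0], flipped[len(flipped)-1]) // 20 30 49
}
```

Every bucket below 30 satisfies both thresholds, which is why accounts that were already enabled stay enabled.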
Tip: You can override feature flags locally using the --features CLI flag. For example, --features datagram-v3-2 forces datagram v3 regardless of the remote percentage.
What's Next
Now that you understand how the Supervisor manages connections and handles fallback, the next article dives into what happens inside those connections. We'll explore the QUIC connection's three-goroutine errgroup architecture, the HTTP/2 connection's handler-based model, and the Cap'n Proto control stream that registers tunnels with the edge.