Dynamic Configuration: The Orchestrator, Feature Flags, and Runtime Updates
Prerequisites
- Article 1: Cloudflared Architecture Overview
- Article 4: Traffic Routing and the Proxy Layer
- Understanding of Go's sync/atomic package and atomic.Value semantics
Most reverse proxies require a restart to pick up configuration changes. Cloudflared doesn't. When the Cloudflare edge pushes a new configuration — new ingress rules, changed origin settings, updated flow limits — cloudflared applies it without dropping a single in-flight request. The mechanism behind this is the Orchestrator's copy-on-write proxy pattern, one of the most elegant design decisions in the codebase.
This article dissects how runtime configuration updates work, from the four-layer configuration hierarchy through the atomic proxy swap, and then explores the DNS-based feature flag system that controls gradual rollouts of new capabilities like datagram v3.
The Configuration Hierarchy
Cloudflared merges configuration from four sources, with a strict priority ordering:
flowchart TD
CLI[CLI Flags<br/>Highest Priority] --> Env[Environment Variables<br/>TUNNEL_* prefix]
Env --> File[Config File<br/>YAML in search paths]
File --> Remote[Remote Edge Config<br/>Lowest Priority]
CLI --> Merge[prepareTunnelConfig merges all layers]
Env --> Merge
File --> Merge
Remote --> Merge
Merge --> TC[supervisor.TunnelConfig]
Merge --> OC[orchestration.Config]
style CLI fill:#dc2626,color:#fff
style Remote fill:#059669,color:#fff
CLI flags have the highest priority. The prepareTunnelConfig function reads them via urfave/cli's c.String(), c.Int(), etc.
Environment variables are handled through urfave/cli's altsrc package, which maps TUNNEL_* prefixed env vars to their corresponding flags. This happens transparently during flag parsing.
Config files are searched in DefaultConfigSearchDirectories: ~/.cloudflared, ~/.cloudflare-warp, ~/cloudflare-warp, /etc/cloudflared, and /usr/local/etc/cloudflared. The search accepts both config.yml and config.yaml.
Remote edge configuration has the lowest priority — it can be overridden by any local setting. This is implemented in the Orchestrator's overrideRemoteWarpRoutingWithLocalValues method, which checks local configuration flags and overwrites remote values when local overrides exist.
Tip: When troubleshooting config issues, remember that a --token-based tunnel with no local config file gets its ingress rules entirely from the Cloudflare dashboard. Add a config.yml with the --config flag to override specific settings while keeping the remote management benefits.
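The priority rule can be sketched as a first-non-empty pick across the four layers. This is a simplification: the real merge in prepareTunnelConfig handles each typed setting individually, and firstNonEmpty is a hypothetical helper, not a cloudflared function.

```go
package main

import "fmt"

// firstNonEmpty returns the first layer that actually set a value,
// scanning from highest to lowest priority.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	cli := ""      // no --loglevel flag passed
	env := "debug" // TUNNEL_LOGLEVEL=debug
	file := "info" // loglevel: info in config.yml
	remote := "warn"

	// CLI > env > file > remote: env wins here.
	fmt.Println(firstNonEmpty(cli, env, file, remote)) // prints debug
}
```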
Orchestrator: The Copy-on-Write Proxy Pattern
The Orchestrator is cloudflared's configuration coordinator. Its most important field is proxy atomic.Value, which stores the current *proxy.Proxy instance:
type Orchestrator struct {
currentVersion int32
lock sync.RWMutex
proxy atomic.Value // Stores *proxy.Proxy
internalRules []ingress.Rule
config *Config
flowLimiter cfdflow.Limiter
// ...
}
The split between atomic.Value for the proxy and sync.RWMutex for everything else is deliberate. The proxy is read on every single request via GetOriginProxy():
func (o *Orchestrator) GetOriginProxy() (connection.OriginProxy, error) {
val := o.proxy.Load()
// ... type assertion ...
return proxy, nil
}
This is a completely lock-free read — no mutex, no contention, no blocking. The mutex is only held during configuration updates, which happen orders of magnitude less frequently than request processing.
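The pattern can be reduced to a minimal sketch, with illustrative type and field names rather than cloudflared's actual types: readers do a single atomic load, and only writers take the mutex before building and storing a new value.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// proxy stands in for *proxy.Proxy; the rules field is illustrative.
type proxy struct{ rules []string }

type orchestrator struct {
	lock  sync.Mutex   // serializes writers only
	value atomic.Value // stores *proxy; readers never lock
}

// get is the hot path: one atomic load, no mutex, no contention.
func (o *orchestrator) get() *proxy {
	p, ok := o.value.Load().(*proxy)
	if !ok {
		return nil
	}
	return p
}

// update is the cold path: build the replacement under the lock,
// then publish it with a single atomic store.
func (o *orchestrator) update(rules []string) {
	o.lock.Lock()
	defer o.lock.Unlock()
	o.value.Store(&proxy{rules: rules})
}

func main() {
	o := &orchestrator{}
	o.update([]string{"http://localhost:8080"})
	fmt.Println(o.get().rules[0]) // prints http://localhost:8080
}
```

Note that atomic.Value requires every Store to use the same concrete type, which is why the real code always stores a *proxy.Proxy.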
sequenceDiagram
participant R1 as Request Goroutine 1
participant R2 as Request Goroutine 2
participant Orch as Orchestrator
participant Edge as Edge Config Push
R1->>Orch: GetOriginProxy() [atomic.Load]
Note over R1,Orch: Lock-free read!
R2->>Orch: GetOriginProxy() [atomic.Load]
Edge->>Orch: UpdateConfig(version, config)
Note over Orch: lock.Lock()
Orch->>Orch: Build new proxy
Orch->>Orch: proxy.Store(newProxy) [atomic]
Note over Orch: lock.Unlock()
R1->>Orch: GetOriginProxy() [returns new proxy]
Compare this to an RWMutex approach where every request would need to acquire a read lock. Under high concurrency, even read locks create contention from cache-line bouncing. The atomic.Value pattern eliminates this entirely.
Versioned Config Updates from the Edge
The UpdateConfig method handles configuration updates pushed from the Cloudflare edge:
func (o *Orchestrator) UpdateConfig(version int32, config []byte) *pogs.UpdateConfigurationResponse {
o.lock.Lock()
defer o.lock.Unlock()
if o.currentVersion >= version {
return &pogs.UpdateConfigurationResponse{
LastAppliedVersion: o.currentVersion,
}
}
// ... deserialize and apply ...
}
The version check is critical. Because cloudflared maintains up to four connections to the edge, multiple connections might push the same configuration update concurrently. The version guard ensures idempotency — only the first delivery of a new version triggers an actual update.
The initial version is -1, which means the very first remote configuration (version 0) will always be accepted. This enables the migration path from local-only to remote-managed tunnels.
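The guard's behavior can be sketched in isolation. The currentVersion field and -1 sentinel mirror the article; the applied counter is added purely to make the idempotency visible.

```go
package main

import (
	"fmt"
	"sync"
)

type orchestrator struct {
	lock           sync.Mutex
	currentVersion int32
	applied        int // counts real updates, for illustration only
}

// updateConfig applies a pushed config only if its version is newer,
// returning the version actually in effect.
func (o *orchestrator) updateConfig(version int32) int32 {
	o.lock.Lock()
	defer o.lock.Unlock()
	if o.currentVersion >= version {
		return o.currentVersion // stale or duplicate push: no-op
	}
	o.currentVersion = version
	o.applied++
	return o.currentVersion
}

func main() {
	o := &orchestrator{currentVersion: -1} // -1 so version 0 is accepted
	o.updateConfig(0)
	o.updateConfig(0) // same version from another edge connection: ignored
	o.updateConfig(1)
	fmt.Println(o.applied) // prints 2
}
```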
Zero-Downtime Ingress Updates
The updateIngress method is where the magic happens. The key insight is in the ordering of operations:
sequenceDiagram
participant Orch as Orchestrator
participant NewOrigins as New Origins
participant OldProxy as Old Proxy
participant NewProxy as New Proxy
Orch->>Orch: Create proxyShutdownC channel
Orch->>NewOrigins: StartOrigins(log, proxyShutdownC)
Note over NewOrigins: New origins are running!
Orch->>Orch: flowLimiter.SetLimit(new limit)
Orch->>Orch: originDialerService.UpdateDefaultDialer(new settings)
Orch->>NewProxy: proxy.NewOriginProxy(newRules, ...)
Orch->>Orch: proxy.Store(newProxy) ← atomic swap
Note over Orch: New requests go to new proxy
Note over OldProxy: In-flight requests still completing on old proxy
Orch->>Orch: close(old proxyShutdownC)
Note over OldProxy: Old origins begin shutdown
Start new origins before stopping old ones. The comment in the source code explains the tradeoff:
// Start new proxy before closing the ones from last version.
// The upside is we don't need to restart proxy from last version, which can fail
// The downside is new version might have ingress rule that require previous version
// to be shutdown first
// The downside is minimized because none of the ingress.OriginService
// implementations have that requirement
This creates a brief window where both old and new origins are running. New requests immediately go to the new proxy (via the atomic store), while in-flight requests on the old proxy complete naturally. When the old proxyShutdownC is closed, the old origins begin their shutdown sequence.
The waitToCloseLastProxy goroutine (spawned at construction time) ensures the final proxy's origins are cleaned up when cloudflared shuts down.
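The ordering can be modeled with per-generation shutdown channels. This is a toy sketch: origins here are just labels, not real ingress.OriginService implementations, and the atomic proxy store is elided.

```go
package main

import "fmt"

// generation pairs a proxy version with the shutdown channel its
// origins listen on.
type generation struct {
	name      string
	shutdownC chan struct{}
}

// startOrigins stands in for ingress.Rules.StartOrigins.
func startOrigins(name string) *generation {
	return &generation{name: name, shutdownC: make(chan struct{})}
}

func main() {
	old := startOrigins("v1")

	// 1. Start the new generation's origins while v1 still serves.
	next := startOrigins("v2")

	// 2. Atomically swap the active proxy (proxy.Store in the real
	//    code), so new requests hit v2 while v1 finishes in-flight work.
	active := next

	// 3. Only then signal the old generation to shut down.
	close(old.shutdownC)
	<-old.shutdownC // receive on a closed channel returns immediately

	fmt.Println("serving:", active.name) // prints serving: v2
}
```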
Feature Flags via DNS TXT Records
Cloudflared's featureSelector implements a DNS-based feature flag system. It queries a TXT record at cfd-features.argotunnel.com to get a JSON payload with rollout percentages:
type featuresRecord struct {
DatagramV3Percentage uint32 `json:"dv3_2"`
}
flowchart TD
Start[featureSelector created] --> Hash[FNV32a hash of accountTag]
Hash --> Mod[hash % 100 = accountHash]
Mod --> DNS[DNS TXT lookup: cfd-features.argotunnel.com]
DNS --> Parse[Parse JSON: dv3_2 percentage]
Parse --> Compare{percentage > accountHash?}
Compare -->|Yes| Enable[Feature enabled for this account]
Compare -->|No| Disable[Feature disabled]
Refresh[Hourly refresh loop] -->|Every hour| DNS
The switchThreshold function creates a deterministic bucket for each account:
func switchThreshold(accountTag string) uint32 {
h := fnv.New32a()
_, _ = h.Write([]byte(accountTag))
return h.Sum32() % 100
}
This means an account always lands in the same bucket (0-99). Setting dv3_2 to 50 in the DNS record enables datagram v3 for roughly 50% of accounts: specifically, those whose FNV-32a hash mod 100 is less than 50.
The elegance of this approach is that Cloudflare can control rollouts globally without shipping new client code. To enable datagram v3 for 10% of accounts, they set the TXT record to {"dv3_2": 10}. To do a full rollout, set it to 100. To emergency-rollback, set it to 0. All changes propagate within the DNS TTL (1 hour by default, with a 10-second lookup timeout).
The feature selector also supports CLI overrides — if a user passes --features datagram-v3-2 on the command line, it takes priority over the DNS-based evaluation.
Config File Hot-Reload in Service Mode
When cloudflared runs as a system service (no arguments), it operates in service mode using the handleServiceMode function. This mode creates a three-component pipeline for config file watching:
sequenceDiagram
participant FS as File System
participant Watcher as watcher.File
participant CM as config.FileManager
participant OM as overwatch.AppManager
participant App as AppService
FS->>Watcher: Config file changed
Watcher->>CM: File event
CM->>CM: Re-read config
CM->>OM: Configuration update
OM->>App: Restart tunnel service
Note over App: Old tunnel stops, new tunnel starts
f, err := watcher.NewFile()          // filesystem event watcher
configManager, err := config.NewFileManager(f, configPath, log) // re-reads the file on change
serviceManager := overwatch.NewAppManager(serviceCallback)      // restarts the tunnel service
appService := NewAppService(configManager, serviceManager, shutdownC, log)
Unlike the Orchestrator's zero-downtime swap (which preserves in-flight requests), service mode hot-reload is a full restart — it stops the old tunnel and starts a new one. This is appropriate for the service mode use case where the config file defines the tunnel identity itself, not just the routing rules.
Tip: For zero-downtime config changes in production, use Cloudflare's dashboard to manage your tunnel configuration remotely. The Orchestrator's atomic swap handles these updates without dropping connections. Service mode file-watching is more suited for development and testing scenarios.
What's Next
We've covered cloudflared's runtime adaptability — from the elegant copy-on-write proxy swap to DNS-based feature flags. In our final article, we'll survey the cross-cutting operational concerns: the Observer event system, structured logging with ConnAwareLogger, Prometheus metrics, the in-tunnel management service, runtime diagnostics, and the build system's FIPS/version handling.