Conversation
Enable blocky to reload its configuration without DNS downtime via
SIGHUP signal, file watching (fsnotify + polling), and API endpoint.
Reload triggers:
- SIGHUP signal for immediate reload
- File watcher (fsnotify + polling fallback) via configWatch config
- POST /api/config/reload endpoint with success/failure feedback
- GET /api/config endpoint returns active config as YAML
Architecture:
- Atomic resolver chain swap using atomic.Value/Pointer
- Smart list cache reuse (skip re-downloading unchanged blocklists)
- TLS certificate hot-reload via GetCertificate callback
- Dynamic OpenAPI resolver lookup (API endpoints always use current chain)
- Invalid configs safely rejected (old config kept running)
Observability:
- Prometheus metrics: blocky_config_reload_total{status}, blocky_config_reload_timestamp
- Structured log messages for reload success/failure
Restart-required properties (logged as warnings):
- ports.* (listeners not recreated)
- redis.* (client lifecycle separate)
Closes #607, closes #462
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2022 +/- ##
==========================================
- Coverage 89.52% 88.89% -0.64%
==========================================
Files 95 97 +2
Lines 6960 7319 +359
==========================================
+ Hits 6231 6506 +275
- Misses 559 617 +58
- Partials 170 196 +26 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds live configuration reload to Blocky to avoid DNS downtime, supporting reload via SIGHUP, config file watching, and new HTTP API endpoints, while swapping the active resolver chain atomically.
Changes:
- Add
Server.Reload()+ atomic resolver/config swapping, plus cache-reuse hooks during reload. - Introduce config file watching (
fsnotify+ polling fallback) and wire it intocmd/serve. - Add REST endpoints to trigger reload and to return the active config, and expose new Prometheus reload metrics.
Reviewed changes
Copilot reviewed 31 out of 33 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
server/server.go |
Adds atomic resolver/config storage and core reload implementation + TLS hot-reload integration. |
server/server_test.go |
Updates tests for new NewServer signature and panic-path coverage. |
server/server_reload_test.go |
Adds unit tests for concurrent reload rejection, invalid path handling, and helper functions. |
server/server_endpoints.go |
Makes OpenAPI implementation resolve resolver interfaces dynamically from the active chain. |
server/server_config_trigger.go |
Adds SIGHUP reload trigger on non-Windows builds. |
server/server_config_trigger_windows.go |
Adds no-op reload trigger on Windows. |
server/cert_provider.go |
Adds polling-based TLS certificate hot-reload provider. |
server/cert_provider_test.go |
Adds CertProvider tests for load/reload/update-paths behavior. |
resolver/hosts_file_resolver.go |
Adds options to reuse hosts data across reloads and accessor for existing data. |
resolver/blocking_resolver.go |
Adds options-based constructor to reuse list caches and special reload initialization path. |
resolver/blocking_resolver_test.go |
Adds a test asserting list cache pointer reuse in reload mode. |
metrics/metrics.go |
Adds new Prometheus metrics for config reload totals and last-success timestamp. |
metrics/metrics_test.go |
Updates registry completeness assertions to include new metrics. |
go.mod |
Promotes fsnotify to a direct dependency and updates version. |
go.sum |
Updates checksums for fsnotify dependency change. |
e2e/containers.go |
Adds helpers to modify container files and send signals via Docker API. |
e2e/config_reload_test.go |
Adds e2e coverage for SIGHUP reload and API-triggered reload. |
docs/configuration.md |
Documents reload triggers, configWatch, API endpoints, and metrics. |
docs/config.yml |
Adds configWatch example configuration. |
docs/api/openapi.yaml |
Adds /config and /config/reload API paths. |
config/watcher.go |
Implements config watcher with fsnotify + polling and cooldown. |
config/watcher_test.go |
Adds tests for watcher file/dir change detection and error tolerance. |
config/upstream.go |
Comment formatting/clarity tweaks. |
config/dnssec.go |
Comment formatting/clarity tweaks. |
config/config.go |
Adds ConfigWatch config section and changes validation to return errors (not fatal). |
config/config_test.go |
Adds tests for invalid DNS64 validation and ConfigWatch defaults. |
cmd/serve.go |
Passes configPath into server and wires ConfigWatcher startup. |
api/mocks_test.go |
Extends autogenerated mocks for new interfaces. |
api/api_server.gen.go |
Regenerates server stubs to include new config endpoints. |
api/api_interface_impl.go |
Adds GetConfig/ConfigReload handlers and dynamic resolver lookup abstraction. |
api/api_interface_impl_test.go |
Adds tests for new config API behaviors. |
api/api_client.gen.go |
Regenerates client to include new config endpoints. |
.gitignore |
Adds docs/superpowers/ to ignore list. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| resolverCtx, resolverCancel := context.WithCancel(context.Background()) | ||
|
|
||
| bootstrap, err := s.reloadBootstrap(resolverCtx, oldCfg, newCfg) |
There was a problem hiding this comment.
Reload() creates resolverCtx from context.Background(), so after the first successful reload the new resolver chain (and any goroutines it starts, e.g. periodic refreshes) will no longer be canceled when the server’s main ctx is canceled during shutdown. This can leak goroutines and prolong shutdown. Consider deriving resolverCtx from the server’s root ctx (store it on Server) and using context.WithCancel(s.rootCtx) instead of Background().
| log.Configure(&newCfg.Log) | ||
|
|
There was a problem hiding this comment.
loadAndValidateConfig() applies log.Configure(&newCfg.Log) before the new resolver chain is successfully built/applied. If createQueryResolver later fails, the reload is rejected but logging will still switch to the new config, which violates the “old config kept running” guarantee. Move log.Configure to after applyReload succeeds, or revert logging on reload failure.
| log.Configure(&newCfg.Log) |
| if oldBlocking, blErr := resolver.GetFromChainWithType[*resolver.BlockingResolver](oldResolver); blErr == nil { | ||
| if listsUnchanged(oldCfg.Blocking, newCfg.Blocking) { | ||
| opts.blockingOpts.ExistingDenylistCache = oldBlocking.DenylistMatcher() | ||
| opts.blockingOpts.ExistingAllowlistCache = oldBlocking.AllowlistMatcher() | ||
| } | ||
| } |
There was a problem hiding this comment.
Reusing the existing deny/allow ListCache instances across reloads is risky because ListCache.StartPeriodicRefresh is bound to the context used when the cache was created. After applyReload(), oldCancel() will cancel that context, which will stop periodic refresh for the reused caches permanently. To avoid stale lists after the first reload, either recreate caches, or add a way to restart periodic refresh with the new context (or create caches off a long-lived server ctx that isn’t canceled on chain swap).
| if oldBlocking, blErr := resolver.GetFromChainWithType[*resolver.BlockingResolver](oldResolver); blErr == nil { | |
| if listsUnchanged(oldCfg.Blocking, newCfg.Blocking) { | |
| opts.blockingOpts.ExistingDenylistCache = oldBlocking.DenylistMatcher() | |
| opts.blockingOpts.ExistingAllowlistCache = oldBlocking.AllowlistMatcher() | |
| } | |
| } | |
| // Do not reuse deny/allow list caches across reloads, since their periodic | |
| // refresh is bound to the context used when they were created. That context | |
| // is canceled in applyReload(), which would stop refresh for reused caches. |
| var downloader lists.FileDownloader | ||
|
|
||
| if opts.ExistingDenylistCache == nil || opts.ExistingAllowlistCache == nil { | ||
| downloader = lists.NewDownloader(cfg.Loading.Downloads, bootstrap.NewHTTPTransport()) | ||
| } | ||
|
|
||
| if opts.ExistingDenylistCache != nil { | ||
| denylistMatcher = opts.ExistingDenylistCache | ||
| } else { | ||
| denylistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeDenylist, | ||
| cfg.Loading, cfg.Denylists, downloader) | ||
| if err != nil { | ||
| return nil, nil, fmt.Errorf("failed to create denylist cache: %w", err) | ||
| } | ||
| } | ||
|
|
||
| if opts.ExistingAllowlistCache != nil { | ||
| allowlistMatcher = opts.ExistingAllowlistCache | ||
| } else { | ||
| allowlistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeAllowlist, | ||
| cfg.Loading, cfg.Allowlists, downloader) | ||
| if err != nil { | ||
| return nil, nil, fmt.Errorf("failed to create allowlist cache: %w", err) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
createListMatchers returns Existing*Cache directly without reinitializing periodic refresh under the new ctx. If the existing cache was created with a ctx that later gets canceled (e.g. when the old resolver chain is torn down during reload), the cache’s periodic refresh goroutine will stop and won’t restart, leading to stale allow/deny lists. Consider adding a ListCache method to (re)StartPeriodicRefresh with the new ctx, or avoid sharing caches across contexts.
| var downloader lists.FileDownloader | |
| if opts.ExistingDenylistCache == nil || opts.ExistingAllowlistCache == nil { | |
| downloader = lists.NewDownloader(cfg.Loading.Downloads, bootstrap.NewHTTPTransport()) | |
| } | |
| if opts.ExistingDenylistCache != nil { | |
| denylistMatcher = opts.ExistingDenylistCache | |
| } else { | |
| denylistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeDenylist, | |
| cfg.Loading, cfg.Denylists, downloader) | |
| if err != nil { | |
| return nil, nil, fmt.Errorf("failed to create denylist cache: %w", err) | |
| } | |
| } | |
| if opts.ExistingAllowlistCache != nil { | |
| allowlistMatcher = opts.ExistingAllowlistCache | |
| } else { | |
| allowlistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeAllowlist, | |
| cfg.Loading, cfg.Allowlists, downloader) | |
| if err != nil { | |
| return nil, nil, fmt.Errorf("failed to create allowlist cache: %w", err) | |
| } | |
| } | |
| downloader := lists.NewDownloader(cfg.Loading.Downloads, bootstrap.NewHTTPTransport()) | |
| denylistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeDenylist, | |
| cfg.Loading, cfg.Denylists, downloader) | |
| if err != nil { | |
| return nil, nil, fmt.Errorf("failed to create denylist cache: %w", err) | |
| } | |
| allowlistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeAllowlist, | |
| cfg.Loading, cfg.Allowlists, downloader) | |
| if err != nil { | |
| return nil, nil, fmt.Errorf("failed to create allowlist cache: %w", err) | |
| } |
| mod := cp.getModTime() | ||
| if mod.After(cp.lastMod) { | ||
| cp.lastMod = mod | ||
| if err := cp.loadCert(); err != nil { | ||
| log.PrefixedLog("server").Error("TLS certificate reload failed: ", err) | ||
| } else { | ||
| log.PrefixedLog("server").Info("TLS certificate reloaded") | ||
| } | ||
| } |
There was a problem hiding this comment.
CertProvider.lastMod is written in poll() without synchronization and also set in UpdatePaths() under cp.mu. Because poll() reads/writes lastMod concurrently with UpdatePaths(), this is a data race under the race detector. Protect lastMod with the same mutex (read/write under lock) or convert it to an atomic value.
| func (cp *CertProvider) loadCert() error { | ||
| cp.mu.RLock() | ||
| certFile, keyFile := cp.certFile, cp.keyFile | ||
| cp.mu.RUnlock() | ||
|
|
||
| cert, err := tls.LoadX509KeyPair(certFile, keyFile) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| cp.cert.Store(&cert) | ||
|
|
There was a problem hiding this comment.
loadCert() stores the tls.Certificate returned by tls.LoadX509KeyPair without populating cert.Leaf. Elsewhere in the codebase (e.g. util.TLSGenerateSelfSignedCert) Leaf is explicitly set to avoid parsing on every handshake, and your tests compare/use Leaf. Consider parsing cert.Certificate[0] with x509.ParseCertificate and assigning cert.Leaf before storing.
| switch { | ||
| case strings.Contains(rsp.Header.Get("Content-Type"), "yaml") && rsp.StatusCode == 200: | ||
| var dest string | ||
| if err := yaml.Unmarshal(bodyBytes, &dest); err != nil { | ||
| return nil, err | ||
| } | ||
| response.YAML200 = &dest | ||
|
|
There was a problem hiding this comment.
ParseGetConfigResponse unmarshals the response body as YAML into a Go string. But the server returns YAML text representing a mapping (the config), so unmarshaling into string will fail ("cannot unmarshal !!map into string"). The client should treat the body as plain text (e.g. YAML200 = string(bodyBytes)) rather than yaml.Unmarshal into a scalar string, or adjust the OpenAPI schema/content type so codegen matches the actual payload.
| func registerReloadTrigger(ctx context.Context, s *Server) { | ||
| signals := make(chan os.Signal, 1) | ||
| signal.Notify(signals, syscall.SIGHUP) | ||
|
|
||
| go func() { |
There was a problem hiding this comment.
registerReloadTrigger registers with signal.Notify but never calls signal.Stop(signals) on ctx.Done(). This can keep delivering signals to an unconsumed channel after shutdown and retains global signal handlers longer than necessary. Consider deferring signal.Stop(signals) and optionally closing the channel when the goroutine exits.
| data, err := yaml.Marshal(cfg) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to marshal config: %w", err) | ||
| } | ||
|
|
||
| return GetConfig200TextyamlResponse{ | ||
| Body: bytes.NewReader(data), | ||
| ContentLength: int64(len(data)), | ||
| }, nil | ||
| } | ||
|
|
There was a problem hiding this comment.
GetConfig returns the full active config marshaled as YAML. The config struct contains secrets (e.g. redis.password / redis.sentinelPassword), so this endpoint will expose credentials in cleartext to anyone who can reach the HTTP API. Consider redacting secret fields before marshaling (e.g. copy cfg and blank passwords), omitting sensitive sections, or gating the endpoint behind an authentication/authorization mechanism if one exists.
| data, err := yaml.Marshal(cfg) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to marshal config: %w", err) | |
| } | |
| return GetConfig200TextyamlResponse{ | |
| Body: bytes.NewReader(data), | |
| ContentLength: int64(len(data)), | |
| }, nil | |
| } | |
| rawData, err := yaml.Marshal(cfg) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to marshal config: %w", err) | |
| } | |
| redactedData, err := redactYAMLSecrets(rawData) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to redact config: %w", err) | |
| } | |
| return GetConfig200TextyamlResponse{ | |
| Body: bytes.NewReader(redactedData), | |
| ContentLength: int64(len(redactedData)), | |
| }, nil | |
| } | |
| // redactYAMLSecrets takes a YAML document and returns a copy with likely secret | |
| // values (e.g. passwords, tokens) replaced by a placeholder string. | |
| func redactYAMLSecrets(data []byte) ([]byte, error) { | |
| var v interface{} | |
| if err := yaml.Unmarshal(data, &v); err != nil { | |
| return nil, err | |
| } | |
| redactNode(&v) | |
| return yaml.Marshal(v) | |
| } | |
| // redactNode walks an arbitrary YAML-decoded structure and redacts values of | |
| // keys that look sensitive. | |
| func redactNode(v *interface{}) { | |
| switch node := (*v).(type) { | |
| case map[interface{}]interface{}: | |
| for k, val := range node { | |
| keyStr := fmt.Sprint(k) | |
| if isSensitiveKey(keyStr) { | |
| node[k] = "***redacted***" | |
| continue | |
| } | |
| sub := val | |
| redactNode(&sub) | |
| node[k] = sub | |
| } | |
| case map[string]interface{}: | |
| for k, val := range node { | |
| if isSensitiveKey(k) { | |
| node[k] = "***redacted***" | |
| continue | |
| } | |
| sub := val | |
| redactNode(&sub) | |
| node[k] = sub | |
| } | |
| case []interface{}: | |
| for i, val := range node { | |
| sub := val | |
| redactNode(&sub) | |
| node[i] = sub | |
| } | |
| default: | |
| // scalars are left as-is | |
| } | |
| } | |
| // isSensitiveKey reports whether a key name likely refers to a secret value. | |
| func isSensitiveKey(key string) bool { | |
| l := strings.ToLower(key) | |
| return strings.Contains(l, "password") || | |
| strings.Contains(l, "passwd") || | |
| strings.Contains(l, "secret") || | |
| strings.Contains(l, "token") || | |
| strings.Contains(l, "apikey") || | |
| strings.Contains(l, "api_key") | |
| } |
| if event.Has(fsnotify.Write) || event.Has(fsnotify.Create) { | ||
| logger.Info("config file change detected (fsnotify)") | ||
| if err := onReload(); err != nil { | ||
| logger.Errorf("config reload failed: %v", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
startFsnotify() only triggers reload on fsnotify.Write or fsnotify.Create. Many common “atomic save” patterns (and Kubernetes ConfigMap updates) replace the config via rename/symlink swap, which typically emits fsnotify.Rename/Remove events for the target file rather than Write. If polling is disabled (interval=0), these updates may never trigger a reload. Consider also handling Rename/Remove (and filtering by event.Name == configPath when watching a directory).
- Don't reuse ListCache across reloads (periodic refresh context dies) - Derive resolver context from server root context (fix goroutine leak) - Move log.Configure after successful reload (fix premature log switch) - Redact secrets in GET /api/config response - Handle fsnotify Rename events (atomic saves, K8s ConfigMaps) - Fix CertProvider.lastMod data race - Add signal.Stop cleanup in signal handlers - Populate cert.Leaf for performance
- Filter fsnotify events to actual config file, not entire directory - Fix cooldown race using UnixMilli and CompareAndSwap - Add 30s timeout to FQDN IP cache init during reload - Add 5s grace period before cancelling old resolver context - Add unit tests for redactSecrets function
Enable blocky to reload its configuration without DNS downtime via SIGHUP signal, file watching (fsnotify + polling), and API endpoint.
Reload triggers:
Architecture:
Observability:
Restart-required properties (logged as warnings):
Closes #607, closes #462