Skip to content

feat: add DNS config reload without restart (#607, #462)#2022

Draft
0xERR0R wants to merge 3 commits into
mainfrom
reload
Draft

feat: add DNS config reload without restart (#607, #462)#2022
0xERR0R wants to merge 3 commits into
mainfrom
reload

Conversation

@0xERR0R

@0xERR0R 0xERR0R commented Mar 20, 2026

Copy link
Copy Markdown
Owner

Enable blocky to reload its configuration without DNS downtime via SIGHUP signal, file watching (fsnotify + polling), and API endpoint.

Reload triggers:

  • SIGHUP signal for immediate reload
  • File watcher (fsnotify + polling fallback) via configWatch config
  • POST /api/config/reload endpoint with success/failure feedback
  • GET /api/config endpoint returns active config as YAML

Architecture:

  • Atomic resolver chain swap using atomic.Value/Pointer
  • Smart list cache reuse (skip re-downloading unchanged blocklists)
  • TLS certificate hot-reload via GetCertificate callback
  • Dynamic OpenAPI resolver lookup (API endpoints always use current chain)
  • Invalid configs safely rejected (old config kept running)

Observability:

  • Prometheus metrics: blocky_config_reload_total{status}, blocky_config_reload_timestamp
  • Structured log messages for reload success/failure

Restart-required properties (logged as warnings):

  • ports.* (listeners not recreated)
  • redis.* (client lifecycle separate)

Closes #607, closes #462

Enable blocky to reload its configuration without DNS downtime via
SIGHUP signal, file watching (fsnotify + polling), and API endpoint.

Reload triggers:
- SIGHUP signal for immediate reload
- File watcher (fsnotify + polling fallback) via configWatch config
- POST /api/config/reload endpoint with success/failure feedback
- GET /api/config endpoint returns active config as YAML

Architecture:
- Atomic resolver chain swap using atomic.Value/Pointer
- Smart list cache reuse (skip re-downloading unchanged blocklists)
- TLS certificate hot-reload via GetCertificate callback
- Dynamic OpenAPI resolver lookup (API endpoints always use current chain)
- Invalid configs safely rejected (old config kept running)

Observability:
- Prometheus metrics: blocky_config_reload_total{status}, blocky_config_reload_timestamp
- Structured log messages for reload success/failure

Restart-required properties (logged as warnings):
- ports.* (listeners not recreated)
- redis.* (client lifecycle separate)

Closes #607, closes #462
@codecov

codecov Bot commented Mar 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.49420% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.89%. Comparing base (835e018) to head (e7c110b).
⚠️ Report is 26 commits behind head on main.

Files with missing lines Patch % Lines
config/watcher.go 68.00% 22 Missing and 10 partials ⚠️
server/server.go 83.78% 15 Missing and 3 partials ⚠️
api/api_interface_impl.go 76.47% 8 Missing and 8 partials ⚠️
resolver/blocking_resolver.go 64.28% 7 Missing and 3 partials ⚠️
cmd/serve.go 14.28% 5 Missing and 1 partial ⚠️
server/server_endpoints.go 57.14% 3 Missing and 3 partials ⚠️
server/server_config_trigger.go 71.42% 4 Missing ⚠️
server/cert_provider.go 95.08% 1 Missing and 2 partials ⚠️
config/config.go 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2022      +/-   ##
==========================================
- Coverage   89.52%   88.89%   -0.64%     
==========================================
  Files          95       97       +2     
  Lines        6960     7319     +359     
==========================================
+ Hits         6231     6506     +275     
- Misses        559      617      +58     
- Partials      170      196      +26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds live configuration reload to Blocky to avoid DNS downtime, supporting reload via SIGHUP, config file watching, and new HTTP API endpoints, while swapping the active resolver chain atomically.

Changes:

  • Add Server.Reload() + atomic resolver/config swapping, plus cache-reuse hooks during reload.
  • Introduce config file watching (fsnotify + polling fallback) and wire it into cmd/serve.
  • Add REST endpoints to trigger reload and to return the active config, and expose new Prometheus reload metrics.

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
server/server.go Adds atomic resolver/config storage and core reload implementation + TLS hot-reload integration.
server/server_test.go Updates tests for new NewServer signature and panic-path coverage.
server/server_reload_test.go Adds unit tests for concurrent reload rejection, invalid path handling, and helper functions.
server/server_endpoints.go Makes OpenAPI implementation resolve resolver interfaces dynamically from the active chain.
server/server_config_trigger.go Adds SIGHUP reload trigger on non-Windows builds.
server/server_config_trigger_windows.go Adds no-op reload trigger on Windows.
server/cert_provider.go Adds polling-based TLS certificate hot-reload provider.
server/cert_provider_test.go Adds CertProvider tests for load/reload/update-paths behavior.
resolver/hosts_file_resolver.go Adds options to reuse hosts data across reloads and accessor for existing data.
resolver/blocking_resolver.go Adds options-based constructor to reuse list caches and special reload initialization path.
resolver/blocking_resolver_test.go Adds a test asserting list cache pointer reuse in reload mode.
metrics/metrics.go Adds new Prometheus metrics for config reload totals and last-success timestamp.
metrics/metrics_test.go Updates registry completeness assertions to include new metrics.
go.mod Promotes fsnotify to a direct dependency and updates version.
go.sum Updates checksums for fsnotify dependency change.
e2e/containers.go Adds helpers to modify container files and send signals via Docker API.
e2e/config_reload_test.go Adds e2e coverage for SIGHUP reload and API-triggered reload.
docs/configuration.md Documents reload triggers, configWatch, API endpoints, and metrics.
docs/config.yml Adds configWatch example configuration.
docs/api/openapi.yaml Adds /config and /config/reload API paths.
config/watcher.go Implements config watcher with fsnotify + polling and cooldown.
config/watcher_test.go Adds tests for watcher file/dir change detection and error tolerance.
config/upstream.go Comment formatting/clarity tweaks.
config/dnssec.go Comment formatting/clarity tweaks.
config/config.go Adds ConfigWatch config section and changes validation to return errors (not fatal).
config/config_test.go Adds tests for invalid DNS64 validation and ConfigWatch defaults.
cmd/serve.go Passes configPath into server and wires ConfigWatcher startup.
api/mocks_test.go Extends autogenerated mocks for new interfaces.
api/api_server.gen.go Regenerates server stubs to include new config endpoints.
api/api_interface_impl.go Adds GetConfig/ConfigReload handlers and dynamic resolver lookup abstraction.
api/api_interface_impl_test.go Adds tests for new config API behaviors.
api/api_client.gen.go Regenerates client to include new config endpoints.
.gitignore Adds docs/superpowers/ to ignore list.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread server/server.go Outdated
Comment on lines +84 to +86
resolverCtx, resolverCancel := context.WithCancel(context.Background())

bootstrap, err := s.reloadBootstrap(resolverCtx, oldCfg, newCfg)

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reload() creates resolverCtx from context.Background(), so after the first successful reload the new resolver chain (and any goroutines it starts, e.g. periodic refreshes) will no longer be canceled when the server’s main ctx is canceled during shutdown. This can leak goroutines and prolong shutdown. Consider deriving resolverCtx from the server’s root ctx (store it on Server) and using context.WithCancel(s.rootCtx) instead of Background().

Copilot uses AI. Check for mistakes.
Comment thread server/server.go Outdated
Comment on lines +123 to +124
log.Configure(&newCfg.Log)

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadAndValidateConfig() applies log.Configure(&newCfg.Log) before the new resolver chain is successfully built/applied. If createQueryResolver later fails, the reload is rejected but logging will still switch to the new config, which violates the “old config kept running” guarantee. Move log.Configure to after applyReload succeeds, or revert logging on reload failure.

Suggested change
log.Configure(&newCfg.Log)

Copilot uses AI. Check for mistakes.
Comment thread server/server.go Outdated
Comment on lines +161 to +166
if oldBlocking, blErr := resolver.GetFromChainWithType[*resolver.BlockingResolver](oldResolver); blErr == nil {
if listsUnchanged(oldCfg.Blocking, newCfg.Blocking) {
opts.blockingOpts.ExistingDenylistCache = oldBlocking.DenylistMatcher()
opts.blockingOpts.ExistingAllowlistCache = oldBlocking.AllowlistMatcher()
}
}

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the existing deny/allow ListCache instances across reloads is risky because ListCache.StartPeriodicRefresh is bound to the context used when the cache was created. After applyReload(), oldCancel() will cancel that context, which will stop periodic refresh for the reused caches permanently. To avoid stale lists after the first reload, either recreate caches, or add a way to restart periodic refresh with the new context (or create caches off a long-lived server ctx that isn’t canceled on chain swap).

Suggested change
if oldBlocking, blErr := resolver.GetFromChainWithType[*resolver.BlockingResolver](oldResolver); blErr == nil {
if listsUnchanged(oldCfg.Blocking, newCfg.Blocking) {
opts.blockingOpts.ExistingDenylistCache = oldBlocking.DenylistMatcher()
opts.blockingOpts.ExistingAllowlistCache = oldBlocking.AllowlistMatcher()
}
}
// Do not reuse deny/allow list caches across reloads, since their periodic
// refresh is bound to the context used when they were created. That context
// is canceled in applyReload(), which would stop refresh for reused caches.

Copilot uses AI. Check for mistakes.
Comment thread resolver/blocking_resolver.go Outdated
Comment on lines +189 to +214
var downloader lists.FileDownloader

if opts.ExistingDenylistCache == nil || opts.ExistingAllowlistCache == nil {
downloader = lists.NewDownloader(cfg.Loading.Downloads, bootstrap.NewHTTPTransport())
}

if opts.ExistingDenylistCache != nil {
denylistMatcher = opts.ExistingDenylistCache
} else {
denylistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeDenylist,
cfg.Loading, cfg.Denylists, downloader)
if err != nil {
return nil, nil, fmt.Errorf("failed to create denylist cache: %w", err)
}
}

if opts.ExistingAllowlistCache != nil {
allowlistMatcher = opts.ExistingAllowlistCache
} else {
allowlistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeAllowlist,
cfg.Loading, cfg.Allowlists, downloader)
if err != nil {
return nil, nil, fmt.Errorf("failed to create allowlist cache: %w", err)
}
}

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createListMatchers returns Existing*Cache directly without reinitializing periodic refresh under the new ctx. If the existing cache was created with a ctx that later gets canceled (e.g. when the old resolver chain is torn down during reload), the cache’s periodic refresh goroutine will stop and won’t restart, leading to stale allow/deny lists. Consider adding a ListCache method to (re)StartPeriodicRefresh with the new ctx, or avoid sharing caches across contexts.

Suggested change
var downloader lists.FileDownloader
if opts.ExistingDenylistCache == nil || opts.ExistingAllowlistCache == nil {
downloader = lists.NewDownloader(cfg.Loading.Downloads, bootstrap.NewHTTPTransport())
}
if opts.ExistingDenylistCache != nil {
denylistMatcher = opts.ExistingDenylistCache
} else {
denylistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeDenylist,
cfg.Loading, cfg.Denylists, downloader)
if err != nil {
return nil, nil, fmt.Errorf("failed to create denylist cache: %w", err)
}
}
if opts.ExistingAllowlistCache != nil {
allowlistMatcher = opts.ExistingAllowlistCache
} else {
allowlistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeAllowlist,
cfg.Loading, cfg.Allowlists, downloader)
if err != nil {
return nil, nil, fmt.Errorf("failed to create allowlist cache: %w", err)
}
}
downloader := lists.NewDownloader(cfg.Loading.Downloads, bootstrap.NewHTTPTransport())
denylistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeDenylist,
cfg.Loading, cfg.Denylists, downloader)
if err != nil {
return nil, nil, fmt.Errorf("failed to create denylist cache: %w", err)
}
allowlistMatcher, err = lists.NewListCache(ctx, lists.ListCacheTypeAllowlist,
cfg.Loading, cfg.Allowlists, downloader)
if err != nil {
return nil, nil, fmt.Errorf("failed to create allowlist cache: %w", err)
}

Copilot uses AI. Check for mistakes.
Comment thread server/cert_provider.go
Comment on lines +92 to +100
mod := cp.getModTime()
if mod.After(cp.lastMod) {
cp.lastMod = mod
if err := cp.loadCert(); err != nil {
log.PrefixedLog("server").Error("TLS certificate reload failed: ", err)
} else {
log.PrefixedLog("server").Info("TLS certificate reloaded")
}
}

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CertProvider.lastMod is written in poll() without synchronization and also set in UpdatePaths() under cp.mu. Because poll() reads/writes lastMod concurrently with UpdatePaths(), this is a data race under the race detector. Protect lastMod with the same mutex (read/write under lock) or convert it to an atomic value.

Copilot uses AI. Check for mistakes.
Comment thread server/cert_provider.go
Comment on lines +54 to +64
func (cp *CertProvider) loadCert() error {
cp.mu.RLock()
certFile, keyFile := cp.certFile, cp.keyFile
cp.mu.RUnlock()

cert, err := tls.LoadX509KeyPair(certFile, keyFile)
if err != nil {
return err
}
cp.cert.Store(&cert)

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadCert() stores the tls.Certificate returned by tls.LoadX509KeyPair without populating cert.Leaf. Elsewhere in the codebase (e.g. util.TLSGenerateSelfSignedCert) Leaf is explicitly set to avoid parsing on every handshake, and your tests compare/use Leaf. Consider parsing cert.Certificate[0] with x509.ParseCertificate and assigning cert.Leaf before storing.

Copilot uses AI. Check for mistakes.
Comment thread api/api_client.gen.go
Comment on lines +904 to +911
switch {
case strings.Contains(rsp.Header.Get("Content-Type"), "yaml") && rsp.StatusCode == 200:
var dest string
if err := yaml.Unmarshal(bodyBytes, &dest); err != nil {
return nil, err
}
response.YAML200 = &dest

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseGetConfigResponse unmarshals the response body as YAML into a Go string. But the server returns YAML text representing a mapping (the config), so unmarshaling into string will fail ("cannot unmarshal !!map into string"). The client should treat the body as plain text (e.g. YAML200 = string(bodyBytes)) rather than yaml.Unmarshal into a scalar string, or adjust the OpenAPI schema/content type so codegen matches the actual payload.

Copilot uses AI. Check for mistakes.
Comment on lines +29 to +33
func registerReloadTrigger(ctx context.Context, s *Server) {
signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGHUP)

go func() {

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

registerReloadTrigger registers with signal.Notify but never calls signal.Stop(signals) on ctx.Done(). This can keep delivering signals to an unconsumed channel after shutdown and retains global signal handlers longer than necessary. Consider deferring signal.Stop(signals) and optionally closing the channel when the goroutine exits.

Copilot uses AI. Check for mistakes.
Comment thread api/api_interface_impl.go
Comment on lines +228 to +238
data, err := yaml.Marshal(cfg)
if err != nil {
return nil, fmt.Errorf("failed to marshal config: %w", err)
}

return GetConfig200TextyamlResponse{
Body: bytes.NewReader(data),
ContentLength: int64(len(data)),
}, nil
}

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetConfig returns the full active config marshaled as YAML. The config struct contains secrets (e.g. redis.password / redis.sentinelPassword), so this endpoint will expose credentials in cleartext to anyone who can reach the HTTP API. Consider redacting secret fields before marshaling (e.g. copy cfg and blank passwords), omitting sensitive sections, or gating the endpoint behind an authentication/authorization mechanism if one exists.

Suggested change
data, err := yaml.Marshal(cfg)
if err != nil {
return nil, fmt.Errorf("failed to marshal config: %w", err)
}
return GetConfig200TextyamlResponse{
Body: bytes.NewReader(data),
ContentLength: int64(len(data)),
}, nil
}
rawData, err := yaml.Marshal(cfg)
if err != nil {
return nil, fmt.Errorf("failed to marshal config: %w", err)
}
redactedData, err := redactYAMLSecrets(rawData)
if err != nil {
return nil, fmt.Errorf("failed to redact config: %w", err)
}
return GetConfig200TextyamlResponse{
Body: bytes.NewReader(redactedData),
ContentLength: int64(len(redactedData)),
}, nil
}
// redactYAMLSecrets takes a YAML document and returns a copy with likely secret
// values (e.g. passwords, tokens) replaced by a placeholder string.
func redactYAMLSecrets(data []byte) ([]byte, error) {
var v interface{}
if err := yaml.Unmarshal(data, &v); err != nil {
return nil, err
}
redactNode(&v)
return yaml.Marshal(v)
}
// redactNode walks an arbitrary YAML-decoded structure and redacts values of
// keys that look sensitive.
func redactNode(v *interface{}) {
switch node := (*v).(type) {
case map[interface{}]interface{}:
for k, val := range node {
keyStr := fmt.Sprint(k)
if isSensitiveKey(keyStr) {
node[k] = "***redacted***"
continue
}
sub := val
redactNode(&sub)
node[k] = sub
}
case map[string]interface{}:
for k, val := range node {
if isSensitiveKey(k) {
node[k] = "***redacted***"
continue
}
sub := val
redactNode(&sub)
node[k] = sub
}
case []interface{}:
for i, val := range node {
sub := val
redactNode(&sub)
node[i] = sub
}
default:
// scalars are left as-is
}
}
// isSensitiveKey reports whether a key name likely refers to a secret value.
func isSensitiveKey(key string) bool {
l := strings.ToLower(key)
return strings.Contains(l, "password") ||
strings.Contains(l, "passwd") ||
strings.Contains(l, "secret") ||
strings.Contains(l, "token") ||
strings.Contains(l, "apikey") ||
strings.Contains(l, "api_key")
}

Copilot uses AI. Check for mistakes.
Comment thread config/watcher.go Outdated
Comment on lines +98 to +103
if event.Has(fsnotify.Write) || event.Has(fsnotify.Create) {
logger.Info("config file change detected (fsnotify)")
if err := onReload(); err != nil {
logger.Errorf("config reload failed: %v", err)
}
}

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startFsnotify() only triggers reload on fsnotify.Write or fsnotify.Create. Many common “atomic save” patterns (and Kubernetes ConfigMap updates) replace the config via rename/symlink swap, which typically emits fsnotify.Rename/Remove events for the target file rather than Write. If polling is disabled (interval=0), these updates may never trigger a reload. Consider also handling Rename/Remove (and filtering by event.Name == configPath when watching a directory).

Copilot uses AI. Check for mistakes.
0xERR0R added 2 commits March 20, 2026 14:50
- Don't reuse ListCache across reloads (periodic refresh context dies)
- Derive resolver context from server root context (fix goroutine leak)
- Move log.Configure after successful reload (fix premature log switch)
- Redact secrets in GET /api/config response
- Handle fsnotify Rename events (atomic saves, K8s ConfigMaps)
- Fix CertProvider.lastMod data race
- Add signal.Stop cleanup in signal handlers
- Populate cert.Leaf for performance
- Filter fsnotify events to actual config file, not entire directory
- Fix cooldown race using UnixMilli and CompareAndSwap
- Add 30s timeout to FQDN IP cache init during reload
- Add 5s grace period before cancelling old resolver context
- Add unit tests for redactSecrets function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Blocky should have a reload option to re-read the configuration [Feature Request] Automatically reload config

2 participants