
fix(dp-server): bound shutdown, propagate appCtx #16541

Merged
bartsmykla merged 3 commits into kumahq:master from bartsmykla:fix/dp-server-graceful-shutdown on May 12, 2026

Conversation

bartsmykla (Contributor) commented May 11, 2026

Motivation

Remote control planes were crashing on SIGTERM under load, with controller-runtime's manager logging:

failed waiting for all runnables to end within grace period of 30s: context deadline exceeded

There are two compounding root causes in the dp-server stack:

  1. DpServer.Start called http.Server.Shutdown(context.Background()), which is unbounded.
  2. SoTW, Delta, and DataplaneSyncTracker watchdog contexts were all rooted in context.Background(), so xDS handlers never observed app shutdown.

http.Shutdown waited forever for xDS streams whose handlers never exited, controller-runtime's 30s grace expired, and the pod exited non-zero.
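
A minimal sketch of root cause (1), assuming a plain *http.Server field; this shows the shape of the old call, not a quote from the previous code:

```go
package server

import (
	"context"
	"net/http"
)

// Old behaviour (sketch): Shutdown with a background context has no deadline,
// so it blocks until every connection, including hung xDS streams, has drained.
func unboundedShutdown(srv *http.Server) error {
	return srv.Shutdown(context.Background())
}
```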

Implementation information

DpServer.Start now wraps the shutdown (see the sketch after this list):

  • bounded with dp_server.gracefulShutdownTimeout (default 10s, env KUMA_DP_SERVER_GRACEFUL_SHUTDOWN_TIMEOUT)
  • http.Server.Shutdown(ctx) first
  • grpcServer.Stop() second as a backstop for handlers that ignored the ctx cancel
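
A minimal sketch of that sequence, assuming direct access to the underlying *http.Server and *grpc.Server; the function and field names here are illustrative, not the exact code in pkg/dp-server/server/server.go:

```go
package server

import (
	"context"
	"net/http"
	"time"

	"google.golang.org/grpc"
)

// shutdownBounded drains the HTTP server (and the gRPC streams multiplexed
// over it) within the configured window, then force-stops gRPC as a backstop
// for handlers that never observed the cancellation.
func shutdownBounded(httpSrv *http.Server, grpcSrv *grpc.Server, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := httpSrv.Shutdown(ctx) // returns context.DeadlineExceeded if the window expires

	// Stop, not GracefulStop: see the note below about serverHandlerTransport.
	grpcSrv.Stop()

	return err
}
```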

Why Stop and not GracefulStop: gRPC is mounted via ServeHTTP (see pkg/dp-server/server/server.go handle), so its transport is serverHandlerTransport. GracefulStop routes through drainAllServerTransportsLocked, which calls st.Drain("graceful_stop") on every transport. On serverHandlerTransport, Drain is a hard panic (panic("Drain() is not implemented")). So GracefulStop would crash the process during shutdown. Stop skips Drain and closes the transports directly, which is what we want as a backstop.
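
For context, the ServeHTTP-style mounting looks roughly like the sketch below (a paraphrase of the pattern, not the actual handle function); it is this path that puts streams on serverHandlerTransport:

```go
package server

import (
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// mixedHandler routes HTTP/2 requests with the gRPC content type to
// grpcServer.ServeHTTP; that path backs streams with serverHandlerTransport,
// whose Drain panics, which is why GracefulStop cannot be used here.
func mixedHandler(grpcServer *grpc.Server, rest http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			grpcServer.ServeHTTP(w, r)
			return
		}
		rest.ServeHTTP(w, r)
	})
}
```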

DpServer now implements GracefulComponent so the component manager waits for Start to return before exiting.
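
A sketch of that contract, assuming GracefulComponent means Start plus a blocking WaitForDone (as the per-file summary further down also suggests); the real struct has more fields:

```go
package server

// DpServer (sketch): the component manager calls Start and, on shutdown,
// waits on WaitForDone before exiting, so the bounded shutdown above always
// gets to finish.
type DpServer struct {
	done chan struct{}
}

func (s *DpServer) Start(stop <-chan struct{}) error {
	defer close(s.done) // unblocks WaitForDone only after shutdown has completed
	<-stop
	// ... run the bounded shutdown sketched above ...
	return nil
}

// WaitForDone blocks until Start has returned.
func (s *DpServer) WaitForDone() {
	<-s.done
}
```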

Contexts for SoTW, Delta, the dataplane sync tracker watchdog, and the HDS server now all derive from rt.AppContext(), so the kuma-cp signal handler propagates straight through to the handler loops.
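
A hedged sketch of the injected-parent pattern for the sync tracker watchdog; the actual constructor and types in pkg/xds/server/callbacks/dataplane_sync_tracker.go may differ:

```go
package callbacks

import "context"

// syncTracker (sketch) derives every per-stream watchdog context from the
// injected parent (rt.AppContext()) instead of context.Background(), so
// cancelling the app context on SIGTERM cancels the watchdog loops and lets
// the xDS handlers exit.
type syncTracker struct {
	parentCtx context.Context
}

func (t *syncTracker) startWatchdog(run func(ctx context.Context)) context.CancelFunc {
	ctx, cancel := context.WithCancel(t.parentCtx)
	go run(ctx)
	return cancel
}
```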

NewDpServer returns (*DpServer, error) since it registers metrics that can fail at boot. This matches the pattern used by pkg/metrics/store/, pkg/plugins/resources/postgres/pgx_store.go, etc.

New metrics

  • dp_server_force_stop_total{reason} counter. Reasons are fixed at compile time: shutdown_timeout, http_error.
  • dp_server_shutdown_duration_seconds histogram with buckets sized around the 10s default and 30s outer bound.
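
A sketch of the metric registration, and of why NewDpServer can now fail; the metric names follow the list above, while the registry type, bucket values, and helper name are assumptions:

```go
package server

import (
	"github.com/prometheus/client_golang/prometheus"
)

// registerShutdownMetrics (sketch) builds and registers the two new metrics.
func registerShutdownMetrics(reg prometheus.Registerer) (*prometheus.CounterVec, prometheus.Histogram, error) {
	forceStop := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "dp_server_force_stop_total",
		Help: "Times the gRPC backstop Stop() was used, labelled by reason.",
	}, []string{"reason"}) // reasons fixed at compile time: shutdown_timeout, http_error

	shutdownDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "dp_server_shutdown_duration_seconds",
		Help:    "Time spent shutting down the dp-server.",
		Buckets: []float64{0.1, 0.5, 1, 2.5, 5, 10, 15, 30}, // sized around the 10s default and 30s outer bound
	})

	for _, c := range []prometheus.Collector{forceStop, shutdownDuration} {
		// Registration can fail (e.g. duplicate registration); surfacing the
		// error is why NewDpServer now returns (*DpServer, error).
		if err := reg.Register(c); err != nil {
			return nil, nil, err
		}
	}
	return forceStop, shutdownDuration, nil
}
```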

@bartsmykla bartsmykla added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label May 11, 2026
@bartsmykla bartsmykla requested a review from Copilot May 11, 2026 11:54
Copilot AI (Contributor) left a comment

Pull request overview

Fixes dp-server shutdown behavior under load by ensuring shutdown is time-bounded and by rooting xDS/HDS long-lived handler contexts in the runtime app context so they observe SIGTERM cancellation. This aligns dp-server/component-manager shutdown with controller-runtime’s grace period and prevents pods from exiting non-zero due to hung streams.

Changes:

  • Add dp_server.gracefulShutdownTimeout (default 10s) and use it to bound http.Server.Shutdown, then force-stop gRPC as a backstop.
  • Propagate rt.AppContext() into SoTW/Delta xDS servers, the dataplane sync tracker watchdog, and HDS server wiring.
  • Change server.NewDpServer to return (*DpServer, error) (metric registration can fail) and add unit tests for shutdown bounding / WaitForDone behavior.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

Summary per file:

  • pkg/xds/server/v3/components.go: Root SoTW/Delta server contexts in rt.AppContext(); pass app ctx into sync tracker.
  • pkg/xds/server/callbacks/dataplane_sync_tracker.go: Make watchdog contexts derive from an injected parent (app) context.
  • pkg/xds/server/callbacks/dataplane_sync_tracker_test.go: Update constructor usage; add test ensuring watchdog is canceled by parent context.
  • pkg/xds/bootstrap/server_test.go: Adapt to NewDpServer(...) returning (*DpServer, error) and set shutdown timeout in test config.
  • pkg/test/runtime/runtime.go: Handle NewDpServer error and pass the constructed server into the runtime builder.
  • pkg/hds/hds_suite_test.go: Add Ginkgo suite bootstrap for the new HDS test package.
  • pkg/hds/components.go: Wire HDS server context to rt.AppContext() (via user.Ctx(...)).
  • pkg/hds/components_test.go: Add test ensuring HDS server receives a context derived from the runtime app context.
  • pkg/dp-server/server/server.go: Implement bounded shutdown + metrics; implement GracefulComponent via WaitForDone; make constructor return error.
  • pkg/dp-server/server/server_test.go: Add unit tests for shutdown bounding, Stop sequencing, WaitForDone, and refusing double Start.
  • pkg/dp-server/server/server_suite_test.go: Add Ginkgo suite bootstrap for dp-server server tests.
  • pkg/core/bootstrap/bootstrap.go: Handle NewDpServer error and pass constructed server into the runtime builder.
  • pkg/config/loader_test.go: Extend loader tests to cover gracefulShutdownTimeout parsing and env var override.
  • pkg/config/dp-server/config.go: Add GracefulShutdownTimeout field, validation, and defaults.
  • pkg/config/app/kuma-cp/kuma-cp.defaults.yaml: Add default gracefulShutdownTimeout and document env var.
  • docs/generated/raw/kuma-cp.yaml: Update generated config docs with the new gracefulShutdownTimeout field.

@bartsmykla bartsmykla force-pushed the fix/dp-server-graceful-shutdown branch from 7f67f44 to 0e1ff03 on May 11, 2026 12:12
github-actions (Contributor) commented:

Reviewer Checklist

🔍 Each of these sections needs to be checked by the reviewer of the PR 🔍:
If something doesn't apply, please check the box and add a justification if the reason is not obvious.

  • Is the PR title satisfactory? Is this part of a larger feature and should be grouped using > Changelog?
  • PR description is clear and complete. It links to the relevant issue as well as docs and UI issues
  • This will not break child repos: it doesn't hardcode values (e.g. "kumahq" as an image registry)
  • IPv6 is taken into account (e.g. no string concatenation of host and port)
  • Tests (Unit test, E2E tests, manual test on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests
  • Does this contain a change that users need to be notified about? In that case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (This GH action will add the "backport" label based on these file globs; if you want to prevent it from adding the "backport" label, use the no-backport-autolabel label.)

@bartsmykla bartsmykla force-pushed the fix/dp-server-graceful-shutdown branch 2 times, most recently from 8b5225e to 4dffc47 on May 11, 2026 12:30
http.Server.Shutdown(context.Background()) was unbounded, and the xDS
SoTW/Delta servers + watchdog goroutines were rooted in
context.Background(), so xDS handlers never observed app shutdown.
Under load this caused controller-runtime's 30s drain budget to expire
and kuma-cp to exit non-zero on SIGTERM.

DpServer.Start now bounds http.Shutdown with a configurable
GracefulShutdownTimeout (default 10s, env
KUMA_DP_SERVER_GRACEFUL_SHUTDOWN_TIMEOUT), then force-stops the gRPC
server. grpc-go's GracefulStop is unsafe here because
serverHandlerTransport.Drain panics on ServeHTTP-mounted servers.
DpServer also implements GracefulComponent so the manager waits for it
to finish before exiting.

xDS SoTW, Delta, the dataplane sync tracker watchdog ctx, and the HDS
server now derive their contexts from rt.AppContext() so handlers exit
on app shutdown.

New metrics: dp_server_force_stop_total{reason=timeout|error} counter
and dp_server_shutdown_duration_seconds histogram. NewDpServer now
returns (*DpServer, error) to surface metric registration failures,
matching the pattern used by other metric-registering constructors.

Signed-off-by: Bart Smykla <bartek@smykla.com>
@bartsmykla bartsmykla force-pushed the fix/dp-server-graceful-shutdown branch from 4dffc47 to a1bd9d8 on May 11, 2026 12:34
@bartsmykla bartsmykla requested a review from Copilot May 11, 2026 12:49
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.

- Return error only for unexpected shutdown failures; keep nil for
  clean drain and DeadlineExceeded so the bounded grace period stays
  silent in the manager's errCh.
- Drop DeadlineExceeded log to INFO to avoid false-positive alerts on
  ERROR-level kuma-cp lines; ERROR stays for the unexpected branch.
- Buffer errChan so a late ServeTLS error after force-stop cannot
  leak the goroutine when <-stop already won the outer select (see
  the sketch after this commit message).
- Refine shutdown_duration histogram buckets around the 10s default
  for sub-second and near-deadline resolution.
- Document gracefulShutdownTimeout in UPGRADE.md so operators with
  bumped terminationGracePeriodSeconds tune it in lockstep.

Signed-off-by: Bart Smykla <bartek@smykla.com>
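
A hedged sketch of the buffered errChan point from the commit message above; the function name and exact select shape are assumptions, not the merged code:

```go
package server

import (
	"net"
	"net/http"
)

// serveUntilStop starts ServeTLS in a goroutine and waits for either a serve
// error or the stop signal. The buffer of 1 is the point: after force-stop,
// the late ServeTLS error can still be deposited even though the outer select
// has already returned on <-stop, so the goroutine never leaks.
func serveUntilStop(srv *http.Server, l net.Listener, stop <-chan struct{}) error {
	errChan := make(chan error, 1)
	go func() {
		errChan <- srv.ServeTLS(l, "", "")
	}()
	select {
	case <-stop:
		return nil // bounded shutdown runs next; errChan's buffer absorbs the late error
	case err := <-errChan:
		return err // serve failed before shutdown was requested
	}
}
```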
Automaat previously approved these changes May 12, 2026
…ceful-shutdown

Signed-off-by: Bart Smykla <bartek@smykla.com>

# Conflicts:
#	UPGRADE.md
bartsmykla added a commit that referenced this pull request May 12, 2026
…#16568)

Automatic cherry-pick of #16541 for branch release-2.9.

Generated by
[action](https://github.com/kumahq/kuma/actions/runs/25736177605);
conflicts resolved manually.

---------

Signed-off-by: Bart Smykla <bartek@smykla.com>
Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla added a commit that referenced this pull request May 12, 2026
…#16566)

Automatic cherry-pick of #16541 for branch release-2.11.

Generated by
[action](https://github.com/kumahq/kuma/actions/runs/25736177605);
conflicts resolved manually.

---------

Signed-off-by: Bart Smykla <bartek@smykla.com>
Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla added a commit that referenced this pull request May 12, 2026
…#16567)

Automatic cherry-pick of #16541 for branch release-2.12.

Generated by
[action](https://github.com/kumahq/kuma/actions/runs/25736177605);
conflicts resolved manually.

---------

Signed-off-by: Bart Smykla <bartek@smykla.com>
Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla added a commit that referenced this pull request May 13, 2026
…#16569)

Automatic cherry-pick of #16541 for branch release-2.7.

Generated by
[action](https://github.com/kumahq/kuma/actions/runs/25736177605);
conflicts resolved manually.

---------

Signed-off-by: Bart Smykla <bartek@smykla.com>
Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla added a commit that referenced this pull request May 13, 2026
…#16565)

Automatic cherry-pick of #16541 for branch release-2.13.

Generated by
[action](https://github.com/kumahq/kuma/actions/runs/25736177605);
conflicts resolved manually.

---------

Signed-off-by: Bart Smykla <bartek@smykla.com>
Signed-off-by: Bart Smykla <bartek+claude.ai@smykla.com>
Co-authored-by: Bart Smykla <bartek@smykla.com>
