fix(dp-server): bound shutdown, propagate appCtx#16541
Merged
bartsmykla merged 3 commits intoMay 12, 2026
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes dp-server shutdown behavior under load by ensuring shutdown is time-bounded and by rooting xDS/HDS long-lived handler contexts in the runtime app context so they observe SIGTERM cancellation. This aligns dp-server/component-manager shutdown with controller-runtime’s grace period and prevents pods from exiting non-zero due to hung streams.
Changes:
- Add
dp_server.gracefulShutdownTimeout(default 10s) and use it to boundhttp.Server.Shutdown, then force-stop gRPC as a backstop. - Propagate
rt.AppContext()into SoTW/Delta xDS servers, the dataplane sync tracker watchdog, and HDS server wiring. - Change
server.NewDpServerto return(*DpServer, error)(metric registration can fail) and add unit tests for shutdown bounding / WaitForDone behavior.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/xds/server/v3/components.go | Root SoTW/Delta server contexts in rt.AppContext(); pass app ctx into sync tracker. |
| pkg/xds/server/callbacks/dataplane_sync_tracker.go | Make watchdog contexts derive from an injected parent (app) context. |
| pkg/xds/server/callbacks/dataplane_sync_tracker_test.go | Update constructor usage; add test ensuring watchdog is canceled by parent context. |
| pkg/xds/bootstrap/server_test.go | Adapt to NewDpServer(...)(*DpServer, error) and set shutdown timeout in test config. |
| pkg/test/runtime/runtime.go | Handle NewDpServer error and pass the constructed server into the runtime builder. |
| pkg/hds/hds_suite_test.go | Add Ginkgo suite bootstrap for the new HDS test package. |
| pkg/hds/components.go | Wire HDS server context to rt.AppContext() (via user.Ctx(...)). |
| pkg/hds/components_test.go | Add test ensuring HDS server receives a context derived from the runtime app context. |
| pkg/dp-server/server/server.go | Implement bounded shutdown + metrics; implement GracefulComponent via WaitForDone; make constructor return error. |
| pkg/dp-server/server/server_test.go | Add unit tests for shutdown bounding, Stop sequencing, WaitForDone, and refusing double Start. |
| pkg/dp-server/server/server_suite_test.go | Add Ginkgo suite bootstrap for dp-server server tests. |
| pkg/core/bootstrap/bootstrap.go | Handle NewDpServer error and pass constructed server into the runtime builder. |
| pkg/config/loader_test.go | Extend loader tests to cover gracefulShutdownTimeout parsing and env var override. |
| pkg/config/dp-server/config.go | Add GracefulShutdownTimeout field, validation, and defaults. |
| pkg/config/app/kuma-cp/kuma-cp.defaults.yaml | Add default gracefulShutdownTimeout and document env var. |
| docs/generated/raw/kuma-cp.yaml | Update generated config docs with the new gracefulShutdownTimeout field. |
7f67f44 to
0e1ff03
Compare
Contributor
Reviewer Checklist🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
|
8b5225e to
4dffc47
Compare
http.Server.Shutdown(context.Background()) was unbounded, and the xDS
SoTW/Delta servers + watchdog goroutines were rooted in
context.Background(), so xDS handlers never observed app shutdown.
Under load this caused controller-runtime's 30s drain budget to expire
and kuma-cp to exit non-zero on SIGTERM.
DpServer.Start now bounds http.Shutdown with a configurable
GracefulShutdownTimeout (default 10s, env
KUMA_DP_SERVER_GRACEFUL_SHUTDOWN_TIMEOUT), then force-stops the gRPC
server. grpc-go's GracefulStop is unsafe here because
serverHandlerTransport.Drain panics on ServeHTTP-mounted servers.
DpServer also implements GracefulComponent so the manager waits for it
to finish before exiting.
xDS SoTW, Delta, the dataplane sync tracker watchdog ctx, and the HDS
server now derive their contexts from rt.AppContext() so handlers exit
on app shutdown.
New metrics: dp_server_force_stop_total{reason=timeout|error} counter
and dp_server_shutdown_duration_seconds histogram. NewDpServer now
returns (*DpServer, error) to surface metric registration failures,
matching the pattern used by other metric-registering constructors.
Signed-off-by: Bart Smykla <bartek@smykla.com>
4dffc47 to
a1bd9d8
Compare
Automaat
reviewed
May 12, 2026
- Return error only for unexpected shutdown failures; keep nil for clean drain and DeadlineExceeded so the bounded grace period stays silent in the manager's errCh. - Drop DeadlineExceeded log to INFO to avoid false-positive alerts on ERROR-level kuma-cp lines; ERROR stays for the unexpected branch. - Buffer errChan so a late ServeTLS error after force-stop cannot leak the goroutine when <-stop already won the outer select. - Refine shutdown_duration histogram buckets around the 10s default for sub-second and near-deadline resolution. - Document gracefulShutdownTimeout in UPGRADE.md so operators with bumped terminationGracePeriodSeconds tune it in lockstep. Signed-off-by: Bart Smykla <bartek@smykla.com>
Automaat
previously approved these changes
May 12, 2026
…ceful-shutdown Signed-off-by: Bart Smykla <bartek@smykla.com> # Conflicts: # UPGRADE.md
Automaat
approved these changes
May 12, 2026
This was referenced May 12, 2026
bartsmykla
added a commit
that referenced
this pull request
May 12, 2026
…#16568) Automatic cherry-pick of #16541 for branch release-2.9. Generated by [action](https://github.com/kumahq/kuma/actions/runs/25736177605); conflicts resolved manually. --------- Signed-off-by: Bart Smykla <bartek@smykla.com> Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla
added a commit
that referenced
this pull request
May 12, 2026
…#16566) Automatic cherry-pick of #16541 for branch release-2.11. Generated by [action](https://github.com/kumahq/kuma/actions/runs/25736177605); conflicts resolved manually. --------- Signed-off-by: Bart Smykla <bartek@smykla.com> Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla
added a commit
that referenced
this pull request
May 12, 2026
…#16567) Automatic cherry-pick of #16541 for branch release-2.12. Generated by [action](https://github.com/kumahq/kuma/actions/runs/25736177605); conflicts resolved manually. --------- Signed-off-by: Bart Smykla <bartek@smykla.com> Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla
added a commit
that referenced
this pull request
May 13, 2026
…#16569) Automatic cherry-pick of #16541 for branch release-2.7. Generated by [action](https://github.com/kumahq/kuma/actions/runs/25736177605); conflicts resolved manually. --------- Signed-off-by: Bart Smykla <bartek@smykla.com> Co-authored-by: Bart Smykla <bartek@smykla.com>
bartsmykla
added a commit
that referenced
this pull request
May 13, 2026
…#16565) Automatic cherry-pick of #16541 for branch release-2.13. Generated by [action](https://github.com/kumahq/kuma/actions/runs/25736177605); conflicts resolved manually. --------- Signed-off-by: Bart Smykla <bartek@smykla.com> Signed-off-by: Bart Smykla <bartek+claude.ai@smykla.com> Co-authored-by: Bart Smykla <bartek@smykla.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Remote control planes were crashing on SIGTERM under load. controller-runtime's manager logs:
There are two compounding root causes in the dp-server stack:
DpServer.Startcalledhttp.Server.Shutdown(context.Background())- unbounded.context.Background(), so xDS handlers never observed app shutdown.http.Shutdownwaited forever for xDS streams whose handlers never exited, controller-runtime's 30s grace expired, and the pod exited non-zero.Implementation information
DpServer.Startnow wraps the shutdown:dp_server.gracefulShutdownTimeout(default 10s, envKUMA_DP_SERVER_GRACEFUL_SHUTDOWN_TIMEOUT)http.Server.Shutdown(ctx)firstgrpcServer.Stop()second as a backstop for handlers that ignored the ctx cancelWhy
Stopand notGracefulStop: gRPC is mounted viaServeHTTP(seepkg/dp-server/server/server.gohandle), so its transport isserverHandlerTransport.GracefulStoproutes throughdrainAllServerTransportsLocked, which callsst.Drain("graceful_stop")on every transport. OnserverHandlerTransport,Drainis a hard panic (panic("Drain() is not implemented")). SoGracefulStopwould crash the process during shutdown.StopskipsDrainand closes the transports directly, which is what we want as a backstop.DpServernow implementsGracefulComponentso the component manager waits forStartto return before exiting.Contexts for SoTW, Delta, the dataplane sync tracker watchdog, and the HDS server all derive from
rt.AppContext()now, so the kuma-cp signal handler propagates straight through to the handler loops.NewDpServerreturns(*DpServer, error)since it registers metrics that can fail at boot. Matches the pattern used bypkg/metrics/store/,pkg/plugins/resources/postgres/pgx_store.go, etc.New metrics
dp_server_force_stop_total{reason}counter. Reasons are fixed at compile time:shutdown_timeout,http_error.dp_server_shutdown_duration_secondshistogram with buckets sized around the 10s default and 30s outer bound.