feat(xds): warn on stream close after large send #16784
Open
Automaat wants to merge 1 commit into
gRPC C-Core sizes its per-stream HTTP/2 receive flow-control
window from grpc.max_receive_message_length. When an xDS
DiscoveryResponse exceeds that window the client silently
cancels the stream a few hundred milliseconds after dispatch,
leaving the control plane with no signal beyond "stream
closed" - operators today only see it via secondary symptoms
(warming clusters, missing endpoints).
Track the last response sent on each stream and, when the
stream closes within 5s of a >=1 MiB send, increment a new
counter xds_stream_likely_window_depletion_total{type_url}
and emit a log line pointing operators at the gRPC
receive-window limit. Existing metrics (responses_sent, etc.)
stay unconditional.
Defaults: 5s window leaves headroom over the few-hundred-ms
cancel delay without inflating false positives; 1 MiB
threshold triggers well below the 4 MiB gRPC default so we
also catch configurations that tightened the limit.
Signed-off-by: Marcin Skalski <skalskimarcin33@gmail.com>
ff55cb3 to
31f7f96
lahabana
reviewed
May 29, 2026
lahabana
left a comment
Contributor
That's interesting, but I'm a little worried about the complexity here.
Could we consider doing something simpler:
- monitor closed xDS streams (I think we possibly have this already)
- log when sending a response over 2 MB (with the DP name), a bit like a slow query logger in SQL
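A minimal sketch of the simpler alternative suggested above, assuming the check runs at send time and the DP name is already known. The function name, the log format, and the reading of "2MB" as 2 MiB are all hypothetical and not part of the PR:

```go
package main

import "fmt"

// largeResponseThreshold is the reviewer's "2MB", read here as 2 MiB (an assumption).
const largeResponseThreshold = 2 << 20

// largeResponseLog returns a log line for responses strictly over the
// threshold and false otherwise, similar in spirit to a slow query log.
func largeResponseLog(dpName, typeURL string, size int) (string, bool) {
	if size <= largeResponseThreshold {
		return "", false
	}
	line := fmt.Sprintf("large xDS response sent: dataplane=%s type_url=%s bytes=%d",
		dpName, typeURL, size)
	return line, true
}

func main() {
	// Example values only; the dataplane name is made up.
	typeURL := "type.googleapis.com/envoy.config.cluster.v3.Cluster"
	if line, ok := largeResponseLog("backend-01", typeURL, 3<<20); ok {
		fmt.Println(line)
	}
}
```

This version keeps no per-stream state, which is the complexity trade-off the comment is pointing at: it reports every large send, not only the ones followed by a stream close.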
lahabana
reviewed
May 29, 2026
}
last, hadSend := s.lastSendByStream[streamID]
delete(s.lastSendByStream, streamID)
s.Unlock()
Contributor
LLMs are overly sensitive about logging inside critical sections. IMO, I've seen more issues with a missing or misplaced release than performance issues from logging inside a critical section.
Therefore, I think keeping the defer is OK and less likely to cause issues than unlocking here.
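To illustrate the point in the comment above, here is a small sketch of the deferred-unlock pattern. The struct and map name are modeled on the diff hunk, but the map's value type is simplified to an int and the method name is hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// stats is a simplified stand-in for the callbacks struct in the diff.
type stats struct {
	sync.Mutex
	lastSendByStream map[int64]int // stream ID -> size of last send (simplified)
}

// closeStream uses defer so that every return path releases the lock,
// including the early return below.
func (s *stats) closeStream(streamID int64) (int, bool) {
	s.Lock()
	defer s.Unlock()

	size, hadSend := s.lastSendByStream[streamID]
	if !hadSend {
		return 0, false // early return; the deferred Unlock still runs
	}
	delete(s.lastSendByStream, streamID)
	// Any logging placed here would run while the lock is held.
	return size, true
}

func main() {
	s := &stats{lastSendByStream: map[int64]int{7: 2 << 20}}
	size, ok := s.closeStream(7)
	fmt.Println(size, ok)
}
```

With a manual `s.Unlock()` before the log call, each new return path added later has to remember its own unlock; the deferred form trades a slightly longer critical section for that guarantee.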
	sentAt:  core.Now(),
	typeURL: typeURL,
}
s.Unlock()

statsLogger.Info(
	"xds stream closed within window of sending a large response; likely caller-side gRPC receive-window depletion (kumahq/kuma#16355). "+
		"If using google_grpc xDS transport, raise KUMA_BOOTSTRAP_SERVER_PARAMS_XDS_GRPC_MAX_RECEIVE_MESSAGE_BYTES.",
	"streamID", streamID,
Contributor
streamID isn't going to be very useful to users. Can we get the DP name?
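One possible way to have the DP name available at close time is to remember the node ID seen on each stream. The sketch below is an assumption about how that could look, not what the PR does; the type, its methods, and the example node ID are hypothetical. It relies on the xDS convention that the node identifier is only required on the first request of a stream, so empty IDs on later requests are ignored:

```go
package main

import (
	"fmt"
	"sync"
)

// nodeNames remembers which node (dataplane) each stream belongs to.
type nodeNames struct {
	mu    sync.Mutex
	names map[int64]string
}

func newNodeNames() *nodeNames {
	return &nodeNames{names: map[int64]string{}}
}

// onRequest records the node ID for a stream. Later requests on the same
// stream may omit the node, so an empty ID never overwrites a stored one.
func (n *nodeNames) onRequest(streamID int64, nodeID string) {
	if nodeID == "" {
		return
	}
	n.mu.Lock()
	defer n.mu.Unlock()
	n.names[streamID] = nodeID
}

// onClosed returns the recorded node ID (empty if none) and forgets the stream.
func (n *nodeNames) onClosed(streamID int64) string {
	n.mu.Lock()
	defer n.mu.Unlock()
	name := n.names[streamID]
	delete(n.names, streamID)
	return name
}

func main() {
	n := newNodeNames()
	n.onRequest(1, "default.backend-01") // example value only
	n.onRequest(1, "")                   // follow-up request without a node
	fmt.Println("closed stream for", n.onClosed(1))
}
```

The returned name could then be passed to the log call alongside, or instead of, `streamID`.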
bartsmykla
approved these changes
May 29, 2026
Motivation
When an xDS DiscoveryResponse exceeds gRPC C-Core's
grpc.max_receive_message_length (default 4 MiB, which in the google_grpc
transport also sizes the per-stream HTTP/2 receive flow-control window),
the client silently cancels the stream a few hundred milliseconds after
dispatch. The control plane today sees only "stream closed" and operators
have to diagnose it from secondary symptoms (warming clusters, missing
endpoints). This change surfaces the failure directly via a metric and a
log line, so the pattern is visible the moment it happens. Referenced
upstream issue: #16355.
Implementation information
- In pkg/util/xds/stats_callbacks.go, track the most recent response size,
  timestamp, and type_url per stream alongside the existing per-stream state.
  The update happens on OnStreamResponse and OnStreamDeltaResponse after the
  existing responses_sent increment; no behavior change to existing metrics.
- On OnStreamClosed / OnDeltaStreamClosed: if the stream closed within 5s of
  sending a >=1 MiB response, increment counter
  <dsType>_stream_likely_window_depletion_total{type_url=...} and emit a log
  line at info level (logr convention) pointing operators at
  KUMA_BOOTSTRAP_SERVER_PARAMS_XDS_GRPC_MAX_RECEIVE_MESSAGE_BYTES.
- Defaults: the 5s window leaves headroom over the few-hundred-ms cancel
  delay reported in the field; the 1 MiB threshold sits well below the 4 MiB
  gRPC default to also catch configurations that already tightened the limit.
- Response size is measured with proto.Size on the underlying
  Discovery(Delta)Response, with a fallback that sums resources.
send does not; large send + close-after-window does not; no send +
close does not; delta variant works the same way.
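The detection rule described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the type and function names are hypothetical, a plain map stands in for the Prometheus counter, the clock is injected so the example is deterministic, and treating a close at exactly 5s as inside the window is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Thresholds taken from the PR description; the constant names are hypothetical.
const (
	largeSendBytes = 1 << 20         // 1 MiB
	closeWindow    = 5 * time.Second // close must follow the send within this window
)

// lastSend records the most recent response sent on one stream.
type lastSend struct {
	size    int
	sentAt  time.Time
	typeURL string
}

// fakeClock is a manually advanced clock used to make the example deterministic.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// tracker is a stand-in for the per-stream state kept by the stats callbacks.
type tracker struct {
	mu      sync.Mutex
	now     func() time.Time
	last    map[int64]lastSend
	flagged map[string]int // stand-in for the counter, keyed by type_url
}

func newTracker(now func() time.Time) *tracker {
	return &tracker{now: now, last: map[int64]lastSend{}, flagged: map[string]int{}}
}

// onResponse overwrites the stream's record with the response just sent.
func (t *tracker) onResponse(streamID int64, typeURL string, size int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.last[streamID] = lastSend{size: size, sentAt: t.now(), typeURL: typeURL}
}

// onClosed forgets the stream and reports whether the close followed a
// large send closely enough to look like receive-window depletion.
func (t *tracker) onClosed(streamID int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, hadSend := t.last[streamID]
	delete(t.last, streamID)
	if !hadSend || last.size < largeSendBytes || t.now().Sub(last.sentAt) > closeWindow {
		return false
	}
	t.flagged[last.typeURL]++
	return true
}

func main() {
	clk := &fakeClock{}
	tr := newTracker(clk.Now)
	tr.onResponse(1, "type.googleapis.com/envoy.config.cluster.v3.Cluster", 2<<20)
	clk.Advance(300 * time.Millisecond) // roughly the cancel delay described above
	fmt.Println(tr.onClosed(1))         // prints "true"
}
```

Because only the latest send per stream is kept, a small response sent after a large one resets the record, so the heuristic keys on the last response before the close.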