Skip to content

feat(xds): warn on stream close after large send#16784

Open
Automaat wants to merge 1 commit into
kumahq:masterfrom
Automaat:feat/xds-detect-stream-closure-after-large-send
Open

feat(xds): warn on stream close after large send#16784
Automaat wants to merge 1 commit into
kumahq:masterfrom
Automaat:feat/xds-detect-stream-closure-after-large-send

Conversation

@Automaat

Copy link
Copy Markdown
Contributor

Motivation

When an xDS DiscoveryResponse exceeds gRPC C-Core's
grpc.max_receive_message_length (default 4 MiB, which in the google_grpc
transport also sizes the per-stream HTTP/2 receive flow-control window),
the client silently cancels the stream a few hundred milliseconds after
dispatch. The control plane today sees only "stream closed" and operators
have to diagnose it from secondary symptoms (warming clusters, missing
endpoints). This change surfaces the failure directly via a metric and a
log line, so the pattern is visible the moment it happens. Referenced
upstream issue: #16355.

Implementation information

  • In pkg/util/xds/stats_callbacks.go, track the most recent response
    size, timestamp, and type_url per stream alongside the existing
    per-stream state. Update happens on OnStreamResponse and
    OnStreamDeltaResponse after the existing responses_sent increment;
    no behavior change to existing metrics.
  • On OnStreamClosed / OnDeltaStreamClosed: if the stream closed
    within 5s of sending a >=1 MiB response, increment counter
    <dsType>_stream_likely_window_depletion_total{type_url=...} and emit
    a log line at info level (logr convention) pointing operators at
    KUMA_BOOTSTRAP_SERVER_PARAMS_XDS_GRPC_MAX_RECEIVE_MESSAGE_BYTES.
  • Defaults: 5s closure window, 1 MiB size threshold. Window leaves
    headroom over the few-hundred-ms cancel delay reported in the field;
    threshold sits well below the 4 MiB gRPC default to also catch
    configurations that already tightened the limit.
  • Response size is computed via proto.Size on the underlying
    Discovery(Delta)Response, with a fallback that sums resources.
  • Tests cover: large send + close-in-window fires the counter; small
    send does not; large send + close-after-window does not; no send +
    close does not; delta variant works the same way.

Changelog: feat(xds): warn on stream close after large send

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

gRPC C-Core sizes its per-stream HTTP/2 receive flow-control
window from grpc.max_receive_message_length. When an xDS
DiscoveryResponse exceeds that window the client silently
cancels the stream a few hundred milliseconds after dispatch,
leaving the control plane with no signal beyond "stream
closed" - operators today only see it via secondary symptoms
(warming clusters, missing endpoints).

Track the last response sent on each stream and, when the
stream closes within 5s of a >=1 MiB send, increment a new
counter xds_stream_likely_window_depletion_total{type_url}
and emit a log line pointing operators at the gRPC
receive-window limit. Existing metrics (responses_sent, etc.)
stay unconditional.

Defaults: 5s window leaves headroom over the few-hundred-ms
cancel delay without inflating false positives; 1 MiB
threshold triggers well below the 4 MiB gRPC default so we
also catch configurations that tightened the limit.

Signed-off-by: Marcin Skalski <skalskimarcin33@gmail.com>
@Automaat Automaat force-pushed the feat/xds-detect-stream-closure-after-large-send branch from ff55cb3 to 31f7f96 Compare May 29, 2026 08:45

@lahabana lahabana left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting but I'm a little worried of the complexity here.

Could we consider doing something simpler:

  • monitor closed xds streams (I think we possibly have this already)
  • log when sending response over 2MB (with the DP name). A bit like a slow query logger in sql

}
last, hadSend := s.lastSendByStream[streamID]
delete(s.lastSendByStream, streamID)
s.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLMs are overly sensitive at logging inside critical sections. IMO I've seen more issues with missing/misplaced release than performance issues with logging inside a critical sections

Therefore, I think keep the defer is ok and less likely to cause issue than having it here.

sentAt: core.Now(),
typeURL: typeURL,
}
s.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer s.Unlock please

statsLogger.Info(
"xds stream closed within window of sending a large response; likely caller-side gRPC receive-window depletion (kumahq/kuma#16355). "+
"If using google_grpc xDS transport, raise KUMA_BOOTSTRAP_SERVER_PARAMS_XDS_GRPC_MAX_RECEIVE_MESSAGE_BYTES.",
"streamID", streamID,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

streamID isn't going to be super useful to user. Can we get the DP name?

@github-actions

Copy link
Copy Markdown
Contributor

Reviewer Checklist

🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
If something doesn't apply please check the box and add a justification if the reason is non obvious.

  • Is the PR title satisfactory? Is this part of a larger feature and should be grouped using > Changelog?
  • PR description is clear and complete. It Links to relevant issue as well as docs and UI issues
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as an image registry)
  • IPv6 is taken into account (.e.g: no string concatenation of host port)
  • Tests (Unit test, E2E tests, manual test on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests
  • Does this contain a change that needs to be notified to users? In this case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants