Skip to content

test: harden UDP packetizer integration test against flakiness#1176

Merged
ruck314 merged 5 commits into
pre-releasefrom
ESROGUE-730-flaky-udp-test
Apr 13, 2026
Merged

test: harden UDP packetizer integration test against flakiness#1176
ruck314 merged 5 commits into
pre-releasefrom
ESROGUE-730-flaky-udp-test

Conversation

@ruck314

@ruck314 ruck314 commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace wall-clock DRAIN_TIMEOUT / reorder-settle timeouts in tests/integration/test_udp_packetizer_integration.py with progress-based polling (wait_for_progress). As long as the monitored counter keeps advancing, the test keeps waiting — so slow CI runners no longer produce spurious failures. It only bails if the counter stops advancing for STALL_WINDOW = 5.0s.
  • Wrap the body in try/finally with deterministic, ordered teardown: flush the reordering stage, stop client then server RSSI, brief unwind sleep, then drop references in reverse link order. This avoids the worker-crash signature observed in https://github.com/slaclab/rogue/actions/runs/24151542407 (C++ use-after-free during Python teardown when upstream stages outlive downstream slaves).
  • Close the reorder-drain race: wait until every generated frame has been observed by the reordering stage (out_of_order.cnt >= FRAME_COUNT) before setting period = 0, so the flush in the setter cannot miss a frame that gets cached a moment later.
  • Add a small public cnt property on RssiOutOfOrder to expose the observation counter safely.
  • Bump CONNECTION_TIMEOUT to 30s (RSSI open is a binary handshake with no incremental progress to poll).

Test plan

  • Looped test_data_path locally 30x — 30 passed / 0 failed.
  • Full non-perf test suite (python -m pytest -n auto --dist loadfile -m "not perf") run locally — all 4 test_data_path parametrizations pass. (Unrelated pre-existing env failures on softioc/pydm/zmq-port were reproduced on unmodified pre-release and are not from this change.)
  • ./scripts/run_linters.sh — clean.
  • GitHub Actions "Full Build Test / Rogue Tests" green on this PR.

Replaces wall-clock timeouts with progress-based polling so slow CI
runners do not spuriously fail, adds deterministic teardown (try/finally
with ordered shutdown and reverse-order reference drops) to avoid C++
use-after-free on worker exit, and closes a reorder-drain race by
waiting until every generated frame has reached the reorder stage
before flushing its cache.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the UDP/RSSI/packetizer integration test by replacing fixed wall-clock drain timeouts with progress-based polling and by making teardown more deterministic to avoid CI flakiness and teardown-related crashes.

Changes:

  • Replaced fixed drain/reorder timeouts with wait_for_progress() using a stall window to detect true stalls.
  • Added RssiOutOfOrder.cnt to safely expose the observation counter used for progress polling.
  • Wrapped the test path in try/finally and reordered teardown to reduce use-after-free risk; increased RSSI connection timeout.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/integration/test_udp_packetizer_integration.py Outdated
Comment thread tests/integration/test_udp_packetizer_integration.py
- Track server_started/client_started separately so a partial RSSI
  start (server up, client _start raises) still tears down the running
  endpoint in finally instead of leaking it into later tests.
- Assert prbs_rx.getRxCount() == FRAME_COUNT after the drain wait.
  wait_for_progress returns on >=, so without this an unexpected
  duplicated/replayed frame would slip through as long as PRBS errors
  were zero.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/integration/test_udp_packetizer_integration.py Outdated
Comment thread tests/integration/test_udp_packetizer_integration.py Outdated
wait_for_progress and the RSSI open-deadline loop tracked elapsed time
with time.time(), which is wall-clock and subject to NTP jumps. A
backward step could extend the stall window arbitrarily; a forward step
could trigger a spurious stall failure or premature RSSI timeout.
Switch both sites to time.monotonic().

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/integration/test_udp_packetizer_integration.py Outdated
out_of_order counts RSSI transport segments, not PRBS application frames.
On a standard-MTU link a 2048-byte frame splits into multiple segments, so
out_of_order.cnt >= FRAME_COUNT would return before every generated frame
was observed, reintroducing the reorder-drain race. Compute the expected
segment count from the UDP max payload and packetizer head/tail overhead
and wait on that instead.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/integration/test_udp_packetizer_integration.py
out_of_order.cnt counts every outbound RSSI transport segment, including
SYN/ACK handshake and NULL keepalive control frames. The reorder-settle
wait used an absolute target derived only from data segments, so the
prior control-frame count could let it return early and reintroduce the
flush race the wait was added to close. Snapshot cnt immediately before
generating frames and wait for baseline + expected_segments instead.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ruck314 ruck314 requested a review from bengineerd April 8, 2026 21:23
@ruck314 ruck314 merged commit 8516c16 into pre-release Apr 13, 2026
11 checks passed
@ruck314 ruck314 deleted the ESROGUE-730-flaky-udp-test branch April 13, 2026 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants