test: harden UDP packetizer integration test against flakiness#1176
Conversation
Replaces wall-clock timeouts with progress-based polling so slow CI runners do not spuriously fail, adds deterministic teardown (try/finally with ordered shutdown and reverse-order reference drops) to avoid C++ use-after-free on worker exit, and closes a reorder-drain race by waiting until every generated frame has reached the reorder stage before flushing its cache.
There was a problem hiding this comment.
Pull request overview
This PR hardens the UDP/RSSI/packetizer integration test by replacing fixed wall-clock drain timeouts with progress-based polling and by making teardown more deterministic to avoid CI flakiness and teardown-related crashes.
Changes:
- Replaced fixed drain/reorder timeouts with
wait_for_progress()using a stall window to detect true stalls. - Added
RssiOutOfOrder.cntto safely expose the observation counter used for progress polling. - Wrapped the test path in
try/finallyand reordered teardown to reduce use-after-free risk; increased RSSI connection timeout.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Track server_started/client_started separately so a partial RSSI start (server up, client _start raises) still tears down the running endpoint in finally instead of leaking it into later tests. - Assert prbs_rx.getRxCount() == FRAME_COUNT after the drain wait. wait_for_progress returns on >=, so without this an unexpected duplicated/replayed frame would slip through as long as PRBS errors were zero.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
wait_for_progress and the RSSI open-deadline loop tracked elapsed time with time.time(), which is wall-clock and subject to NTP jumps. A backward step could extend the stall window arbitrarily; a forward step could trigger a spurious stall failure or premature RSSI timeout. Switch both sites to time.monotonic().
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
out_of_order counts RSSI transport segments, not PRBS application frames. On a standard-MTU link a 2048-byte frame splits into multiple segments, so out_of_order.cnt >= FRAME_COUNT would return before every generated frame was observed, reintroducing the reorder-drain race. Compute the expected segment count from the UDP max payload and packetizer head/tail overhead and wait on that instead.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
out_of_order.cnt counts every outbound RSSI transport segment, including SYN/ACK handshake and NULL keepalive control frames. The reorder-settle wait used an absolute target derived only from data segments, so the prior control-frame count could let it return early and reintroduce the flush race the wait was added to close. Snapshot cnt immediately before generating frames and wait for baseline + expected_segments instead.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
DRAIN_TIMEOUT/ reorder-settle timeouts intests/integration/test_udp_packetizer_integration.pywith progress-based polling (wait_for_progress). As long as the monitored counter keeps advancing, the test keeps waiting — so slow CI runners no longer produce spurious failures. It only bails if the counter stops advancing forSTALL_WINDOW = 5.0s.try/finallywith deterministic, ordered teardown: flush the reordering stage, stop client then server RSSI, brief unwind sleep, then drop references in reverse link order. This avoids the worker-crash signature observed in https://github.com/slaclab/rogue/actions/runs/24151542407 (C++ use-after-free during Python teardown when upstream stages outlive downstream slaves).out_of_order.cnt >= FRAME_COUNT) before settingperiod = 0, so the flush in the setter cannot miss a frame that gets cached a moment later.cntproperty onRssiOutOfOrderto expose the observation counter safely.CONNECTION_TIMEOUTto 30s (RSSI open is a binary handshake with no incremental progress to poll).Test plan
test_data_pathlocally 30x — 30 passed / 0 failed.python -m pytest -n auto --dist loadfile -m "not perf") run locally — all 4test_data_pathparametrizations pass. (Unrelated pre-existing env failures on softioc/pydm/zmq-port were reproduced on unmodifiedpre-releaseand are not from this change.)./scripts/run_linters.sh— clean.