Release Candidate v6.13.0#1226
Merged
Merged
Conversation
Detect a pre-existing non-PyDMApplication QApplication at runPyDM entry, print a step-by-step diagnostic to stderr first (so the message survives caller try/finally: sys.exit(...) wrappers), and raise RuntimeError. The diagnostic names the detected class (module.QualName) and instance id so callers can confirm they really have a plain QApplication vs another QApplication subclass. Background: downstream callers (e.g. wave8DAQ.py) sometimes construct a plain QApplication(sys.argv) before calling runPyDM. PyQt5 silently keeps two distinct app objects; PyDM's window-management hooks (XSync counter, _NET_WM_PING reply) install on PyDMApplication, but the visible windows are bound to the plain QApplication. The compositor's ping/sync messages are never answered, so the GUI is flagged as 'not responding' on GNOME-on-Wayland with XWayland. Bisection isolated the dual-QApplication pattern as the cause, independent of any rogue widget/listener path. Tests: add tests/interfaces/test_pydm_qapplication_guard.py exercising the guard in an isolated subprocess (parent never imports Qt/PyDM). Asserts RuntimeError, that the detected app type is reported as qtpy.QtWidgets.QApplication (not PySide*/PyQt*), and that the stderr banner is emitted before any post-exception output. Headless via QT_QPA_PLATFORM=offscreen. Docs: add a Notes section to the runPyDM docstring (rendered into the API page via autofunction) and a new 'Do Not Pre-Construct A QApplication' section in docs/src/pydm/starting_gui.rst with an anti-example, pointing callers at the existing display / display_factory / pydm.Display paths for companion widgets.
Move saveYaml, loadYaml, setYaml, and supporting methods from Root to Device so any device in the tree can load/save YAML configuration using device-relative paths without knowing the full tree structure. Device gets: - saveYaml(), loadYaml(), setYaml() with full file/dir/zip support - _applyYamlDict() dispatch hook using device-relative nodeMatch() - _writeConfig() scoped to device subtree Root overrides: - _applyYamlDict() delegates to _setDictRoot() for absolute-path compat - _writeConfig() delegates to _write() for full-tree commit - loadYaml()/setYaml() add InitAfterConfig check Backward compatible: all existing Root-level YAML operations, commands (SaveConfig/LoadConfig), and YAML file formats work unchanged.
Slice of #1196 (general-bug-audit-clean branch) covering the drivers-sync subset of the bug-audit fixes.
… build improvements
Conda recipe (meta.yaml, build.sh):
- Switch from monolithic boost to split libboost-python-devel/libboost-python
- Add run_exports, skip directive (Linux-only), SPDX license, recipe-maintainers
- Add test section (imports, version assertion, pip check) for linter compliance
- Use ${CMAKE_ARGS} and portable cmake commands in build.sh
- Add all runtime deps with version bounds
- Move matplotlib to core deps (eagerly imported at module level)
- Remove conda_build_config.yaml; let conda-forge migrator handle variants
- Fall back to nproc when CPU_COUNT is unset for manual builds
CMake fixes:
- Use ${Python3_EXECUTABLE} instead of bare python3/pip3 to prevent macOS
system Python from shadowing the conda env Python
- Set Python3_ROOT_DIR and Python3_FIND_FRAMEWORK=NEVER when CONDA_PREFIX is set
- Remove -std=c++11 flag that conflicted with CMAKE_CXX_STANDARD 14
- Replace git FATAL_ERROR with warning + v0.0.0 fallback for conda-forge builds
- Add COMMAND_ERROR_IS_FATAL to pip install step
- Bump minimum CMake version to 3.19
Python packaging:
- Add install_requires to setup.py.in so pip check validates the dep graph
- Move pydm/qtpy/pyqtgraph/matplotlib to extras_require['gui']
C++ bug fix:
- Reorder Fifo member declarations so threadEn_ initializes before thread_
Docs:
- Add conda-forge submission README with recipe notes
- Update CMake prerequisite to 3.19 in build instructions
Validated: conda build + import test passed for Python 3.10-3.13.
numpy <2.0 does not support Python 3.13. Pin host numpy to >=2.0 to ensure conda-build resolves a compatible version for all supported Python variants (3.10–3.13).
Hardens the Linux hardware bridge layer (MemMap, AxiMemMap,
AxiStreamDma) against thread-lifetime, file-descriptor, and
shared-state bugs surfaced by the bug audit. Public API signatures
are unchanged; fixes live in implementation and internal members.
Each bug class is locked down by a source-audit doctest regression.
Thread ownership (RAII)
- Replace raw new std::thread + delete with std::unique_ptr<std::thread>
in MemMap, AxiMemMap, AxiStreamDma. Eliminates leaks on
construction-failure paths and ensures cleanup ordering.
FD_SETSIZE bounds checks (AxiStreamDma)
- Reject fd_ >= FD_SETSIZE synchronously in the ctor (close + throw
GeneralError) before the worker thread starts.
- In-loop FD_SETSIZE guards in acceptReq()/acceptFrame() include the
numeric limit and identify likely root causes (post-stop instance
vs process fd-table exhaustion).
- runThread() logs and exits the worker cleanly on out-of-range fd
instead of throwing across the thread boundary (which would call
std::terminate()).
Shared-buffer state race (AxiStreamDma)
- Add static std::mutex sharedBuffersMtx covering openShared,
closeShared, zeroCopyDisable, and the ctor-time reads/writes of
desc_->bCount / desc_->bSize. Add explicit #include <mutex>.
Drop dead #if 0 block.
openShared() error-path stale-fd leak (AxiStreamDma)
- closeShared() leaves entries in sharedBuffers_ after openCount drops
to 0 (only fd/bCount/bSize/rawBuff are reset). On error paths after
::open succeeded but a later step (driver version, dmaCheckVersion,
dmaMapDma) failed, the original code closed the fd but left ret->fd
non-negative; the next caller would re-enter the persistent record
and reuse a stale descriptor, returning EBADF on every op.
- Reset ret->fd = -1 (and scrub bCount/bSize/rawBuff in the dmaMapDma
failure path) on every error path before throwing. Comments
disambiguate ret->fd (shared record field) from AxiStreamDma::fd_
(per-instance member).
openCount accounting (AxiStreamDma)
- AxiStreamDmaShared ctor initializes openCount = 0 (was 1, double-
counting entries seeded by zeroCopyDisable() before any AxiStreamDma
user existed).
- openShared() increments openCount on every successful return path
(existing-open, zero-copy-disabled fall-through, and freshly opened
branch). The new-entry increment is placed after open + dmaMapDma
succeed so a pre-insertion throw cannot leak openCount.
- closeShared() guards ::close(desc->fd) with if (desc->fd != -1) -
avoids close(-1) returning EBADF every time the count reaches 0 in
the zeroCopyDisable() flow.
stop() idempotence + worker self-exit (AxiStreamDma)
- Now that runThread() can self-exit via the in-loop fd guard
(sets threadEn_ = false and returns), the previous if (threadEn_)-
gated cleanup would skip thread_->join() and ~unique_ptr<std::thread>
on a still-joinable thread would call std::terminate().
- Drive cleanup from per-resource state: thread_->joinable() for the
join, desc_ presence for closeShared (with desc_.reset() to prevent
double-close), fd_ >= 0 for the ::close. threadEn_ = false is set
unconditionally so the worker still receives the stop signal.
Function is idempotent across user-stop -> dtor and worker self-exit
-> stop sequences.
Use-after-stop guards (AxiStreamDma::acceptReq / acceptFrame)
- stop() resets desc_ to nullptr and closes fd_; previously
acceptReq() dereferenced desc_->bSize at function entry and
segfaulted on a post-stop call, and acceptFrame() only checked the
fd inside the inner write loop after frame->lock() and buffer
iteration had already run.
- Add fail-fast !desc_ || fd_ < 0 entry guards in both methods that
throw GeneralError("...stopped or did not finish construction").
Source-audit regression tests (tests/cpp/protocols/hardware/)
Each test scans the source for the audited invariant
(comment-stripped to avoid false positives):
- test_memmap_raw_thread_audit_repro.cpp
- test_axi_memmap_raw_thread_audit_repro.cpp
- test_axi_stream_dma_raw_thread_audit_repro.cpp
- test_axi_stream_dma_fd_setsize_audit_repro.cpp
- test_axi_stream_dma_disabled_block_audit_repro.cpp
- test_axi_stream_dma_shared_static_race_audit_repro.cpp
- test_axi_stream_dma_open_shared_stale_fd_audit_repro.cpp
- test_axi_stream_dma_open_count_audit_repro.cpp
- test_axi_stream_dma_stop_joinable_audit_repro.cpp
- test_axi_stream_dma_use_after_stop_audit_repro.cpp
…ship - RAII thread ownership (unique_ptr) in memory TcpClient/TcpServer and stream TcpCore — eliminates leak-on-throw in constructors - [[deprecated]] annotations on legacy close() methods; prefer stop() - Full zmq_sendmsg error checking with socket rebuild on partial multi-part send failure (both memory TcpServer and stream TcpCore) - Serialise TcpCore::stop() teardown with bridgeMtx_ so in-flight acceptFrame() calls complete before sockets are freed - Guard zmqPush_ null in acceptFrame(); self-heal via rebuildPushSocket() - Surface expired-transaction warnings in memory TcpClient (drop stale responses rather than silently mutating completed transactions) - Nullptr-safe stop/dtor teardown for both memory and stream classes - Expose _stop() to Python bindings for stream TcpClient/TcpServer and memory TcpClient - Update stream TCP bridge guide to note close() deprecation - Add C++ doctest and Python integration test repros
…+excGroups and writeEach=True.
feat(pyrogue::Device): move YAML config loading from Root to Device
fix(hardware): MemMap/AxiMemMap/AxiStreamDma audit hardening
…uard Remove inline comments that duplicate the error message and docstring. Switch from print(stderr) to logging.getLogger(__name__).error() for consistency with the rest of the pydm subpackage.
sync(drivers): refresh DmaDriver.h from aes-stream-driver
Strip multi-paragraph explanatory comments added by the hardening commit that restate what the code already makes obvious. Retain concise 1-liner comments only where the intent is non-obvious.
…cl warnings
- Close already-initialized zmq_msg parts when a subsequent
zmq_msg_init_size fails in TcpCore::acceptFrame(), preventing
resource leaks on allocation failure.
- Wrap Boost.Python .def("close", ...) bindings in diagnostic pragmas
to suppress -Wdeprecated-declarations from taking the address of the
[[deprecated]] close() methods. The Python binding intentionally
preserves the legacy API while the C++ attribute steers new callers.
Per review feedback — the guard itself is sufficient; a dedicated regression test for this failure mode is overkill.
fix(pyrogue::pydm): reject dual QApplication in runPyDM (+ docs)
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1226 +/- ##
==========================================
+ Coverage 59.29% 59.48% +0.19%
==========================================
Files 72 72
Lines 8245 8284 +39
Branches 1228 1234 +6
==========================================
+ Hits 4889 4928 +39
- Misses 3057 3062 +5
+ Partials 299 294 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add synchronous pre-write listeners that execute before hardware writes, enabling callers to block writes (via WriteBlockedError) or modify the value being written. Supports both variable-level and device-level registration. Includes 17 new pytest cases and narrative documentation.
Add project guidance for agents and tests
…client-server-core
fix(tcp): harden TCP transport teardown, send paths, and thread ownership
fix(conda): conda-forge compliance and cmake Python path fix
test(tcp): cover Python _stop bindings
feat(pyrogue): add pre-write event listeners (ESROGUE-743)
…ROGUE-711)
ZmqServer._varDone() iterated _updateList via pickle.dumps without
synchronization. When a value's __getstate__ released the GIL
(matplotlib transforms during SaveState / SaveConfig was the reported
case), Root._updateWorker could call _varUpdate() concurrently and
mutate the dict mid-iteration, raising:
RuntimeError: dictionary changed size during iteration
Wrap _varUpdate / _varDone in a threading.Lock and snapshot-and-swap
the pending dict before pickle.dumps, so the slow serializer iterates
a private reference no other thread can reach. Initialize lock and
dict before addVarListener so the worker thread cannot fire callbacks
against uninitialized state.
Add a deterministic reproducer test (test_zmq_server_varDone_race_with_slow_pickle)
that uses a slow-pickling value to widen the race window. The existing
race tests in this file used trivial integer values whose pickle is
too fast to reliably trigger the bug; the new test fails 5/5 on
unpatched code and passes 5/5 with the fix.
The same dict-mutate-during-serialize pattern in
pyrogue.interfaces.stream.Variable was investigated but is not
exploitable: PyYAML's represent_mapping calls list(mapping.items())
before iterating, snapshotting the dict atomically. Left untouched.
…ient::send errno (#1234) VirtualClient.__init__ previously wrapped _waitForRoot in try/except, called _cleanupFailedInit() to tear down the ZMQ context, then *printed* a "Failed to connect" banner and *returned normally*. The caller received a VirtualClient whose underlying ZmqClient had been destroyed, with no exception to signal the failure. The first subsequent _remoteAttr reached ZmqClient::send on a torn-down socket and crashed with the issue #1234 traceback: rogue.GeneralError: ZmqClient::send: General Error: zmq_sendmsg failed The underlying defect predates v6.12.0; v6.11.0 silently discarded the zmq_sendmsg return value, so the failure used to surface as the more confusing "Timeout waiting for response". The v6.12.0 throw added in a5bff52 made the dead-instance scenario crash visibly. Three production changes: 1. VirtualClient.__init__ raises ConnectionError (chained from the underlying handshake exception) when _waitForRoot fails, instead of print + return. No dead instance escapes the constructor. _cleanupFailedInit() still runs before the raise so sockets and the ZMQ context are released and the singleton cache stays clean. 2. Liveness guard at the top of VirtualClient._remoteAttr: any path that produces an instance with _vcInitialized=False (post-stop(), partial init, or future regressions) now fails fast with RuntimeError("VirtualClient is not connected: ...") instead of slipping through to the C++ zmq_sendmsg failed throw. 3. ZmqClient::send enriched throw: when zmq_sendmsg returns -1 the thrown rogue::GeneralError now embeds zmq_errno and zmq_strerror, so any residual failure on this path is self-describing. The "zmq_sendmsg failed" substring is preserved so existing message-matching callers continue to work. Behavioural contract change. VirtualClient(addr, port) against an unreachable target now raises ConnectionError instead of returning a dead instance. Callers that depended on the silent-return behaviour need to wrap the constructor in try/except ConnectionError. The exception message preserves the addr/port and the previous "Possible causes" list, and the underlying handshake exception is reachable via __cause__. The change is documented in the VirtualClient.__init__ docstring (autodoc-exposed). Tests: - tests/interfaces/test_zmq_client_efsm_recovery.py (new) eliminates the EFSM hypothesis: ZMQ_REQ_RELAXED has been set in the ZmqClient constructor since ca2c7cb, so the recv-timeout-then-retry path cannot wedge the REQ socket. - tests/interfaces/test_zmq_client_dead_after_failed_init.py (new) pins the new contract: __init__ raises ConnectionError with chained __cause__, and _remoteAttr on a stopped VirtualClient fails fast with RuntimeError rather than reaching ZmqClient::send. - tests/interfaces/test_interfaces_virtual_client_connect.py ::test_virtual_client_does_not_cache_failed_bootstrap is updated to expect ConnectionError and verify the recovery scenario still works. Verification: cmake --build build -j$(nproc) && cmake --build build --target install source build/setup_rogue.sh python -m pytest tests/interfaces -q -m "not perf and not epics" # -> 108 passed, 5 skipped (1 pre-existing pydm deselect unrelated) ctest --test-dir build --output-on-failure -L cpp # -> 13/14 passed (1 pre-existing tcpserver undefined-symbol failure # unrelated, reproduces on pristine pre-release) bash scripts/run_linters.sh # -> EXIT=0 Fixes #1234.
…1235) Two findings from the Copilot review on PR #1235: 1. _bind_responding_peer (and its echo thread) was defined but never used in this test module. Remove the dead helper along with its now-unused `threading` and `time` imports. 2. test_zmq_client_multiple_sends_after_repeated_recv_timeouts only asserted the negative ("not zmq_sendmsg failed"). The test could pass on an unexpected error mode (ETERM, ENOTSOCK, etc.) and silently lose coverage. Add a positive "Timeout" assertion per iteration, mirroring the existing check in test_zmq_client_second_send_after_recv_timeout_does_not_raise_sendmsg_failed.
Copilot review on PR #1235 caught a type-hint mismatch: _remoteAttr is annotated `attr: str` but _waitForRoot legitimately calls it with `attr=None` during the bootstrap handshake. Widen the parameter type to `str | None` to match actual usage and avoid misleading type information for downstream callers and type checkers. No behavioural change.
…send assertion (#1235) Two findings from the Copilot review on PR #1235: 1. The __init__ docstring claimed the underlying handshake exception was reachable via ConnectionError.__cause__, but _remoteAttr re-raised without `from e`, so the underlying transport error landed on __context__ (implicit) rather than __cause__ (explicit). Add `raise ... from e` in _remoteAttr so the rogue.GeneralError from ZmqClient::send is reachable via the standard __cause__ chain, and tighten the __init__ docstring to describe the two-step chain accurately (ConnectionError.__cause__ -> wrapping Exception -> __cause__ -> rogue.GeneralError). 2. test_zmq_client_second_send_after_recv_timeout_does_not_raise_sendmsg_failed asserted only the negative ("not zmq_sendmsg failed") on the first send. If the first send failed on a different error mode (early disconnect / ETERM / etc.), the EFSM precondition for the second send wouldn't actually hold and the test would pass vacuously. Add a positive "Timeout" assertion on the first send, mirroring the existing check on the second send and the per-iteration check added in bb56781.
ZmqClient::send() and ZmqClient::sendString() retry the recv leg forever when setTimeout(..., waitRetry=true) is in effect. The 10 s RCVTIMEO expires but the loop logs and continues, swallowing any genuine timeout intent the caller passed. Issue #1236 is the direct fallout: a SMuRF deployment running rogue at 1a04b1e hung 20+ minutes on a RemoteVariable.get(), only unwedging after the server docker was restarted. The 10 s timeout pysmurf configured did not fire because pysmurf intentionally uses self._client.setTimeout(10000, True) with the comment "Retries forever anyway". Survey of fix status before this patch: - 1a04b1e..v6.12.0: three ZMQ commits landed (a5bff52, c8677e4, 83d3e7d). They harden the server REP FSM and _Virtual locking but do not bound the client retry loop. - git log v6.12.0..pre-release -- ZmqClient.{h,cpp} ZmqServer.{h,cpp} _Virtual.py is empty. - Open ZMQ-related PRs (#1235, #1232, #1202, #1209) cover the bootstrap-failure crash, REP _updateList mutation, server-side hardening, and SqlLogger/SideBandSim. None touch the recv retry cap. This PR adds a single, backwards-compatible opt-in: void setTimeout(uint32_t msecs, bool waitRetry, uint32_t maxRetries = 0); - maxRetries == 0 (default) preserves the historic forever-retry contract that pysmurf, _Virtual.py:482, and any other 2-arg caller relies on. - maxRetries > 0 caps both recv loops in send()/sendString(); after that many RCVTIMEO trips the loop throws the same rogue::GeneralError("Timeout waiting for response after %d Seconds") the waitRetry == false path throws today. The Boost.Python binding switches to keyword-default form: .def("setTimeout", &ZmqClient::setTimeout, (bp::arg("msecs"), bp::arg("waitRetry"), bp::arg("maxRetries") = 0u)) so existing 2-arg call sites compile and behave unchanged. Adopting the cap is a separate one-line patch on each caller (pysmurf, downstream sodetlib, ...); that adoption is out of scope here. Tests: - tests/interfaces/test_zmq_client_bounded_waitretry.py (new): * test_setTimeout_three_arg_form_bounds_retries: silent REP peer, setTimeout(50, True, 3) must raise rogue.GeneralError with "Timeout" in ~150 ms. Fails on pristine pre-release HEAD with Boost.Python.ArgumentError, proving the API surface didn't exist. * test_setTimeout_two_arg_form_keeps_unbounded_retry: backwards- compat pin. Daemon thread running _send must still be alive after 1 s with setTimeout(50, True). Verification (local -DROGUE_INSTALL=local -DROGUE_BUILD_TESTS=ON build): pytest tests/interfaces/test_zmq_client_bounded_waitretry.py -v # -> 2 passed pytest tests/interfaces -q -m "not perf and not epics" # -> 105 passed, 5 skipped, 1 failed # (the 1 failure is # test_pydm_rogue_plugin.py::test_rogue_connection_link_state_refreshes_static_name_channels, # a pre-existing failure that reproduces on pristine pre-release # HEAD and is unrelated to this fix; same failure PR #1235 also # called out as unrelated) pytest tests/integration tests/core -q -m "not perf and not epics" # -> 307 passed ctest --test-dir build --output-on-failure -L cpp # -> 14/14 passed bash scripts/run_linters.sh # -> EXIT=0 Fixes the rogue side of #1236. Pysmurf-side adoption (passing a finite maxRetries to setTimeout) is a separate one-line follow-up.
fix(zmq): cap waitRetry recv loop in ZmqClient (#1236)
fix(zmq): raise from VirtualClient bootstrap + liveness guard + ZmqClient::send errno (#1234)
The ``VirtualClient`` link-monitor thread was created without ``daemon=True``. Combined with ``VirtualClient.ClientCache`` keeping a strong reference to every constructed instance, this kept the Python interpreter alive on ``quit()`` / ``exit()`` in interactive sessions that did not explicitly call ``stop()``. Symptom reported by pysmurf users: ipython hangs forever on quit. Changes: * Mark ``_monThread`` daemon so interpreter shutdown can proceed even when the instance is still pinned in ``ClientCache``. * Add a public ``stopMonitor()`` method that drains the monitor thread without releasing ZMQ resources. ``stop()`` now delegates to ``stopMonitor()`` before tearing down the C++ sockets and removing the instance from ``ClientCache``. This gives long-lived callers (cf. slaclab/pysmurf#991) a library-side knob to silence the heartbeat without losing their connection. * Regression tests in ``tests/interfaces/test_virtual_client_monitor_daemon.py`` pin the daemon flag, ``stopMonitor()`` semantics (ZMQ stays open; real round-trip succeeds; instance remains cached), and the ``stopMonitor()`` -> ``stop()`` sequencing contract. * ``FakeThread`` in ``tests/interfaces/test_interfaces_virtual.py`` now accepts the ``daemon`` kwarg passed by ``VirtualClient.__init__``.
fix(virtualclient): daemon monitor thread + stopMonitor() API (#1238)
fix(zmq): guard ZmqServer._updateList against concurrent mutation (ESROGUE-711)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description