Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort#564
Conversation
- Incorporate uv-based CI workflow from codex/migrate-ci-workflow-to-use-uv-projects - Add [tool.uv] no-binary-package = ['mpi4py'] to force source build - Fix run-tests.sh to skip pip check when --sdist-mode is passed Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/66d28b1c-6cb1-4922-9d90-d7f552a612b6 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
- pydicom: >=3.0.2 (path traversal CVE, was 2.4.4/3.0.1) - pyOpenSSL: >=26.0.0 (DTLS buffer overflow, was 25.3.0) - tornado: >=6.5.5 (DoS via multipart, was 6.5.2) - wheel: >=0.46.2 (path traversal file permissions, was 0.45.1) - requires-python bumped to >=3.10 (pydicom 3.x requires 3.10+; Python 3.9 was already removed from CI matrix in PR #562) - Regenerate uv.lock with all patched versions Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/66d28b1c-6cb1-4922-9d90-d7f552a612b6 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
The Synchronize project environment step built C extensions using /usr/bin/c++ (Apple clang, no OpenMP) because CC/CXX were only set in the later Run static analysis and tests step. Set CC/CXX to Homebrew LLVM clang before uv sync on macOS so find_package(OpenMP) succeeds and omp.h is found. Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/54422bd2-e720-4ac6-9b72-a70f095c6292 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
- Remove global no-binary-package = ["mpi4py"] from pyproject.toml; pass --no-binary-package mpi4py only on Linux in CI (where pre-built wheels cause libmpi.so.12 lookup failures). On Windows the source build fails with TypeError: copy_file() got an unexpected keyword argument 'dry_run' in mpi4py 4.1.1's mpidistutils.py. - Change fail-fast: true to fail-fast: false so a single job failure does not cancel all other CI matrix jobs. Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/3d006e66-5443-4a79-bc65-7ae6be15eb99 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
… MPI timeout - src/brainiak/funcalign/fastsrm.py: replace private np.lib.format._read_array_header (removed in numpy 2.3.x) with public read_array_header_1_0 / read_array_header_2_0 APIs; also wrap the open() call in a context manager so the file is always closed, fixing the Windows PermissionError [WinError 32] that occurred when an exception left the file handle open during temp directory cleanup. - .github/workflows/ci.yml: split the MPI setup step into three OS-specific steps: msmpi on Windows (unchanged), mpich on Linux (OpenMPI tends to hang in GitHub Actions runners causing 120 s timeouts), default OpenMPI on macOS (unchanged). Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/fa58e70d-812d-429c-b5ce-3283f7fd8f2b Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
…bort The pool_size fixture in tests/conftest.py had a logic bug: the condition `multiprocessing.get_start_method() != "fork" and APPNUM is not None` always evaluates to False on Linux because Linux Python defaults to the "fork" start method. With MPICH on Linux this caused pool_size=2 to be returned, so the tests created a multiprocessing.Pool(2) inside MPI rank processes. The resulting fork()-after-MPI_Init triggered the libfabric EFA provider (used by AWS-hosted GitHub Actions runners) to print "Your job will now abort." and send SIGABRT, after which mpiexec itself hung indefinitely, causing every affected test to hit the 120-second timeout. Fix: drop the get_start_method() guard from the APPNUM check. Now, any time we are running inside an MPI subprocess (APPNUM is not None), pool_size=1 is returned regardless of the multiprocessing start method. This prevents fork()-inside-MPI on all platforms. The now-unused `multiprocessing` import is also removed. Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/a090ab66-7a3f-48dc-8123-94b324ab2d1d Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
|
What do you think, @davidt0x? Have a look at the diff against PR #562: I'm not sure how the pytest pool size change affects MPI code coverage. I also don't think we need MPICH on Linux, but it's probably better to have different backends. |
|
Looks pretty reasonable to me. I do remember there are combinations of OS\MPI-flavor and multiprocess pools that work, I think MPICH on Mac it works if spawn is used. Maybe intel-mpi and windows as well. However, I don't think the coverage should change too much if we just disable it everywhere in CI. We would probably lose coverage of the pool_size>1 branch: brainiak/src/brainiak/searchlight/searchlight.py Lines 444 to 456 in be703ac brainiak/src/brainiak/fcma/voxelselector.py Lines 444 to 452 in be703ac Maybe we should look into using mpi4py.futures based on this recommendation. Anyway, I think I am fine with this work around for now. |
It looks like we need an actual review from you, @davidt0x. :) |
Three independent CI failures exposed after
fail-fast: falsewas added to the matrix. macOS and Windows now pass; this PR fixes the remaining failures across all platforms.fastsrm.py: numpy 2.3 removed_read_array_headernp.lib.format._read_array_headerwas a private API removed in numpy 2.3. Replaced with the versioned public API and added awithcontext manager (the bareopen()was also causingPermissionError: [WinError 32]on Windows when the file stayed open during temp-dir cleanup):CI: MPI platform split — MPICH on Linux
OpenMPI on Ubuntu GitHub Actions runners hangs when
mpiexecspawns subprocesses. Split thempi4py/setup-mpistep into OS-specific variants:mpichon Linux, default (OpenMPI) on macOS,msmpion Windows (unchanged).tests/conftest.py:pool_sizefixture logic bug causing EFA abort + mpiexec hangThe
pool_sizefixture was supposed to return1inside MPI subprocesses to preventfork()-after-MPI_Init. The guard was:On Linux,
get_start_method()returns"fork"by default, so theAPPNUMcheck was never reached. With MPICH,pool_size=2was returned,multiprocessing.Pool(2)forked inside MPI ranks, which triggered the libfabric EFA provider on AWS runners to emit"Your job will now abort."+ SIGABRT.mpiexecthen hung indefinitely → every affected test hit the 120 s timeout.Fix: drop the start-method guard so
APPNUM is not Nonealone gates the MPI-subprocess path:Also removes the now-unused
import multiprocessing.