Skip to content

Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort#564

Merged
mihaic merged 7 commits into
masterfrom
copilot/fix-libmpi-error
Apr 2, 2026
Merged

Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort#564
mihaic merged 7 commits into
masterfrom
copilot/fix-libmpi-error

Conversation

Copilot AI commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Three independent CI failures exposed after fail-fast: false was added to the matrix. macOS and Windows now pass; this PR fixes the remaining failures across all platforms.

fastsrm.py: numpy 2.3 removed _read_array_header

np.lib.format._read_array_header was a private API removed in numpy 2.3. Replaced with the versioned public API and added a with context manager (the bare open() was also causing PermissionError: [WinError 32] on Windows when the file stayed open during temp-dir cleanup):

# Before
f = open(path, "rb")
version = np.lib.format.read_magic(f)
shape, fortran_order, dtype = np.lib.format._read_array_header(f, version)
f.close()

# After
with open(path, "rb") as f:
    version = np.lib.format.read_magic(f)
    if version == (1, 0):
        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    elif version == (2, 0):
        shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
    ...

CI: MPI platform split — MPICH on Linux

OpenMPI on Ubuntu GitHub Actions runners hangs when mpiexec spawns subprocesses. Split the mpi4py/setup-mpi step into OS-specific variants: mpich on Linux, default (OpenMPI) on macOS, msmpi on Windows (unchanged).

tests/conftest.py: pool_size fixture logic bug causing EFA abort + mpiexec hang

The pool_size fixture was supposed to return 1 inside MPI subprocesses to prevent fork()-after-MPI_Init. The guard was:

if (multiprocessing.get_start_method() != "fork" and   # always False on Linux
        MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is not None):
    return 1

On Linux, get_start_method() returns "fork" by default, so the APPNUM check was never reached. With MPICH, pool_size=2 was returned, multiprocessing.Pool(2) forked inside MPI ranks, which triggered the libfabric EFA provider on AWS runners to emit "Your job will now abort." + SIGABRT. mpiexec then hung indefinitely → every affected test hit the 120 s timeout.

Fix: drop the start-method guard so APPNUM is not None alone gates the MPI-subprocess path:

# Whenever inside an mpiexec subprocess, avoid fork() regardless of start method
if MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is not None:
    return 1

Also removes the now-unused import multiprocessing.

Copilot AI and others added 2 commits March 30, 2026 18:44
- Incorporate uv-based CI workflow from codex/migrate-ci-workflow-to-use-uv-projects
- Add [tool.uv] no-binary-package = ['mpi4py'] to force source build
- Fix run-tests.sh to skip pip check when --sdist-mode is passed

Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/66d28b1c-6cb1-4922-9d90-d7f552a612b6

Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
- pydicom: >=3.0.2 (path traversal CVE, was 2.4.4/3.0.1)
- pyOpenSSL: >=26.0.0 (DTLS buffer overflow, was 25.3.0)
- tornado: >=6.5.5 (DoS via multipart, was 6.5.2)
- wheel: >=0.46.2 (path traversal file permissions, was 0.45.1)
- requires-python bumped to >=3.10 (pydicom 3.x requires 3.10+;
  Python 3.9 was already removed from CI matrix in PR #562)
- Regenerate uv.lock with all patched versions

Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/66d28b1c-6cb1-4922-9d90-d7f552a612b6

Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix libmpi error during tests Fix libmpi.so.12 import error and patch vulnerable dependencies in uv-based CI Mar 30, 2026
Copilot AI requested a review from mihaic March 30, 2026 18:52
The Synchronize project environment step built C extensions using
/usr/bin/c++ (Apple clang, no OpenMP) because CC/CXX were only set
in the later Run static analysis and tests step. Set CC/CXX to
Homebrew LLVM clang before uv sync on macOS so find_package(OpenMP)
succeeds and omp.h is found.

Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/54422bd2-e720-4ac6-9b72-a70f095c6292

Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
Copilot AI changed the title Fix libmpi.so.12 import error and patch vulnerable dependencies in uv-based CI Fix macOS CI build: set CC/CXX before uv sync; patch vulnerable deps; fix mpi4py libmpi.so error Mar 30, 2026
- Remove global no-binary-package = ["mpi4py"] from pyproject.toml;
  pass --no-binary-package mpi4py only on Linux in CI (where pre-built
  wheels cause libmpi.so.12 lookup failures). On Windows the source
  build fails with TypeError: copy_file() got an unexpected keyword
  argument 'dry_run' in mpi4py 4.1.1's mpidistutils.py.
- Change fail-fast: true to fail-fast: false so a single job failure
  does not cancel all other CI matrix jobs.

Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/3d006e66-5443-4a79-bc65-7ae6be15eb99

Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
Copilot AI changed the title Fix macOS CI build: set CC/CXX before uv sync; patch vulnerable deps; fix mpi4py libmpi.so error Fix macOS omp.h not found, Windows mpi4py build failure, and CI fail-fast Mar 30, 2026
… MPI timeout

- src/brainiak/funcalign/fastsrm.py: replace private
  np.lib.format._read_array_header (removed in numpy 2.3.x) with public
  read_array_header_1_0 / read_array_header_2_0 APIs; also wrap the
  open() call in a context manager so the file is always closed, fixing
  the Windows PermissionError [WinError 32] that occurred when an
  exception left the file handle open during temp directory cleanup.
- .github/workflows/ci.yml: split the MPI setup step into three
  OS-specific steps: msmpi on Windows (unchanged), mpich on Linux
  (OpenMPI tends to hang in GitHub Actions runners causing 120 s
  timeouts), default OpenMPI on macOS (unchanged).

Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/fa58e70d-812d-429c-b5ce-3283f7fd8f2b

Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
Copilot AI changed the title Fix macOS omp.h not found, Windows mpi4py build failure, and CI fail-fast Fix Windows mpi4py build error, Linux MPI timeouts, and numpy 2.3 compatibility Mar 30, 2026
…bort

The pool_size fixture in tests/conftest.py had a logic bug: the condition
`multiprocessing.get_start_method() != "fork" and APPNUM is not None`
always evaluates to False on Linux because Linux Python defaults to the
"fork" start method.  With MPICH on Linux this caused pool_size=2 to be
returned, so the tests created a multiprocessing.Pool(2) inside MPI rank
processes.  The resulting fork()-after-MPI_Init triggered the libfabric
EFA provider (used by AWS-hosted GitHub Actions runners) to print "Your
job will now abort." and send SIGABRT, after which mpiexec itself hung
indefinitely, causing every affected test to hit the 120-second timeout.

Fix: drop the get_start_method() guard from the APPNUM check.  Now, any
time we are running inside an MPI subprocess (APPNUM is not None),
pool_size=1 is returned regardless of the multiprocessing start method.
This prevents fork()-inside-MPI on all platforms.  The now-unused
`multiprocessing` import is also removed.

Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/a090ab66-7a3f-48dc-8123-94b324ab2d1d

Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>
Copilot AI changed the title Fix Windows mpi4py build error, Linux MPI timeouts, and numpy 2.3 compatibility Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort Mar 30, 2026
@mihaic mihaic marked this pull request as ready for review March 30, 2026 23:15
@mihaic

mihaic commented Mar 30, 2026

Copy link
Copy Markdown
Member

What do you think, @davidt0x? Have a look at the diff against PR #562:
https://github.com/brainiak/brainiak/compare/codex/migrate-ci-workflow-to-use-uv-projects..copilot/fix-libmpi-error

I'm not sure how the pytest pool size change affects MPI code coverage. I also don't think we need MPICH on Linux, but it's probably better to have different backends.

@davidt0x

davidt0x commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Looks pretty reasonable to me. I do remember there are combinations of OS\MPI-flavor and multiprocess pools that work, I think MPICH on Mac it works if spawn is used. Maybe intel-mpi and windows as well. However, I don't think the coverage should change too much if we just disable it everywhere in CI. We would probably lose coverage of the pool_size>1 branch:

if processes > 1:
with Pool(processes) as pool:
for idx, block in enumerate(self.blocks):
result = pool.apply_async(
block_fn,
([subproblem[idx] for subproblem in self.subproblems],
self.submasks[idx],
self.sl_rad,
self.bcast_var,
extra_block_fn_params))
results.append((block[0], result))
local_outputs = [(result[0], result[1].get())
for result in results]
and the use_multiprocessing=True branch:
if isinstance(clf, sklearn.svm.SVC) and clf.kernel == 'precomputed'\
and self.use_multiprocessing:
inlist = [(clf, i + task[0], self.num_folds, data[i, :, :],
self.labels) for i in range(task[1])]
with multiprocessing.Pool(self.process_num) as pool:
results = list(pool.starmap(_cross_validation_for_one_voxel,
inlist))
else:
.

Maybe we should look into using mpi4py.futures based on this recommendation.

Anyway, I think I am fine with this work around for now.

@mihaic mihaic enabled auto-merge (squash) April 2, 2026 21:44
@mihaic

mihaic commented Apr 2, 2026

Copy link
Copy Markdown
Member

Approvals from users that collaborated with the coding agent on changes will not satisfy review requirements.

It looks like we need an actual review from you, @davidt0x. :)

@davidt0x davidt0x self-requested a review April 2, 2026 23:56

@davidt0x davidt0x left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mihaic mihaic merged commit d0c4892 into master Apr 2, 2026
18 checks passed
@mihaic mihaic deleted the copilot/fix-libmpi-error branch April 2, 2026 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants