Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort by Copilot · Pull Request #564 · brainiak/brainiak

Copilot · 2026-03-30T18:32:00Z

Three independent CI failures exposed after fail-fast: false was added to the matrix. macOS and Windows now pass; this PR fixes the remaining failures across all platforms.

`fastsrm.py`: numpy 2.3 removed `_read_array_header`

np.lib.format._read_array_header was a private API removed in numpy 2.3. Replaced with the versioned public API and added a with context manager (the bare open() was also causing PermissionError: [WinError 32] on Windows when the file stayed open during temp-dir cleanup):

# Before
f = open(path, "rb")
version = np.lib.format.read_magic(f)
shape, fortran_order, dtype = np.lib.format._read_array_header(f, version)
f.close()

# After
with open(path, "rb") as f:
    version = np.lib.format.read_magic(f)
    if version == (1, 0):
        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    elif version == (2, 0):
        shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
    ...

CI: MPI platform split — MPICH on Linux

OpenMPI on Ubuntu GitHub Actions runners hangs when mpiexec spawns subprocesses. Split the mpi4py/setup-mpi step into OS-specific variants: mpich on Linux, default (OpenMPI) on macOS, msmpi on Windows (unchanged).

`tests/conftest.py`: `pool_size` fixture logic bug causing EFA abort + mpiexec hang

The pool_size fixture was supposed to return 1 inside MPI subprocesses to prevent fork()-after-MPI_Init. The guard was:

if (multiprocessing.get_start_method() != "fork" and   # always False on Linux
        MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is not None):
    return 1

On Linux, get_start_method() returns "fork" by default, so the APPNUM check was never reached. With MPICH, pool_size=2 was returned, multiprocessing.Pool(2) forked inside MPI ranks, which triggered the libfabric EFA provider on AWS runners to emit "Your job will now abort." + SIGABRT. mpiexec then hung indefinitely → every affected test hit the 120 s timeout.

Fix: drop the start-method guard so APPNUM is not None alone gates the MPI-subprocess path:

# Whenever inside an mpiexec subprocess, avoid fork() regardless of start method
if MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is not None:
    return 1

Also removes the now-unused import multiprocessing.

- Incorporate uv-based CI workflow from codex/migrate-ci-workflow-to-use-uv-projects - Add [tool.uv] no-binary-package = ['mpi4py'] to force source build - Fix run-tests.sh to skip pip check when --sdist-mode is passed Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/66d28b1c-6cb1-4922-9d90-d7f552a612b6 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>

- pydicom: >=3.0.2 (path traversal CVE, was 2.4.4/3.0.1) - pyOpenSSL: >=26.0.0 (DTLS buffer overflow, was 25.3.0) - tornado: >=6.5.5 (DoS via multipart, was 6.5.2) - wheel: >=0.46.2 (path traversal file permissions, was 0.45.1) - requires-python bumped to >=3.10 (pydicom 3.x requires 3.10+; Python 3.9 was already removed from CI matrix in PR #562) - Regenerate uv.lock with all patched versions Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/66d28b1c-6cb1-4922-9d90-d7f552a612b6 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>

The Synchronize project environment step built C extensions using /usr/bin/c++ (Apple clang, no OpenMP) because CC/CXX were only set in the later Run static analysis and tests step. Set CC/CXX to Homebrew LLVM clang before uv sync on macOS so find_package(OpenMP) succeeds and omp.h is found. Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/54422bd2-e720-4ac6-9b72-a70f095c6292 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>

- Remove global no-binary-package = ["mpi4py"] from pyproject.toml; pass --no-binary-package mpi4py only on Linux in CI (where pre-built wheels cause libmpi.so.12 lookup failures). On Windows the source build fails with TypeError: copy_file() got an unexpected keyword argument 'dry_run' in mpi4py 4.1.1's mpidistutils.py. - Change fail-fast: true to fail-fast: false so a single job failure does not cancel all other CI matrix jobs. Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/3d006e66-5443-4a79-bc65-7ae6be15eb99 Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>

… MPI timeout - src/brainiak/funcalign/fastsrm.py: replace private np.lib.format._read_array_header (removed in numpy 2.3.x) with public read_array_header_1_0 / read_array_header_2_0 APIs; also wrap the open() call in a context manager so the file is always closed, fixing the Windows PermissionError [WinError 32] that occurred when an exception left the file handle open during temp directory cleanup. - .github/workflows/ci.yml: split the MPI setup step into three OS-specific steps: msmpi on Windows (unchanged), mpich on Linux (OpenMPI tends to hang in GitHub Actions runners causing 120 s timeouts), default OpenMPI on macOS (unchanged). Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/fa58e70d-812d-429c-b5ce-3283f7fd8f2b Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>

…bort The pool_size fixture in tests/conftest.py had a logic bug: the condition `multiprocessing.get_start_method() != "fork" and APPNUM is not None` always evaluates to False on Linux because Linux Python defaults to the "fork" start method. With MPICH on Linux this caused pool_size=2 to be returned, so the tests created a multiprocessing.Pool(2) inside MPI rank processes. The resulting fork()-after-MPI_Init triggered the libfabric EFA provider (used by AWS-hosted GitHub Actions runners) to print "Your job will now abort." and send SIGABRT, after which mpiexec itself hung indefinitely, causing every affected test to hit the 120-second timeout. Fix: drop the get_start_method() guard from the APPNUM check. Now, any time we are running inside an MPI subprocess (APPNUM is not None), pool_size=1 is returned regardless of the multiprocessing start method. This prevents fork()-inside-MPI on all platforms. The now-unused `multiprocessing` import is also removed. Agent-Logs-Url: https://github.com/brainiak/brainiak/sessions/a090ab66-7a3f-48dc-8123-94b324ab2d1d Co-authored-by: mihaic <165546+mihaic@users.noreply.github.com>

mihaic · 2026-03-30T23:20:13Z

What do you think, @davidt0x? Have a look at the diff against PR #562:
https://github.com/brainiak/brainiak/compare/codex/migrate-ci-workflow-to-use-uv-projects..copilot/fix-libmpi-error

I'm not sure how the pytest pool size change affects MPI code coverage. I also don't think we need MPICH on Linux, but it's probably better to have different backends.

davidt0x · 2026-04-02T14:04:42Z

Looks pretty reasonable to me. I do remember there are combinations of OS\MPI-flavor and multiprocess pools that work, I think MPICH on Mac it works if spawn is used. Maybe intel-mpi and windows as well. However, I don't think the coverage should change too much if we just disable it everywhere in CI. We would probably lose coverage of the pool_size>1 branch:

brainiak/src/brainiak/searchlight/searchlight.py

Lines 444 to 456 in be703ac

    
           if processes > 1: 
        
               with Pool(processes) as pool: 
        
                   for idx, block in enumerate(self.blocks): 
        
                       result = pool.apply_async( 
        
                           block_fn, 
        
                           ([subproblem[idx] for subproblem in self.subproblems], 
        
                            self.submasks[idx], 
        
                            self.sl_rad, 
        
                            self.bcast_var, 
        
                            extra_block_fn_params)) 
        
                       results.append((block[0], result)) 
        
                   local_outputs = [(result[0], result[1].get()) 
        
                                    for result in results]

and the use_multiprocessing=True branch:

brainiak/src/brainiak/fcma/voxelselector.py

Lines 444 to 452 in be703ac

    
           if isinstance(clf, sklearn.svm.SVC) and clf.kernel == 'precomputed'\ 
        
                   and self.use_multiprocessing: 
        
               inlist = [(clf, i + task[0], self.num_folds, data[i, :, :], 
        
                          self.labels) for i in range(task[1])] 
        
               with multiprocessing.Pool(self.process_num) as pool: 
        
                   results = list(pool.starmap(_cross_validation_for_one_voxel, 
        
                                               inlist)) 
        
           else:

.

Maybe we should look into using mpi4py.futures based on this recommendation.

Anyway, I think I am fine with this work around for now.

mihaic · 2026-04-02T21:45:55Z

Approvals from users that collaborated with the coding agent on changes will not satisfy review requirements.

It looks like we need an actual review from you, @davidt0x. :)

davidt0x

LGTM

Initial plan

8fc13cb

Copilot AI assigned Copilot and mihaic Mar 30, 2026

Copilot started work on behalf of mihaic March 30, 2026 18:32 View session

Copilot AI and others added 2 commits March 30, 2026 18:44

Copilot AI changed the title ~~[WIP] Fix libmpi error during tests~~ Fix libmpi.so.12 import error and patch vulnerable dependencies in uv-based CI Mar 30, 2026

Copilot AI requested a review from mihaic March 30, 2026 18:52

Copilot finished work on behalf of mihaic March 30, 2026 18:52

Copilot started work on behalf of mihaic March 30, 2026 19:10 View session

Copilot AI changed the title ~~Fix libmpi.so.12 import error and patch vulnerable dependencies in uv-based CI~~ Fix macOS CI build: set CC/CXX before uv sync; patch vulnerable deps; fix mpi4py libmpi.so error Mar 30, 2026

Copilot finished work on behalf of mihaic March 30, 2026 19:16

Copilot started work on behalf of mihaic March 30, 2026 19:56 View session

Copilot AI changed the title ~~Fix macOS CI build: set CC/CXX before uv sync; patch vulnerable deps; fix mpi4py libmpi.so error~~ Fix macOS omp.h not found, Windows mpi4py build failure, and CI fail-fast Mar 30, 2026

Copilot finished work on behalf of mihaic March 30, 2026 20:02

Copilot started work on behalf of mihaic March 30, 2026 21:01 View session

Copilot AI changed the title ~~Fix macOS omp.h not found, Windows mpi4py build failure, and CI fail-fast~~ Fix Windows mpi4py build error, Linux MPI timeouts, and numpy 2.3 compatibility Mar 30, 2026

Copilot finished work on behalf of mihaic March 30, 2026 21:14

Copilot started work on behalf of mihaic March 30, 2026 22:32 View session

Copilot AI changed the title ~~Fix Windows mpi4py build error, Linux MPI timeouts, and numpy 2.3 compatibility~~ Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort Mar 30, 2026

Copilot finished work on behalf of mihaic March 30, 2026 22:42

mihaic marked this pull request as ready for review March 30, 2026 23:15

mihaic enabled auto-merge (squash) April 2, 2026 21:44

mihaic approved these changes Apr 2, 2026

View reviewed changes

davidt0x self-requested a review April 2, 2026 23:56

davidt0x approved these changes Apr 2, 2026

View reviewed changes

mihaic merged commit d0c4892 into master Apr 2, 2026
18 checks passed

mihaic deleted the copilot/fix-libmpi-error branch April 2, 2026 23:57

mihaic mentioned this pull request Apr 3, 2026

Lock uv dependencies for CI #562

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort#564

Fix CI failures: numpy 2.3 compat, Windows file handle leak, Linux MPI EFA abort#564
mihaic merged 7 commits into
masterfrom
copilot/fix-libmpi-error

Copilot AI commented Mar 30, 2026 •

edited

Loading

Uh oh!

mihaic commented Mar 30, 2026

Uh oh!

davidt0x commented Apr 2, 2026

Uh oh!

mihaic commented Apr 2, 2026

Uh oh!

davidt0x left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Copilot AI commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

fastsrm.py: numpy 2.3 removed _read_array_header

CI: MPI platform split — MPICH on Linux

tests/conftest.py: pool_size fixture logic bug causing EFA abort + mpiexec hang

Uh oh!

mihaic commented Mar 30, 2026

Uh oh!

davidt0x commented Apr 2, 2026

Uh oh!

mihaic commented Apr 2, 2026

Uh oh!

davidt0x left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Mar 30, 2026 •

edited

Loading

`fastsrm.py`: numpy 2.3 removed `_read_array_header`

`tests/conftest.py`: `pool_size` fixture logic bug causing EFA abort + mpiexec hang