
Add ATLAS_DEBUG_SYNC macros with a timeout to catch deadlocks #385

Merged
wdeconinck merged 4 commits into develop from feature/ATLAS_DEBUG_SYNC
Jun 13, 2026

@wdeconinck wdeconinck commented Jun 12, 2026

This pull request adds a synchronized debugging and logging mechanism for MPI-parallel applications in Atlas. It replaces the older ATLAS_DEBUG_PARALLEL macros with new ATLAS_DEBUG_SYNC macros, which synchronize output across MPI ranks, detect barrier timeouts, and can be configured for debugging complex parallel scenarios. It also adds tests that validate the new debugging features.

The most important changes are:

Synchronized Debugging and Logging Functionality

  • Introduced new ATLAS_DEBUG_SYNC and ATLAS_DEBUG_SYNC_VAR macros in Log.h to replace the old ATLAS_DEBUG_PARALLEL macros, providing synchronized logging across MPI ranks with flexible argument handling and improved usability.
  • Implemented a new set of debug_sync functions in Log.cc that handle MPI barriers with configurable timeouts and actions (abort, throw, continue), as well as synchronized flushing of output streams and detailed backtraces for debugging.
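The barrier-with-timeout idea can be illustrated without MPI. The following is a minimal sketch, not the code in Log.cc: `completes_within` is a hypothetical helper, and the blocking call it receives stands in for an MPI barrier. It assumes the wait happens on a helper thread while the caller waits on a condition variable with a deadline.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not part of Atlas): runs `blocking_call` on a helper
// thread and returns true if it finished within `timeout`, false otherwise.
bool completes_within(std::function<void()> blocking_call,
                      std::chrono::milliseconds timeout) {
    // Shared state outlives this function if the call never returns.
    struct State {
        std::mutex mutex;
        std::condition_variable cv;
        bool done = false;
    };
    auto state = std::make_shared<State>();

    // Detached so that a call stuck forever does not block the caller on join.
    std::thread([state, call = std::move(blocking_call)] {
        call();  // stand-in for the blocking barrier
        {
            std::lock_guard<std::mutex> lock(state->mutex);
            state->done = true;
        }
        state->cv.notify_one();
    }).detach();

    // Wait until the call completes or the deadline passes.
    std::unique_lock<std::mutex> lock(state->mutex);
    return state->cv.wait_for(lock, timeout, [&] { return state->done; });
}
```

A caller would pass the blocking operation as a lambda and choose an action when the function returns false. In a real MPI program, a non-blocking barrier polled against a deadline is another option, since it avoids calling into MPI from a second thread.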

Timeout and Error Handling for MPI Barriers

  • Added logic to detect and handle MPI barrier timeouts, including configurable actions via the ATLAS_MPI_BARRIER_TIMEOUT_ACTION environment variable and reporting of ranks that timed out.
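As an illustration of how such an environment variable could be mapped to an action, here is a small sketch. Only the variable name and the three actions (abort, throw, continue) come from this description; the function names, the exact accepted spellings, and the fallback to abort are assumptions, not the behaviour of Log.cc.

```cpp
#include <cstdlib>
#include <string>

enum class TimeoutAction { Abort, Throw, Continue };

// Hypothetical parser: the lowercase spellings accepted here are an assumption.
// Unset or unrecognised values yield `fallback`.
TimeoutAction parse_timeout_action(const char* value, TimeoutAction fallback) {
    if (value == nullptr) {
        return fallback;
    }
    const std::string s(value);
    if (s == "abort")    return TimeoutAction::Abort;
    if (s == "throw")    return TimeoutAction::Throw;
    if (s == "continue") return TimeoutAction::Continue;
    return fallback;
}

// Reads the variable named in this pull request. Defaulting to Abort is an
// assumption made for the sketch.
TimeoutAction timeout_action_from_env() {
    return parse_timeout_action(std::getenv("ATLAS_MPI_BARRIER_TIMEOUT_ACTION"),
                                TimeoutAction::Abort);
}
```

Keeping the parser separate from the `std::getenv` call makes it testable without modifying the process environment.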

Code Modernization and Refactoring

  • Removed the old debug_parallel_here and debug_parallel_what functions, fully replacing them with the new debug_sync infrastructure.
  • Improved formatting of the macOS backtrace output for better readability.

Testing

  • Added a comprehensive test suite (test_debug_logging.cc) and corresponding CMake test targets to validate the new debugging macros and timeout behaviors in both serial and MPI-parallel contexts.

Dependency and Include Updates

  • Updated includes in Log.cc to support new features, such as threading, chrono, and data synchronization, and to ensure compatibility across platforms.

These changes collectively improve the reliability, configurability, and usability of debugging and logging in parallel Atlas applications.

💣💥☠️ Static Analyzer Report ☠️💥💣
https://sites.ecmwf.int/docs/atlas/static-analyzer/PR-385

@wdeconinck wdeconinck changed the title from "Feature/atlas debug sync" to "Add ATLAS_DEBUG_SYNC macros with a timeout to catch deadlocks" Jun 12, 2026
@wdeconinck wdeconinck merged commit f9aa4bf into develop Jun 13, 2026
217 checks passed
@wdeconinck wdeconinck deleted the feature/ATLAS_DEBUG_SYNC branch June 13, 2026 22:05