Tags: tpatki/ovis
OVIS-4.3.8
* Numerous bug fixes
* Multi-threaded low-level Zap transport event handlers
* Command line option support in configuration files
* Summary set, transport, producer, and thread statistics
* Kokkos Appmon store
* Darshan store
* Non-blocking event logging
* Netlink notifier stream sampler
This is the OVIS-4.3.7 Release

New Features:
* Improved LDMSD Streams performance
* Improved ib_verbs backward compatibility
* Per-device procnet sampler
* Per-device ibmad sampler
* AMD GPU sampler
* Per-mount Lustre samplers
* Various reliability and resiliency improvements

Fixes:
* LDMSD Streams memory leak fixes
* Resolved confusing uGNI error messages on exit
* Fixed store rename issues in CSV store
OVIS-4.3.6 Release tag

Features:
* prdcr_stat command to report ldmsd producer statistics
* set_stat command to report active ldmsd set counts and memory usage
* Support for multi-step slurm jobs in the PAPI sampler; the app_id in the metric set is now the step id.
* Partial support for multi-step slurm jobs in the Slurm sampler; the app_id in the metric set is now the step id.
* TimescaleDB storage plugin

Bug Fixes:
* Fix spinning IO thread bug in the socket transport
* Fix build failure for older OFA (ib_verbs) libraries
* Fix build failure for missing openssl when auth enabled
* Fix use-after-free bug in RBD cleanup
* Fix RBD leak in the set delete path
* Fix potential deadlock in Zap RDMA
This is the OVIS-4.3.5 G/A Release

This release includes the following features and fixes:
* Compatibility with OVIS-4.3.3 and OVIS-4.3.4
* Support for the Maestro load balancer
* Allow root user to access ldmsd configuration objects regardless of euid/egid of the process
* Zap socket performance improvements
* Zap fabric performance and resiliency improvements
* Zap RDMA support for OmniPath
* Zap uGNI resiliency improvements
* Fix LDMS Streams Service data loss on process exit
* Metric set permission handling improvements
* Fixes for memory leaks and uninitialized data found by static analysis tools
* Numerous build and packaging improvements
This is the OVIS-4.3.4 G/A Release

Significant testing has been done on the socket, RDMA, and uGNI transports, with socket and uGNI scaling to three levels of aggregation and 30,000 sets in the aggregate. The RDMA transport has been tested to a few thousand sets. The fabric transport should be considered Alpha; it is suitable for development, but not for deployment at this time.

This release includes the following new features:
* LDMS Transport performance statistics (ldmsd_controller xprt_stats command)
* Zap Thread utilization tracking (ldmsd_controller thread_stats command)
* uGNI resiliency improvements to aid with resource error handling
* Packaging updates and github automation to help with tarball generation and release tagging
* A reference counting service has been implemented that supports 'named references'. In debug mode (when REF_TRACK is defined), references are tracked (function name and line number) when they are taken and when they are released, and individual reference counts are kept for each name. This makes it easier to debug reference tracking during development.
* The new ref_t reference counting mechanism has been added to struct ldms_set and struct ldms_rbuf_desc in support of a robust set-delete capability
* An "end-to-end" protocol has been added for deleting metric sets. When an ldmsd deletes a set, each peer that has a memory handle on the set is notified. The set resources are not freed until all peers acknowledge that they have received the delete notification.
* A service (zap_zerr2errno) has been added to consistently map Zap errors to Unix errno
* Updates to the lustre2_client sampler to support newer versions of Lustre
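The 'named references' idea in the notes above can be illustrated with a small sketch. This is a Python toy model, not the OVIS C implementation: the class `NamedRef`, its methods, and the reference names used below are all hypothetical, and the `caller` string only stands in for the function name and line number that the notes say are recorded when REF_TRACK is defined.

```python
from collections import Counter


class NamedRef:
    """Toy reference counter that keeps one count per reference name.

    Hypothetical illustration only; it does not mirror the real ref_t API.
    """

    def __init__(self, on_free):
        self._counts = Counter()  # individual count for each reference name
        self._total = 0           # sum of all named counts
        self._on_free = on_free   # called when the last reference is released
        self.log = []             # (action, name, caller) records for debugging

    def get(self, name, caller="?"):
        """Take a reference under the given name."""
        self._counts[name] += 1
        self._total += 1
        self.log.append(("get", name, caller))

    def put(self, name, caller="?"):
        """Release a reference previously taken under the given name."""
        if self._counts[name] <= 0:
            # A put without a matching get is exactly the kind of bug
            # that per-name counts make easy to spot.
            raise RuntimeError(f"put of '{name}' without a matching get")
        self._counts[name] -= 1
        self._total -= 1
        self.log.append(("put", name, caller))
        if self._total == 0:
            self._on_free()

    def count(self, name):
        """Return the current count held under one name."""
        return self._counts[name]
```

Keeping a count per name means an unbalanced release is reported against the specific name that went wrong, rather than showing up later as an unexplained total.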
This release includes the following updates and fixes:
* Packaging updates and github automation to help with tarball generation and release tagging
* Fixes for issues found by static analysis tools
* The JSON parser had a memory leak that on the socket transport could leak as much as 1MB per message
* A service (zap_zerr2errno) has been added to consistently map Zap errors to Unix errno
* A reference counting service has been implemented that supports 'named references'. In debug mode (when REF_TRACK is defined), references are tracked (function name and line number) when they are taken and when they are released, and individual reference counts are kept for each name. This makes it easier to debug reference tracking during development.
* The new ref_t reference counting mechanism has been added to struct ldms_set and struct ldms_rbuf_desc in support of a robust set-delete capability
* An "end-to-end" protocol has been added for deleting metric sets. When an ldmsd deletes a set, each peer that has a memory handle on the set is notified. The set resources are not freed until all peers acknowledge that they have received the delete notification.
* LDMS transport 'telemetry' data has been added that tracks statistics on the primary transport operations DIR, LOOKUP, UPDATE, SEND, and RECV. The intent is to determine when/if an ldmsd becomes overloaded, underutilized, etc.
* Zap uGNI Transport fixes
  * Ensure socket is closed in uGNI transport
  * Destroy the Cdm in the uGNI transport
  * Refactor Zap uGNI disconnect path
  * Aggressively flush incomplete RdmaPost descriptors.
  * Add more detailed error handling in Zap uGNI
  * Added a thread to subscribe to and report errors on the uGNI transport.
  * Make certain that GNI_EpUnbind does not fail. This ensures that NTT resources held by the endpoint are released.
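The "end-to-end" set delete protocol described in these notes can be sketched as a small state tracker. This is a Python toy model under stated assumptions, not LDMS code: `SetDeleteTracker`, its methods, and the peer names are hypothetical, and no real network messages are exchanged. It only shows the rule that resources are freed after every peer holding a handle has acknowledged the delete notification.

```python
class SetDeleteTracker:
    """Toy model of an acknowledged delete.

    Hypothetical illustration; it does not use any LDMS or Zap API.
    """

    def __init__(self, peers):
        # Peers that hold a handle on the set and have not acknowledged yet.
        self.pending = set(peers)
        # With no peers at all there is nothing to wait for.
        self.freed = not self.pending

    def notify_targets(self):
        """Return the peers that still need a delete notification or ack."""
        return sorted(self.pending)

    def ack(self, peer):
        """Record one peer's acknowledgement.

        Returns True once the set resources may be freed, that is, once
        every peer has acknowledged.
        """
        self.pending.discard(peer)  # duplicate or unknown acks are ignored
        if not self.pending:
            self.freed = True
        return self.freed
```

For example, a tracker built with two peers reports that it is not yet free after the first `ack` and free after the second, which matches the ordering the notes describe.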