Tags: pytorch/kineto
Expose comms_id in traces

Summary:

# Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective/P2P communication operation across all ranks. This lets trace analysis tools correlate the same operation across different ranks when debugging distributed training performance.

# How comms_id is computed

comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)

- pg_name — identifies the process group
- seqNumber — per-PG operation counter that identifies which operation within the PG
- isP2P — distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since the two use separate sequence-number counters
- globalRankStart, globalRankStride, worldSize — encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)

# Changes by layer

1. Data model (ParamCommsUtils.hpp/.cpp) — added seqNumber and isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
2. Hash computation (profiler/util.cpp/.h) — saveNcclMeta() computes comms_id from the six fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (output_json.cpp) — Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp) — nine unit tests covering:
   - storage/retrieval of seqNumber and isP2P
   - default values
   - end-to-end: comms_id appears in saveNcclMeta() output with the correct hash
   - determinism across instances
   - uniqueness across different PG names, sequence numbers, P2P vs. collective, and communicator topologies

Differential Revision: D95659539
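The hash construction described above can be sketched as follows. This is a minimal illustration, not the actual saveNcclMeta() code: `computeCommsId` and `hashCombine` are hypothetical names, and the boost-style mixing constant is an assumption; the point is that all six fields feed one deterministic value.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Boost-style hash mixer (assumed here; the real implementation may combine
// fields differently).
inline void hashCombine(std::size_t& seed, std::size_t v) {
  seed ^= v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2);
}

// Hypothetical sketch: fold all six identity fields into one comms_id so that
// the same (PG, op) pair hashes identically on every rank.
std::size_t computeCommsId(
    const std::string& pgName,
    uint64_t seqNumber,
    bool isP2P,
    int globalRankStart,
    int globalRankStride,
    int worldSize) {
  std::size_t seed = std::hash<std::string>{}(pgName);
  hashCombine(seed, std::hash<uint64_t>{}(seqNumber));
  hashCombine(seed, std::hash<bool>{}(isP2P));
  hashCombine(seed, std::hash<int>{}(globalRankStart));
  hashCombine(seed, std::hash<int>{}(globalRankStride));
  hashCombine(seed, std::hash<int>{}(worldSize));
  return seed;
}
```

Because every input is identical across ranks for a given operation (the per-PG sequence counter advances in lockstep), each rank independently computes the same comms_id, which is what makes cross-rank correlation possible.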
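To make the data-model change concrete, here is a simplified, hypothetical view of the fields ParamCommsDebugInfo carries after this change; the real class in ParamCommsUtils.hpp holds additional state and accessors, so treat this only as a shape sketch of what the storage/retrieval and default-value tests exercise.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, pared-down stand-in for ParamCommsDebugInfo; field names
// follow the summary, defaults are assumptions mirroring the default-value
// tests described above.
struct CommsDebugInfoSketch {
  std::string pgName;       // process group name
  uint64_t seqNumber = 0;   // per-PG operation counter (new field)
  bool isP2P = false;       // P2P op vs. collective (new field)
  int globalRankStart = 0;  // communicator topology
  int globalRankStride = 1;
  int worldSize = 1;
};
```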