Tags: pytorch/kineto
Expose comms_id in traces

Summary:

# Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective/P2P communication operation across all ranks. This lets trace analysis tools correlate the same operation across different ranks when debugging distributed training performance.

# How comms_id is computed

comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)

- pg_name — identifies the process group
- seqNumber — per-PG operation counter that identifies which operation within the PG
- isP2P — distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since the two use separate sequence-number counters
- globalRankStart, globalRankStride, worldSize — encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)

# Changes by layer

1. Data model (ParamCommsUtils.hpp/.cpp) — added seqNumber and isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
2. Hash computation (profiler/util.cpp/.h) — saveNcclMeta() computes comms_id from the six fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (output_json.cpp) — Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp) — nine unit tests covering:
   - storage/retrieval of seqNumber and isP2P
   - default values
   - end-to-end: comms_id appears in saveNcclMeta() output with the correct hash
   - determinism across instances
   - uniqueness across different PG names, sequence numbers, P2P vs. collective, and communicator topologies

Differential Revision: D95659539
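The hash construction described above can be sketched as follows. This is a minimal illustration, not the actual saveNcclMeta() code: `computeCommsId` and `hashCombine` are hypothetical names, and the boost-style mixing constant is an assumption; the point is that all six fields feed one deterministic value.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Boost-style hash mixer (assumed here; the real implementation may combine
// fields differently).
inline void hashCombine(std::size_t& seed, std::size_t v) {
  seed ^= v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2);
}

// Hypothetical sketch: fold all six identity fields into one comms_id so that
// the same (PG, op) pair hashes identically on every rank.
std::size_t computeCommsId(
    const std::string& pgName,
    uint64_t seqNumber,
    bool isP2P,
    int globalRankStart,
    int globalRankStride,
    int worldSize) {
  std::size_t seed = std::hash<std::string>{}(pgName);
  hashCombine(seed, std::hash<uint64_t>{}(seqNumber));
  hashCombine(seed, std::hash<bool>{}(isP2P));
  hashCombine(seed, std::hash<int>{}(globalRankStart));
  hashCombine(seed, std::hash<int>{}(globalRankStride));
  hashCombine(seed, std::hash<int>{}(worldSize));
  return seed;
}
```

Because every input is identical across ranks for a given operation (the per-PG sequence counter advances in lockstep), each rank independently computes the same comms_id, which is what makes cross-rank correlation possible.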
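To make the data-model change concrete, here is a simplified, hypothetical view of the fields ParamCommsDebugInfo carries after this change; the real class in ParamCommsUtils.hpp holds additional state and accessors, so treat this only as a shape sketch of what the storage/retrieval and default-value tests exercise.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, pared-down stand-in for ParamCommsDebugInfo; field names
// follow the summary, defaults are assumptions mirroring the default-value
// tests described above.
struct CommsDebugInfoSketch {
  std::string pgName;       // process group name
  uint64_t seqNumber = 0;   // per-PG operation counter (new field)
  bool isP2P = false;       // P2P op vs. collective (new field)
  int globalRankStart = 0;  // communicator topology
  int globalRankStride = 1;
  int worldSize = 1;
};
```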