Skip to content

Tags: pytorch/kineto

Tags

ciflow/rocm/1286

Toggle ciflow/rocm/1286's commit message
Expose comms_id in traces

Summary:
# Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective/P2P       
communication operation across all ranks. This enables trace analysis tools to correlate the same operation across different ranks for debugging distributed training performance.

How comms_id is computed

comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)

  - pg_name — identifies the process group
  - seqNumber — per-PG operation counter, identifies which operation within the PG
  - isP2P — distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since they use separate sequence number counters
- globalRankStart, globalRankStride, worldSize — encodes the communicator topology,
  disambiguating cases where one PG creates multiple communicators (e.g., comm splits)

 Changes by layer

 1. Data model (ParamCommsUtils.hpp/.cpp) — Added seqNumber, isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
 2. Hash computation (profiler/util.cpp/.h) — In saveNcclMeta(), computes comms_id from the 6 fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (output_json.cpp) — Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp) — 9 unit tests covering:
  - Storage/retrieval of seqNumber and isP2P
  - Default values
  - End-to-end: comms_id appears in saveNcclMeta() output with correct hash
  - Determinism across instances
  - Uniqueness across different PG names, sequence numbers, P2P vs collective, and communicator topologies

Differential Revision: D95659539

v0.4.0

Toggle v0.4.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
update release version to 0.4.0 (#561)

v0.3.1

Toggle v0.3.1's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Update version of plugin to 0.3.1

v0.3.0

Toggle v0.3.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Update version for tb_plugin 0.3.0

v0.2.1

Toggle v0.2.1's commit message
tb_plugin 0.2.1

v0.2.0

Toggle v0.2.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Update README.md (#288)

v0.2.0rc3

Toggle v0.2.0rc3's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Add communication recommendation for tb_plugin (#273)

v0.2.0rc2

Toggle v0.2.0rc2's commit message
Merge branch 'TomWildenhain-Microsoft-tom/different_dist_colors' into…

… plugin/0.2

v0.1.0

Toggle v0.1.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
remove number of device (#127)

* remove number of device

Co-authored-by: Teng Gao <tegao@microsoft.com>

v0.1.0rc2

Toggle v0.1.0rc2's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Merge pull request #84 from gaoteng-git/update_gpu_flag

split is_gpu_used into 3 switches