Tags · pytorch/kineto

ciflow/rocm/1416

Repair ROCm kernel stream attribution (#1416)

Summary:

**Problem**

ROCm Kineto traces can show kernels from independent HIP streams on the wrong stream, making unrelated worker activity look mixed together.

**Why**

ROCProfiler can deliver async kernel dispatch records where multiple kernels share one correlation and another runtime launch correlation is missing. Trusting correlation alone means Kineto can collapse or spread kernels onto the wrong stream. The repaired ROCm stream id is a 64-bit HIP stream value, and GPU user annotations also need to preserve that full value so Perfetto keeps annotations nested on the same lane as the kernels they cover.
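
As a rough illustration of why the full 64-bit value matters, the sketch below narrows two distinct stream ids to 32 bits and shows them landing on the same lane. The 32-bit narrowing and the example handle values are assumptions made for illustration only; the commit message does not say how wide the previous resource id was.

```python
MASK32 = 0xFFFFFFFF  # hypothetical narrower resource id width


def narrow_to_32(resource_id: int) -> int:
    """Drop the upper 32 bits of a resource id (illustrative only)."""
    return resource_id & MASK32


# Two made-up 64-bit stream handles that differ only in their upper bits.
stream_a = 0x7F3A10000040
stream_b = 0x7F3B10000040

# The full values stay distinct, so each stream keeps its own lane.
distinct_full = stream_a != stream_b

# After narrowing, both collapse onto one id, so an annotation keyed by
# the narrowed id could not be matched to the right kernel lane.
collide_narrow = narrow_to_32(stream_a) == narrow_to_32(stream_b)
```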

**Fix**

Keep the HIP stream and demangled kernel name from runtime callbacks, carry dispatch thread and external stream data on async rows, and repair ambiguous duplicated kernel correlations only when runtime thread or canonical kernel name maps to exactly one stream. Preserve the full 64-bit resource id when creating generic GPU user annotations so `Stage::...` annotations stay aligned with repaired ROCm kernel lanes. The implementation limits overhead by building copy and kernel repair maps only for activity types requested by the trace, building the kernel-name map only when duplicated async kernel correlations are present, and tracking pushed stream correlations in cached kernel args instead of adding another thread-local hash set per kernel launch. If the evidence is ambiguous, Kineto preserves the original async stream instead of guessing.
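
The repair rule described above can be sketched roughly as follows. This is a simplified model, not Kineto's code: the `RuntimeLaunch` and `AsyncKernel` record types, the helper names, and the choice to try the thread before the kernel name are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RuntimeLaunch:  # hypothetical record captured in a runtime callback
    thread_id: int
    kernel_name: str
    stream: int


@dataclass
class AsyncKernel:  # hypothetical async kernel dispatch row
    correlation: int
    thread_id: int
    kernel_name: str
    stream: int


def unique_stream_index(launches, key):
    """Map key -> stream, keeping only keys seen on exactly one stream."""
    streams = defaultdict(set)
    for launch in launches:
        streams[key(launch)].add(launch.stream)
    return {k: next(iter(v)) for k, v in streams.items() if len(v) == 1}


def repair_streams(kernels, launches):
    """Return one stream per kernel, repairing only duplicated correlations."""
    counts = defaultdict(int)
    for kernel in kernels:
        counts[kernel.correlation] += 1
    # Skip building the repair maps when no correlation is duplicated.
    if not any(count > 1 for count in counts.values()):
        return [kernel.stream for kernel in kernels]

    by_thread = unique_stream_index(launches, lambda l: l.thread_id)
    by_name = unique_stream_index(launches, lambda l: l.kernel_name)

    repaired = []
    for kernel in kernels:
        if counts[kernel.correlation] < 2:
            repaired.append(kernel.stream)  # unambiguous row, leave as-is
            continue
        stream = by_thread.get(kernel.thread_id)
        if stream is None:
            stream = by_name.get(kernel.kernel_name)
        # Ambiguous evidence: keep the original async stream, do not guess.
        repaired.append(kernel.stream if stream is None else stream)
    return repaired


launches = [
    RuntimeLaunch(1, "a", 0x10),
    RuntimeLaunch(2, "b", 0x20),
    RuntimeLaunch(3, "c", 0x30),
    RuntimeLaunch(3, "c", 0x40),  # thread 3 and kernel "c" span two streams
]
kernels = [
    AsyncKernel(7, 1, "a", 0x10),
    AsyncKernel(7, 2, "b", 0x10),  # shares correlation 7, wrong stream
    AsyncKernel(7, 3, "c", 0x10),  # shares correlation 7, ambiguous
    AsyncKernel(8, 2, "b", 0x99),  # unique correlation, never touched
]
result = repair_streams(kernels, launches)
```

In this example the second kernel moves to stream `0x20` because its thread maps to exactly one stream, while the third keeps its original stream because neither its thread nor its name is unambiguous.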

Differential Revision: D104473177

Jun 2, 2026
0f8ef8c

ciflow/rocm/1286

Expose comms_id in traces

Summary:
# Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective/P2P communication operation across all ranks. This enables trace analysis tools to correlate the same operation across different ranks for debugging distributed training performance.

# How comms_id is computed

comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)

- pg_name: identifies the process group
- seqNumber: per-PG operation counter that identifies which operation within the PG
- isP2P: distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since they use separate sequence number counters
- globalRankStart, globalRankStride, worldSize: encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)
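
A minimal sketch of such an id, assuming a BLAKE2b digest truncated to 64 bits over a fixed binary encoding of the six fields. The actual hash function and encoding used in `saveNcclMeta()` are not stated here, so the function below is illustrative only. Python's built-in `hash()` is avoided because string hashing is randomized per process, which would break cross-rank matching.

```python
import hashlib
import struct


def comms_id(pg_name, seq_number, is_p2p,
             global_rank_start, global_rank_stride, world_size):
    """Deterministic 64-bit id over the six fields (illustrative encoding)."""
    name = pg_name.encode("utf-8")
    # Length-prefix the name so it cannot run into the numeric fields.
    payload = struct.pack("<I", len(name)) + name
    payload += struct.pack(
        "<Q?qqq",
        seq_number, bool(is_p2p),
        global_rank_start, global_rank_stride, world_size,
    )
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Every rank that computes the id from the same six values gets the same id.
base = comms_id("pg0", 42, False, 0, 1, 8)
same = comms_id("pg0", 42, False, 0, 1, 8)

# Changing any single field should yield a different id.
variants = [
    comms_id("pg1", 42, False, 0, 1, 8),  # different process group
    comms_id("pg0", 43, False, 0, 1, 8),  # next operation in the PG
    comms_id("pg0", 42, True, 0, 1, 8),   # P2P instead of collective
    comms_id("pg0", 42, False, 4, 1, 8),  # different rank start
    comms_id("pg0", 42, False, 0, 2, 8),  # different rank stride
    comms_id("pg0", 42, False, 0, 1, 4),  # different world size
]
```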

# Changes by layer

1. Data model (ParamCommsUtils.hpp/.cpp): Added seqNumber and isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
2. Hash computation (profiler/util.cpp/.h): saveNcclMeta() computes comms_id from the six fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (output_json.cpp): Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp): 9 unit tests covering:
   - Storage/retrieval of seqNumber and isP2P
   - Default values
   - End-to-end: comms_id appears in saveNcclMeta() output with the correct hash
   - Determinism across instances
   - Uniqueness across different PG names, sequence numbers, P2P vs. collective, and communicator topologies
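
As a hedged sketch of how an analysis tool might use the new field, the helper below groups events from several per-rank traces by comms_id. It assumes the id appears under the key "Comms Id" in each event's `args` within the standard `traceEvents` array. The function name and the sample events are made up for illustration.

```python
from collections import defaultdict


def group_by_comms_id(traces):
    """Group events by "Comms Id"; traces maps rank -> parsed trace dict."""
    groups = defaultdict(list)
    for rank, trace in traces.items():
        for event in trace.get("traceEvents", []):
            cid = event.get("args", {}).get("Comms Id")
            if cid is None:
                continue  # not a communication event
            groups[cid].append((rank, event.get("name"), event.get("ts")))
    return dict(groups)


# Two tiny made-up traces; real ones would come from json.load() per rank.
traces = {
    0: {"traceEvents": [
        {"name": "allreduce", "ts": 100, "args": {"Comms Id": 111}},
        {"name": "matmul", "ts": 150, "args": {}},
    ]},
    1: {"traceEvents": [
        {"name": "allreduce", "ts": 130, "args": {"Comms Id": 111}},
        {"name": "send", "ts": 200, "args": {"Comms Id": 222}},
    ]},
}
groups = group_by_comms_id(traces)
```

Here the two `allreduce` events share an id, so their start times on rank 0 and rank 1 can be compared directly.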

Differential Revision: D95659539