Tags: pytorch/kineto

ciflow/rocm/1436

Add TypedMetadata to ROCm activities (#1436)

Summary:

Adds typed metadata producers for ROCm activities.

Differential Revision: D108154773

ciflow/rocm/1432

Un-deselect some profiler tests after fixing them (#1432)

Summary:

These should have been fixed in pytorch/pytorch#186970.

Differential Revision: D108287028

ciflow/rocm/1428

Deselect failing tests

ciflow/rocm/1423

Test CI

ciflow/rocm/1416

Repair ROCm kernel stream attribution (#1416)

Summary:

**Problem**

ROCm Kineto traces can show kernels from independent HIP streams on the wrong stream, making unrelated worker activity look mixed together.

**Why**

ROCProfiler can deliver async kernel dispatch records in which several kernels share one correlation ID while the correlation ID of another runtime launch is missing. If Kineto trusts the correlation ID alone, it can collapse kernels onto one stream or spread them across the wrong streams. The repaired ROCm stream ID is a 64-bit HIP stream value, and GPU user annotations must preserve that full value so that Perfetto keeps annotations nested on the same lane as the kernels they cover.
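To illustrate why the full 64-bit value matters, here is a minimal sketch. It assumes the failure mode is a narrowing of the resource ID to 32 bits, which the commit message does not state explicitly; the stream values and the `narrow_to_32_bits` helper are hypothetical.

```python
MASK_32 = 0xFFFFFFFF


def narrow_to_32_bits(resource_id: int) -> int:
    """Simulate storing a resource ID in a 32-bit field (assumed failure mode)."""
    return resource_id & MASK_32


# Hypothetical 64-bit HIP stream values that differ only in their upper bits.
stream_a = 0x00007F3A_10002000
stream_b = 0x00007F3B_10002000

# The kernel lane uses the full 64-bit stream value.
kernel_lane = stream_a

# An annotation that keeps only the low 32 bits lands on a different lane.
narrowed_annotation_lane = narrow_to_32_bits(stream_a)

# An annotation that keeps all 64 bits stays on the kernel's lane.
full_annotation_lane = stream_a

print(narrowed_annotation_lane == kernel_lane)
print(narrow_to_32_bits(stream_a) == narrow_to_32_bits(stream_b))
print(full_annotation_lane == kernel_lane)
```

Under this assumption, a narrowed annotation no longer matches its kernel lane, and two distinct streams can become indistinguishable.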

**Fix**

Keep the HIP stream and demangled kernel name from runtime callbacks, carry dispatch thread and external stream data on async rows, and repair ambiguous duplicated kernel correlations only when the runtime thread or the canonical kernel name maps to exactly one stream. If the evidence is ambiguous, Kineto preserves the original async stream instead of guessing. Preserve the full 64-bit resource ID when creating generic GPU user annotations so that `Stage::...` annotations stay aligned with repaired ROCm kernel lanes.

The implementation limits overhead in three ways: it builds copy and kernel repair maps only for activity types requested by the trace, it builds the kernel-name map only when duplicated async kernel correlations are present, and it tracks pushed stream correlations in cached kernel args instead of adding another thread-local hash set per kernel launch.
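The repair rule can be sketched as follows. This is an illustrative Python model, not the Kineto C++ implementation: the `RuntimeLaunch` and `AsyncKernel` records and the `repair_streams` function are hypothetical names, and the sketch builds both lookup maps together for brevity.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class RuntimeLaunch:  # hypothetical record kept from a runtime callback
    correlation: int
    thread_id: int
    stream: int
    kernel_name: str


@dataclass(frozen=True)
class AsyncKernel:  # hypothetical async dispatch row
    correlation: int
    dispatch_thread: int
    stream: int  # stream as reported by the async record
    kernel_name: str


def unique_stream(candidates: Set[int]) -> Optional[int]:
    """Return the stream only if the evidence points to exactly one."""
    return next(iter(candidates)) if len(candidates) == 1 else None


def repair_streams(launches: List[RuntimeLaunch],
                   kernels: List[AsyncKernel]) -> List[int]:
    counts = Counter(k.correlation for k in kernels)
    duplicated = {c for c, n in counts.items() if n > 1}
    if not duplicated:
        # Nothing is ambiguous, so no lookup maps are built.
        return [k.stream for k in kernels]

    by_thread = defaultdict(set)
    by_name = defaultdict(set)
    for launch in launches:
        by_thread[launch.thread_id].add(launch.stream)
        by_name[launch.kernel_name].add(launch.stream)

    repaired = []
    for k in kernels:
        if k.correlation not in duplicated:
            repaired.append(k.stream)
            continue
        stream = unique_stream(by_thread.get(k.dispatch_thread, set()))
        if stream is None:
            stream = unique_stream(by_name.get(k.kernel_name, set()))
        # Ambiguous evidence: keep the original async stream.
        repaired.append(k.stream if stream is None else stream)
    return repaired


launches = [RuntimeLaunch(1, 10, 0xA, "gemm"), RuntimeLaunch(2, 20, 0xB, "relu")]
kernels = [
    AsyncKernel(1, 10, 0xA, "gemm"),
    AsyncKernel(1, 20, 0xA, "relu"),  # shares correlation 1, wrong stream
    AsyncKernel(3, 30, 0xC, "copy"),  # unique correlation, left untouched
]
print([hex(s) for s in repair_streams(launches, kernels)])
```

In this example the second kernel moves to the stream its dispatch thread launched on, while the kernel with a unique correlation is left alone.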

Differential Revision: D104473177

ciflow/rocm/1286

Expose comms_id in traces

Summary:
# Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective/P2P communication operation across all ranks. This enables trace analysis tools to correlate the same operation across different ranks for debugging distributed training performance.

# How comms_id is computed

comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)

- pg_name: identifies the process group
- seqNumber: per-PG operation counter, identifies which operation within the PG
- isP2P: distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since they use separate sequence number counters
- globalRankStart, globalRankStride, worldSize: encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)
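The six-field hash can be sketched as below. The commit message does not say which hash function PyTorch uses, so this sketch substitutes BLAKE2b truncated to 64 bits as an assumption; the `comms_id` Python function and its snake_case parameter names are illustrative, not the real C++ signature.

```python
import hashlib
import struct


def comms_id(pg_name: str, seq_number: int, is_p2p: bool,
             global_rank_start: int, global_rank_stride: int,
             world_size: int) -> int:
    """Derive a deterministic 64-bit ID from the six fields (assumed hash)."""
    payload = pg_name.encode("utf-8") + b"\x00" + struct.pack(
        "<q?qqq",  # fixed-width little-endian layout keeps fields unambiguous
        seq_number, is_p2p, global_rank_start, global_rank_stride, world_size,
    )
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Every rank that computes the ID from the same fields gets the same value.
rank0 = comms_id("pg_default", 42, False, 0, 1, 8)
rank1 = comms_id("pg_default", 42, False, 0, 1, 8)
print(rank0 == rank1)

# A P2P op with the same sequence number gets a different ID.
print(rank0 == comms_id("pg_default", 42, True, 0, 1, 8))
```

A content-based hash is used here rather than Python's built-in `hash`, because the built-in is salted per process for strings and would not agree across ranks.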

# Changes by layer

1. Data model (ParamCommsUtils.hpp/.cpp): Added seqNumber and isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
2. Hash computation (profiler/util.cpp/.h): In saveNcclMeta(), computes comms_id from the 6 fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (output_json.cpp): Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp): 9 unit tests covering:
   - Storage/retrieval of seqNumber and isP2P
   - Default values
   - End-to-end: comms_id appears in saveNcclMeta() output with correct hash
   - Determinism across instances
   - Uniqueness across different PG names, sequence numbers, P2P vs collective, and communicator topologies
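As a sketch of how a trace analysis tool might use the emitted field, the following groups events from several ranks by "Comms Id". It assumes the value appears under each event's `args` object in the Chrome trace JSON, which the commit message does not specify; the event names, ID values, and helper functions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_comms_id(traces_by_rank: Dict[int, List[dict]]) -> Dict[int, Dict[int, dict]]:
    """Map each comms ID to the matching event on every rank."""
    groups: Dict[int, Dict[int, dict]] = defaultdict(dict)
    for rank, events in traces_by_rank.items():
        for event in events:
            # Assumed location of the field inside a Chrome trace event.
            cid = event.get("args", {}).get("Comms Id")
            if cid is not None:
                groups[cid][rank] = event
    return dict(groups)


def start_skew_us(events_by_rank: Dict[int, dict]) -> int:
    """Spread between the earliest and latest start ("ts", in microseconds)."""
    starts = [event["ts"] for event in events_by_rank.values()]
    return max(starts) - min(starts)


# Made-up events for two ranks; only the structure matters.
traces = {
    0: [{"name": "allreduce", "ph": "X", "ts": 100, "dur": 30, "args": {"Comms Id": 111}},
        {"name": "gemm", "ph": "X", "ts": 10, "dur": 50, "args": {}}],
    1: [{"name": "allreduce", "ph": "X", "ts": 140, "dur": 25, "args": {"Comms Id": 111}}],
}

groups = group_by_comms_id(traces)
skew = start_skew_us(groups[111])
print(sorted(groups[111]), skew)
```

Grouping by the shared ID rather than by event order lets a tool compare the same operation across ranks even when ranks record different numbers of events.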

Differential Revision: D95659539

v0.4.0

update release version to 0.4.0 (#561)

v0.3.1

Update version of plugin to 0.3.1

v0.3.0

Update version for tb_plugin 0.3.0

v0.2.1

tb_plugin 0.2.1