Tags: pytorch/kineto
Add TypedMetadata to ROCm activities (#1436) Summary: Adds typed metadata producers for ROCm activities Differential Revision: D108154773
Un-deselect some profiler tests after fixing them (#1432) Summary: These should have been fixed in pytorch/pytorch#186970 Differential Revision: D108287028
Repair ROCm kernel stream attribution (#1416)

Summary:

**Problem** ROCm Kineto traces can show kernels from independent HIP streams on the wrong stream, making unrelated worker activity look mixed together.

**Why** ROCProfiler can deliver async kernel dispatch records where multiple kernels share one correlation and another runtime launch correlation is missing. Trusting correlation alone means Kineto can collapse or spread kernels onto the wrong stream. The repaired ROCm stream id is a 64-bit HIP stream value, and GPU user annotations also need to preserve that full value so Perfetto keeps annotations nested on the same lane as the kernels they cover.

**Fix** Keep the HIP stream and demangled kernel name from runtime callbacks, carry dispatch thread and external stream data on async rows, and repair ambiguous duplicated kernel correlations only when runtime thread or canonical kernel name maps to exactly one stream. Preserve the full 64-bit resource id when creating generic GPU user annotations so `Stage::...` annotations stay aligned with repaired ROCm kernel lanes.

The implementation limits overhead by building copy and kernel repair maps only for activity types requested by the trace, building the kernel-name map only when duplicated async kernel correlations are present, and tracking pushed stream correlations in cached kernel args instead of adding another thread-local hash set per kernel launch. If the evidence is ambiguous, Kineto preserves the original async stream instead of guessing.

Differential Revision: D104473177
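The repair rule in the **Fix** paragraph (reassign a kernel only when the evidence points at exactly one stream, otherwise keep the async stream) can be sketched as follows. This is a minimal illustration, not Kineto's code: `AsyncKernelRow`, `RuntimeEvidence`, `uniqueStream`, and `repairStream` are hypothetical names, and checking the thread before the kernel name is an assumption the summary does not state.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Stream ids are kept as full 64-bit values, as the summary requires.
using StreamId = std::uint64_t;

// Hypothetical view of one async kernel dispatch record.
struct AsyncKernelRow {
  std::uint64_t threadId;   // dispatch thread carried on the async row
  std::string kernelName;   // canonical (demangled) kernel name
  StreamId stream;          // stream as reported by the async record
};

// Hypothetical evidence collected from runtime callbacks: every stream
// observed for a given thread or kernel name.
struct RuntimeEvidence {
  std::unordered_map<std::uint64_t, std::unordered_set<StreamId>> streamsByThread;
  std::unordered_map<std::string, std::unordered_set<StreamId>> streamsByKernelName;
};

// Returns a pointer to the only stream recorded for `key`, or nullptr when
// the key is unknown or maps to more than one stream.
template <class Map, class Key>
const StreamId* uniqueStream(const Map& map, const Key& key) {
  auto it = map.find(key);
  if (it == map.end() || it->second.size() != 1) {
    return nullptr;
  }
  return &*it->second.begin();
}

// Repairs the stream only on unambiguous evidence; otherwise the original
// async stream is preserved rather than guessed.
StreamId repairStream(const AsyncKernelRow& row, const RuntimeEvidence& ev) {
  if (const StreamId* s = uniqueStream(ev.streamsByThread, row.threadId)) {
    return *s;
  }
  if (const StreamId* s = uniqueStream(ev.streamsByKernelName, row.kernelName)) {
    return *s;
  }
  return row.stream;
}
```

A thread that launched on two streams and a kernel name seen on several streams both fall through to the last line, which matches the "preserve instead of guessing" behaviour described above.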
Expose comms_id in traces

Summary:

# Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective/P2P communication operation across all ranks. This enables trace analysis tools to correlate the same operation across different ranks for debugging distributed training performance.

# How comms_id is computed

`comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)`

- pg_name: identifies the process group
- seqNumber: per-PG operation counter, identifies which operation within the PG
- isP2P: distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since they use separate sequence number counters
- globalRankStart, globalRankStride, worldSize: encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)

# Changes by layer

1. Data model (ParamCommsUtils.hpp/.cpp): Added seqNumber and isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
2. Hash computation (profiler/util.cpp/.h): In saveNcclMeta(), computes comms_id from the 6 fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (output_json.cpp): Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp): 9 unit tests covering:
   - Storage/retrieval of seqNumber and isP2P
   - Default values
   - End-to-end: comms_id appears in saveNcclMeta() output with correct hash
   - Determinism across instances
   - Uniqueness across different PG names, sequence numbers, P2P vs collective, and communicator topologies

Differential Revision: D95659539
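To make the comms_id formula concrete, here is a minimal sketch of hashing the six fields into one 64-bit id. The summary does not say which hash function saveNcclMeta() uses, so this substitutes 64-bit FNV-1a as a stand-in; `CommsKey`, `mixByte`, `mixU64`, and `commsId` are hypothetical names chosen for illustration. A fixed byte-wise hash is used because every rank must compute the same id from the same inputs.

```cpp
#include <cstdint>
#include <string>

// Hypothetical container for the six inputs listed in the summary.
struct CommsKey {
  std::string pgName;
  std::uint64_t seqNumber;
  bool isP2P;
  std::uint64_t globalRankStart;
  std::uint64_t globalRankStride;
  std::uint64_t worldSize;
};

// One FNV-1a step: xor in a byte, then multiply by the 64-bit FNV prime.
inline std::uint64_t mixByte(std::uint64_t h, unsigned char b) {
  return (h ^ b) * 1099511628211ULL;
}

// Feeds an integer lowest byte first, so the result does not depend on
// host endianness.
inline std::uint64_t mixU64(std::uint64_t h, std::uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    h = mixByte(h, static_cast<unsigned char>((v >> (8 * i)) & 0xffu));
  }
  return h;
}

// Assumption: FNV-1a stands in for the unspecified hash in the summary.
inline std::uint64_t commsId(const CommsKey& k) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
  // A length prefix keeps the name from running into the next field.
  h = mixU64(h, static_cast<std::uint64_t>(k.pgName.size()));
  for (unsigned char c : k.pgName) {
    h = mixByte(h, c);
  }
  h = mixU64(h, k.seqNumber);
  h = mixByte(h, k.isP2P ? 1 : 0);
  h = mixU64(h, k.globalRankStart);
  h = mixU64(h, k.globalRankStride);
  h = mixU64(h, k.worldSize);
  return h;
}
```

Two ranks that build the same `CommsKey` get the same id, while changing the sequence number, the P2P flag, the process group name, or the topology fields changes it, which mirrors the determinism and uniqueness properties the unit tests are said to cover.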