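I ran alltoall_perf with -D 1 -R 2 -V 24 on two nodes (16 GPUs in total), and the test fails with an illegal memory access: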
# nccl-tests version 2.17.6 nccl-headers=22807 nccl-library=22807
# Collective test starting: alltoall_perf
# nThread 1 nGpus 1 minBytes 2097152 maxBytes 2097152 step: 0(bytes) warmup iters: 0 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 62767 on node0 device 0 [0000:19:00] NVIDIA H800
# Rank 1 Group 0 Pid 62768 on node0 device 1 [0000:3b:00] NVIDIA H800
# Rank 2 Group 0 Pid 62769 on node0 device 2 [0000:4c:00] NVIDIA H800
# Rank 3 Group 0 Pid 62770 on node0 device 3 [0000:5d:00] NVIDIA H800
# Rank 4 Group 0 Pid 62771 on node0 device 4 [0000:9b:00] NVIDIA H800
# Rank 5 Group 0 Pid 62772 on node0 device 5 [0000:bb:00] NVIDIA H800
# Rank 6 Group 0 Pid 62773 on node0 device 6 [0000:cb:00] NVIDIA H800
# Rank 7 Group 0 Pid 62774 on node0 device 7 [0000:db:00] NVIDIA H800
# Rank 8 Group 0 Pid 36788 on node1 device 0 [0000:19:00] NVIDIA H800
# Rank 9 Group 0 Pid 36789 on node1 device 1 [0000:3b:00] NVIDIA H800
# Rank 10 Group 0 Pid 36790 on node1 device 2 [0000:4c:00] NVIDIA H800
# Rank 11 Group 0 Pid 36791 on node1 device 3 [0000:5d:00] NVIDIA H800
# Rank 12 Group 0 Pid 36792 on node1 device 4 [0000:9b:00] NVIDIA H800
# Rank 13 Group 0 Pid 36793 on node1 device 5 [0000:bb:00] NVIDIA H800
# Rank 14 Group 0 Pid 36794 on node1 device 6 [0000:cb:00] NVIDIA H800
# Rank 15 Group 0 Pid 36795 on node1 device 7 [0000:db:00] NVIDIA H800
NCCL version 2.28.7+cuda13.0
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
2097152 32768 float none -1 node0: Test CUDA failure common.cu:389 'an illegal memory access was encountered'
.. node0 pid 62767: Test failure common.cu:519
.. node0 pid 62767: Test failure common.cu:532
.. node0 pid 62767: Test failure common.cu:704
.. node0 pid 62767: Test failure alltoall.cu:348
.. node0 pid 62767: Test failure common.cu:718
.. node0 pid 62767: Test failure common.cu:1368