Testing deviceImpl fails with 'an illegal memory access was encountered' #355

@Qizhi697

I ran alltoall_perf with the options -D 1 -R 2 -V 24 and got the following error:

# nccl-tests version 2.17.6 nccl-headers=22807 nccl-library=22807
# Collective test starting: alltoall_perf
# nThread 1 nGpus 1 minBytes 2097152 maxBytes 2097152 step: 0(bytes) warmup iters: 0 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  62767 on node0 device  0 [0000:19:00] NVIDIA H800
#  Rank  1 Group  0 Pid  62768 on node0 device  1 [0000:3b:00] NVIDIA H800
#  Rank  2 Group  0 Pid  62769 on node0 device  2 [0000:4c:00] NVIDIA H800
#  Rank  3 Group  0 Pid  62770 on node0 device  3 [0000:5d:00] NVIDIA H800
#  Rank  4 Group  0 Pid  62771 on node0 device  4 [0000:9b:00] NVIDIA H800
#  Rank  5 Group  0 Pid  62772 on node0 device  5 [0000:bb:00] NVIDIA H800
#  Rank  6 Group  0 Pid  62773 on node0 device  6 [0000:cb:00] NVIDIA H800
#  Rank  7 Group  0 Pid  62774 on node0 device  7 [0000:db:00] NVIDIA H800
#  Rank  8 Group  0 Pid  36788 on node1 device  0 [0000:19:00] NVIDIA H800
#  Rank  9 Group  0 Pid  36789 on node1 device  1 [0000:3b:00] NVIDIA H800
#  Rank 10 Group  0 Pid  36790 on node1 device  2 [0000:4c:00] NVIDIA H800
#  Rank 11 Group  0 Pid  36791 on node1 device  3 [0000:5d:00] NVIDIA H800
#  Rank 12 Group  0 Pid  36792 on node1 device  4 [0000:9b:00] NVIDIA H800
#  Rank 13 Group  0 Pid  36793 on node1 device  5 [0000:bb:00] NVIDIA H800
#  Rank 14 Group  0 Pid  36794 on node1 device  6 [0000:cb:00] NVIDIA H800
#  Rank 15 Group  0 Pid  36795 on node1 device  7 [0000:db:00] NVIDIA H800
NCCL version 2.28.7+cuda13.0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong                     
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)                             
     2097152         32768     float    none      -1 node0: Test CUDA failure common.cu:389 'an illegal memory access was encountered'
 .. node0 pid 62767: Test failure common.cu:519
 .. node0 pid 62767: Test failure common.cu:532
 .. node0 pid 62767: Test failure common.cu:704
 .. node0 pid 62767: Test failure alltoall.cu:348
 .. node0 pid 62767: Test failure common.cu:718
 .. node0 pid 62767: Test failure common.cu:1368
