Closed
Description
Open MPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0
ibv_devinfo:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 16.28.2006
node_guid: 0c42:a103:0023:ac92
sys_image_guid: 0c42:a103:0023:ac92
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.
The following are the command and the error log:
mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0
# nThread 1 nGpus 1 minBytes 9 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 20109 on machine-17 device 0 [0x1a] Tesla V100-SXM2-32GB
# Rank 1 Pid 70497 on machine-19 device 0 [0x1a] Tesla V100-SXM2-32GB
machine-17:20109:20109 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-17:20109:20109 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-17:20109:20109 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO Using network IB
machine-19:70497:70497 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-19:70497:70497 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-19:70497:70497 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO Using network IB
NCCL version 2.8.3+cuda10.2
machine-19:70497:70509 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
machine-19:70497:70509 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-17:20109:20124 [0] NCCL INFO Channel 00/02 : 0 1
machine-17:20109:20124 [0] NCCL INFO Channel 01/02 : 0 1
machine-17:20109:20124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
machine-17:20109:20124 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-19:70497:70509 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-17:20109:20124 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Connected all rings
machine-17:20109:20124 [0] NCCL INFO Connected all trees
machine-17:20109:20124 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-17:20109:20124 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-19:70497:70509 [0] NCCL INFO Connected all rings
machine-19:70497:70509 [0] NCCL INFO Connected all trees
machine-19:70497:70509 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-19:70497:70509 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-17:20109:20124 [0] NCCL INFO comm 0x468ee30 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
machine-19:70497:70509 [0] NCCL INFO comm 0x3cfcf70 rank 1 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
machine-17:20109:20109 [0] NCCL INFO Launch mode Parallel
machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-19:70497:70516 [0] NCCL INFO include/net.h:28 -> 2
machine-19:70497:70516 [0] NCCL INFO transport/net.cc:404 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:320 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-17:20109:20140 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-17:20109:20140 [0] NCCL INFO include/net.h:28 -> 2
machine-17:20109:20140 [0] NCCL INFO transport/net.cc:404 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:320 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-19: Test NCCL failure common.cu:346 'unhandled system error'
.. machine-19: Test failure common.cu:395
.. machine-19: Test failure common.cu:494
.. machine-19: Test failure all_reduce.cu:103
.. machine-19: Test failure common.cu:520
.. machine-19: Test failure common.cu:844
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[51283,1],1]
Exit code: 3
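For reference, the numeric fields in the `NET/IB : Got completion with error ...` warning can be decoded with a small script. This is only a minimal sketch, assuming that NCCL prints the raw `ibv_wc_status` value from libibverbs as the "error" number; the helper name `decode_completion_error` is hypothetical and not part of NCCL or libibverbs.

```python
import re

# Values of `enum ibv_wc_status` as declared in libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the warning text exactly as it appears in the log above.
WARN_RE = re.compile(
    r"Got completion with error (\d+), opcode (\d+), len (\d+), vendor err (\d+)"
)


def decode_completion_error(line):
    """Hypothetical helper: extract the fields of a NET/IB completion warning.

    Assumption: the "error" number is the ibv_wc_status value of the failed
    work completion. Returns None if the line is not such a warning.
    """
    match = WARN_RE.search(line)
    if match is None:
        return None
    status, opcode, length, vendor_err = (int(g) for g in match.groups())
    return {
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "UNKNOWN"),
        "opcode": opcode,
        "len": length,
        "vendor_err": vendor_err,
    }


if __name__ == "__main__":
    sample = (
        "machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : "
        "Got completion with error 12, opcode 0, len 0, vendor err 129"
    )
    print(decode_completion_error(sample))
```

Under that assumption, error 12 in the log above corresponds to `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded), reported on both machine-17 and machine-19. The vendor error 129 is device-specific and is not decoded here.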