NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129 #426

@NHZlX

Open MPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0

ibv_devinfo:

hca_id:	mlx5_bond_0
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0c42:a103:0023:ac92
	sys_image_guid:			0c42:a103:0023:ac92
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			MT_0000000012
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
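
To make the relevant port fields easier to compare across the two nodes, here is a minimal Python sketch that pulls them out of the `ibv_devinfo` text. The helper names `parse_ibv_devinfo` and `mtu_bytes` are hypothetical, and the parser assumes the simple `key: value` layout shown above with a single HCA and a single port.

```python
# Excerpt of the ibv_devinfo output above (whitespace simplified).
SAMPLE = """
hca_id: mlx5_bond_0
    transport:      InfiniBand (0)
    node_guid:      0c42:a103:0023:ac92
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            link_layer:     Ethernet
"""


def parse_ibv_devinfo(text):
    """Collect 'key: value' pairs from ibv_devinfo output into a dict.

    Assumes one HCA with one port; with several, later keys overwrite earlier ones.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so GUID values keep their colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def mtu_bytes(value):
    """Turn an MTU field such as '1024 (3)' into the integer 1024."""
    return int(value.split()[0])


info = parse_ibv_devinfo(SAMPLE)
is_ethernet_link = info["link_layer"] == "Ethernet"
mtu_below_max = mtu_bytes(info["active_mtu"]) < mtu_bytes(info["max_mtu"])
print("link_layer Ethernet:", is_ethernet_link)
print("active_mtu below max_mtu:", mtu_below_max)
```

On the output above, `link_layer` is `Ethernet`, which matches the `mlx5_bond_0:1/RoCE` line in the NCCL log, and `active_mtu` (1024) is below `max_mtu` (4096).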

This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.

The following are the command and the error log:

mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0

# nThread 1 nGpus 1 minBytes 9 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  20109 on machine-17 device  0 [0x1a] Tesla V100-SXM2-32GB
#   Rank  1 Pid  70497 on machine-19 device  0 [0x1a] Tesla V100-SXM2-32GB
machine-17:20109:20109 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-17:20109:20109 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-17:20109:20109 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO Using network IB
machine-19:70497:70497 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-19:70497:70497 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-19:70497:70497 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO Using network IB
NCCL version 2.8.3+cuda10.2
machine-19:70497:70509 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
machine-19:70497:70509 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-17:20109:20124 [0] NCCL INFO Channel 00/02 :    0   1
machine-17:20109:20124 [0] NCCL INFO Channel 01/02 :    0   1
machine-17:20109:20124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
machine-17:20109:20124 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-19:70497:70509 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-17:20109:20124 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Connected all rings
machine-17:20109:20124 [0] NCCL INFO Connected all trees
machine-17:20109:20124 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-17:20109:20124 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-19:70497:70509 [0] NCCL INFO Connected all rings
machine-19:70497:70509 [0] NCCL INFO Connected all trees
machine-19:70497:70509 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-19:70497:70509 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-17:20109:20124 [0] NCCL INFO comm 0x468ee30 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
machine-19:70497:70509 [0] NCCL INFO comm 0x3cfcf70 rank 1 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
machine-17:20109:20109 [0] NCCL INFO Launch mode Parallel

machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-19:70497:70516 [0] NCCL INFO include/net.h:28 -> 2
machine-19:70497:70516 [0] NCCL INFO transport/net.cc:404 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:320 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]

machine-17:20109:20140 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-17:20109:20140 [0] NCCL INFO include/net.h:28 -> 2
machine-17:20109:20140 [0] NCCL INFO transport/net.cc:404 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:320 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-19: Test NCCL failure common.cu:346 'unhandled system error'
 .. machine-19: Test failure common.cu:395
 .. machine-19: Test failure common.cu:494
 .. machine-19: Test failure all_reduce.cu:103
 .. machine-19: Test failure common.cu:520
 .. machine-19: Test failure common.cu:844
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51283,1],1]
  Exit code:    3
