NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:df54%ib0<33397> with error 4, opcode 32611, len 32611, vendor err 81 (Send) #928

Open
liuxingbo12138 opened this issue Jul 25, 2023 · 4 comments

liuxingbo12138 commented Jul 25, 2023

When I run

mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x NCCL_ALGO=Ring -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=trace -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --host 10.101.15.5,10.101.15.4,10.101.15.3,10.101.15.2,10.101.15.1 ./build/all_reduce_perf -b 8 -e 5G -f 2 -g 8

I get the following error log:

H100-2:19989:20082 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:df78%ib0<53249> with error 4, opcode 32564, len 32564, vendor err 81 (Send)
H100-2:19989:20082 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-2:19989:20082 [1] NCCL INFO proxy.cc:679 -> 6
H100-2:19989:20082 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-1:9821:9920 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:df50%ib0<44413> with error 4, opcode 32613, len 32613, vendor err 81 (Send)
H100-1:9821:9920 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-1:9821:9920 [1] NCCL INFO proxy.cc:679 -> 6
H100-1:9821:9920 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-5:68738:69060 [3] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:9c:f5f8%ib0<39391> with error 4, opcode 32639, len 32639, vendor err 81 (Send)
H100-5:68738:69060 [3] NCCL INFO transport/net.cc:1008 -> 6
H100-5:68738:69060 [3] NCCL INFO proxy.cc:679 -> 6
H100-5:68738:69060 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-5:68738:69056 [2] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:9c:f5f8%ib0<48813> with error 4, opcode 32639, len 32639, vendor err 81 (Send)
H100-5:68738:69056 [2] NCCL INFO transport/net.cc:1008 -> 6
H100-5:68738:69056 [2] NCCL INFO proxy.cc:679 -> 6
H100-5:68738:69056 [2] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-4:18845:18947 [3] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:d06c%ib0<38449> with error 4, opcode 32704, len 32704, vendor err 81 (Send)
H100-4:18845:18947 [3] NCCL INFO transport/net.cc:1008 -> 6
H100-4:18845:18947 [3] NCCL INFO proxy.cc:679 -> 6
H100-4:18845:18947 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-4:18845:18941 [2] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:d06c%ib0<47987> with error 4, opcode 0, len 32704, vendor err 81 (Send)
H100-4:18845:18941 [2] NCCL INFO transport/net.cc:1008 -> 6
H100-4:18845:18941 [2] NCCL INFO proxy.cc:679 -> 6
H100-4:18845:18941 [2] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-5:68738:69058 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:9c:f5f8%ib0<48229> with error 4, opcode 32639, len 32639, vendor err 81 (Send)
H100-5:68738:69058 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-5:68738:69058 [1] NCCL INFO proxy.cc:679 -> 6
H100-5:68738:69058 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

H100-4:18845:18944 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:d06c%ib0<56789> with error 4, opcode 0, len 32704, vendor err 81 (Send)
H100-4:18845:18944 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-4:18845:18944 [1] NCCL INFO proxy.cc:679 -> 6
H100-4:18845:18944 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

dmesg shows no errors. The error log above appears once I start nvidia_peermem. I think this is a problem with the IB network. Can anyone help me solve it? Thanks.
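For anyone triaging similar logs, here is a minimal Python sketch that pulls the fields out of one `NCCL WARN NET/IB` line and names the completion status. The status table is copied from the `ibv_wc_status` enum in libibverbs, where 4 is `IBV_WC_LOC_PROT_ERR` (local protection error). The helper name `parse_warn` and the regular expression are illustrative assumptions written against the lines in this issue; they are not part of NCCL.

```python
import re

# First entries of the ibv_wc_status enum from libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
}

# Pattern matched against the WARN lines posted in this issue (an assumption,
# other NCCL versions may format the line differently).
WARN_RE = re.compile(
    r"^(?P<host>\S+?):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] .*"
    r"Got completion from peer (?P<peer>\S+?)<(?P<port>\d+)> "
    r"with error (?P<status>\d+), opcode (?P<opcode>\d+), len (?P<len>\d+), "
    r"vendor err (?P<vendor_err>\d+)"
)


def parse_warn(line):
    """Return the decoded fields of one WARN line, or None if it does not match."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    status = int(m["status"])
    return {
        "host": m["host"],
        "dev": int(m["dev"]),  # the bracketed index from the log prefix
        "peer": m["peer"],
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "unknown"),
        "vendor_err": int(m["vendor_err"]),
    }
```

The `opcode` and `len` values are deliberately left uninterpreted. The `ibv_poll_cq` documentation lists only `wr_id`, `status`, `qp_num`, and `vendor_err` as valid when the status is not success, which may explain why both fields print the same arbitrary-looking number in most of these lines.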

root@H100-5:/opt/att/copy_file/nccl-tests# nvidia-smi topo -m
\	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PIX	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-39,80-119	0
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NODE	PIX	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-39,80-119	0
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-39,80-119	0
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	NODE	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	0-39,80-119	0
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	40-79,120-159	1
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	SYS	SYS	SYS	SYS	NODE	PIX	NODE	NODE	40-79,120-159	1
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	40-79,120-159	1
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	40-79,120-159	1
NIC0	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS		
NIC1	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS		
NIC2	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS		
NIC3	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	NODE	SYS	SYS	SYS	SYS		
NIC4	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	NODE	SYS	SYS	SYS	SYS		
NIC5	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	NODE	 X 	SYS	SYS	SYS	SYS		
NIC6	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE		
NIC7	SYS	SYS	SYS	SYS	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE		
NIC8	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE		
NIC9	SYS	SYS	SYS	SYS	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
@KaimingOuyang
Collaborator

It seems you are using the wrong HCAs. Can you provide the output of ibstat? Or can you tell me which NICs you expect to use?
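As a rough aid for the "which NICs" question, the sketch below reads the text of `nvidia-smi topo -m` and lists, for each GPU, the NICs it reaches over `PIX` (at most a single PCIe bridge), translated to device names through the NIC legend. The function name `gpu_local_nics` is hypothetical, and the whitespace-based parsing is an assumption that fits the matrix posted in this issue; other driver versions may print it differently.

```python
import re


def gpu_local_nics(topo_text):
    """Map each GPU row to the NIC device names it reaches over PIX."""
    cols = None      # GPU/NIC column names from the header row
    nic_names = {}   # e.g. {"NIC0": "mlx5_0"} from the "NIC Legend" section
    local = {}       # e.g. {"GPU0": ["NIC0"]}
    for line in topo_text.splitlines():
        legend = re.match(r"\s*(NIC\d+):\s*(\S+)", line)
        if legend:
            nic_names[legend[1]] = legend[2]
            continue
        tokens = line.split()
        if cols is None:
            # The header is the first line that names GPU0 as a column.
            if "GPU0" in tokens:
                cols = [t for t in tokens if re.fullmatch(r"(GPU|NIC)\d+", t)]
            continue
        if tokens and re.fullmatch(r"GPU\d+", tokens[0]):
            # Only the first len(cols) cells are link types; the rest are affinities.
            cells = tokens[1:1 + len(cols)]
            local[tokens[0]] = [
                c for c, v in zip(cols, cells) if c.startswith("NIC") and v == "PIX"
            ]
    # Resolve names last, because the legend is printed after the matrix.
    return {gpu: [nic_names.get(n, n) for n in nics] for gpu, nics in local.items()}
```

Read by hand, the matrix in this issue pairs GPU0 to GPU2 with mlx5_0 to mlx5_2, GPU3 with mlx5_5, and GPU4 to GPU7 with mlx5_6 to mlx5_9, while mlx5_3 and mlx5_4 are `PIX` only to each other. Whether those two ports should be excluded (for example through `NCCL_IB_HCA`) cannot be told from the thread; the `ibstat` output requested above would show their link type and state.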

@sjeaugey
Member

IB error 4 is usually due to ACS being enabled and breaking the GPU Direct RDMA protocol.

You can confirm that by setting NCCL_NET_GDR_LEVEL=0.
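To look at ACS directly, a small sketch like the following can scan the text of `lspci -vvv` for bridges whose `ACSCtl` line shows `SrcValid+`. The function name `acs_enabled_devices` is made up for illustration, and the parsing assumes the usual `lspci` layout (an unindented header line per device, indented capability lines below it).

```python
def acs_enabled_devices(lspci_text):
    """Return PCI addresses whose ACSCtl line reports SrcValid+."""
    enabled = []
    current = None  # PCI address of the device block being read
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device block, e.g. "00:01.0 PCI bridge: ..."
            current = line.split()[0]
        elif "ACSCtl:" in line and "SrcValid+" in line and current is not None:
            # ACSCap lines also mention SrcValid, so only ACSCtl lines are counted.
            enabled.append(current)
    return enabled
```

Feed it the output of `sudo lspci -vvv` captured on each node (root is needed for the capability details to be visible). An empty list suggests ACS is not the cause; a non-empty one is consistent with the explanation above. NVIDIA's NCCL troubleshooting guide describes the same check as `sudo lspci -vvv | grep ACSCtl`. For the run-time confirmation, add `-x NCCL_NET_GDR_LEVEL=0` to the `mpirun` command from the original post and see whether the error disappears.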

@HelloWordPiaochen

@liuxingbo12138, have you solved this problem? If so, how?

@liuxingbo12138
Author

> @liuxingbo12138, have you solved this problem? How?

Emm, no.
