When I run:
mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x NCCL_ALGO=Ring -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=trace -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --host 10.101.15.5,10.101.15.4,10.101.15.3,10.101.15.2,10.101.15.1 ./build/all_reduce_perf -b 8 -e 5G -f 2 -g 8
I get the following error log:
H100-2:19989:20082 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:df78%ib0<53249> with error 4, opcode 32564, len 32564, vendor err 81 (Send)
H100-2:19989:20082 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-2:19989:20082 [1] NCCL INFO proxy.cc:679 -> 6
H100-2:19989:20082 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-1:9821:9920 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:df50%ib0<44413> with error 4, opcode 32613, len 32613, vendor err 81 (Send)
H100-1:9821:9920 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-1:9821:9920 [1] NCCL INFO proxy.cc:679 -> 6
H100-1:9821:9920 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-5:68738:69060 [3] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:9c:f5f8%ib0<39391> with error 4, opcode 32639, len 32639, vendor err 81 (Send)
H100-5:68738:69060 [3] NCCL INFO transport/net.cc:1008 -> 6
H100-5:68738:69060 [3] NCCL INFO proxy.cc:679 -> 6
H100-5:68738:69060 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-5:68738:69056 [2] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:9c:f5f8%ib0<48813> with error 4, opcode 32639, len 32639, vendor err 81 (Send)
H100-5:68738:69056 [2] NCCL INFO transport/net.cc:1008 -> 6
H100-5:68738:69056 [2] NCCL INFO proxy.cc:679 -> 6
H100-5:68738:69056 [2] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-4:18845:18947 [3] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:d06c%ib0<38449> with error 4, opcode 32704, len 32704, vendor err 81 (Send)
H100-4:18845:18947 [3] NCCL INFO transport/net.cc:1008 -> 6
H100-4:18845:18947 [3] NCCL INFO proxy.cc:679 -> 6
H100-4:18845:18947 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-4:18845:18941 [2] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:d06c%ib0<47987> with error 4, opcode 0, len 32704, vendor err 81 (Send)
H100-4:18845:18941 [2] NCCL INFO transport/net.cc:1008 -> 6
H100-4:18845:18941 [2] NCCL INFO proxy.cc:679 -> 6
H100-4:18845:18941 [2] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-5:68738:69058 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:9c:f5f8%ib0<48229> with error 4, opcode 32639, len 32639, vendor err 81 (Send)
H100-5:68738:69058 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-5:68738:69058 [1] NCCL INFO proxy.cc:679 -> 6
H100-5:68738:69058 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
H100-4:18845:18944 [1] transport/net_ib.cc:1295 NCCL WARN NET/IB : Got completion from peer fe80::966d:ae03:8b:d06c%ib0<56789> with error 4, opcode 0, len 32704, vendor err 81 (Send)
H100-4:18845:18944 [1] NCCL INFO transport/net.cc:1008 -> 6
H100-4:18845:18944 [1] NCCL INFO proxy.cc:679 -> 6
H100-4:18845:18944 [1] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
dmesg shows no errors. The error log above only appears once I start nvidia_peermem. I think this is a problem with the IB network. Can anyone help me solve it? Thanks.
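
To make the log above easier to triage, the NET/IB warnings can be tallied per host. The following is a minimal sketch, not part of NCCL or nccl-tests. It assumes the `error` field in the warning is the libibverbs work-completion status (`enum ibv_wc_status`), in which case 4 would correspond to `IBV_WC_LOC_PROT_ERR` (local protection error). The names `WARN_RE`, `WC_STATUS`, `summarize` and `describe` are made up for illustration.

```python
import re
from collections import Counter

# Matches the "Got completion from peer ..." warnings in the log above.
WARN_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] .*"
    r"Got completion from peer (?P<peer>\S+?)<(?P<port>\d+)> "
    r"with error (?P<status>\d+), opcode (?P<opcode>\d+), len (?P<len>\d+), "
    r"vendor err (?P<vendor>\d+)"
)

# Small subset of libibverbs' enum ibv_wc_status.
# Assumption: the "error" number printed in the warning is this status code.
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
}


def summarize(lines):
    """Count NET/IB completion warnings by (host, status, vendor err)."""
    counts = Counter()
    for line in lines:
        m = WARN_RE.match(line.strip())
        if m:
            key = (m["host"], int(m["status"]), int(m["vendor"]))
            counts[key] += 1
    return counts


def describe(counts):
    """Render one human-readable line per (host, status, vendor err) key."""
    out = []
    for (host, status, vendor), n in sorted(counts.items()):
        name = WC_STATUS.get(status, "unknown")
        out.append(f"{host}: {n} x status {status} ({name}), vendor err {vendor}")
    return out
```

With the log saved to a file (say a hypothetical `nccl.log`), `print("\n".join(describe(summarize(open("nccl.log")))))` prints one summary line per host. Lines that are not NET/IB completion warnings, such as the `NCCL INFO ... -> 6` lines, are skipped.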