GRPO stuck with NCCL error #620

Open
JoeyXuquant11 opened this issue Apr 24, 2025 · 2 comments
@JoeyXuquant11

```
ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
[rank2]:[E424 01:11:21.160832730 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
[rank3]:[E424 01:11:21.160815008 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800011 milliseconds before timing out.
[rank6]:[E424 01:11:21.160836576 ProcessGroupNCCL.cpp:629] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
[rank5]:[E424 01:11:21.160854228 ProcessGroupNCCL.cpp:629] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
[rank4]:[E424 01:11:21.160857258 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800011 milliseconds before timing out.
[rank5]:[E424 01:11:22.637952327 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 5] failure detected by watchdog at work sequence id: 4 PG status: last enqueued work: 4, last completed work: 3
[rank1]:[E424 01:11:22.637966658 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 4 PG status: last enqueued work: 4, last completed work: 3
[rank3]:[E424 01:11:22.637953242 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 4 PG status: last enqueued work: 4, last completed work: 3
[rank6]:[E424 01:11:22.637975078 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 6] failure detected by watchdog at work sequence id: 4 PG status: last enqueued work: 4, last completed work: 3
[rank2]:[E424 01:11:22.637980579 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 4 PG status: last enqueued work: 4, last completed work: 3
[rank4]:[E424 01:11:22.637979078 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 4] failure detected by watchdog at work sequence id: 4 PG status: last enqueued work: 4, last completed work: 3
[rank5]:[E424 01:11:22.640176215 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E424 01:11:22.640181196 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E424 01:11:22.640195644 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank6]:[E424 01:11:22.640198032 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank4]:[E424 01:11:22.640207844 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E424 01:11:22.640209706 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E424 01:17:25.958856093 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E424 01:17:25.958866191 ProcessGroupNCCL.cpp:681] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E424 01:17:25.958870106 ProcessGroupNCCL.cpp:681] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E424 01:17:25.958925421 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E424 01:17:25.958934996 ProcessGroupNCCL.cpp:695] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E424 01:17:25.958912844 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E424 01:17:25.958921825 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E424 01:17:25.958948736 ProcessGroupNCCL.cpp:695] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E424 01:17:25.958938407 ProcessGroupNCCL.cpp:681] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E424 01:17:25.958978410 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E424 01:17:25.958989710 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E424 01:17:25.959026258 ProcessGroupNCCL.cpp:695] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E424 01:17:25.031948957 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0c4756c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f0bf581bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f0bf581d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f0bf581e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f0c47c5d5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f0c5dba2ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f0c5dc34850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E424 01:17:25.031989005 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb41016c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7fb3be41bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7fb3be41d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb3be41e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fb4105ee5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7fb42672dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7fb4267bf850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E424 01:17:25.031967557 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f15b276c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f1560a1bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f1560a1d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f1560a1e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f15b2eab5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f15c8ddcac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f15c8e6e850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E424 01:17:25.031978332 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f298896c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2936c1bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2936c1d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2936c1e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f29890765c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f299efa6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f299f038850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E424 01:17:25.031991155 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800011 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5a9616c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f5a4441bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f5a4441d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5a4441e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f5a965ee5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f5aac72bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f5aac7bd850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E424 01:17:25.032013349 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800011 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8bef96c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f8b9d81bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f8b9d81d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8b9d81e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8befd9b5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f8c05cebac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f8c05d7d850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendErrorterminate called after throwing an instance of ''
terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::DistBackendErrorc10::DistBackendError'
terminate called after throwing an instance of 'c10::DistBackendErrorc10::DistBackendError'
c10::DistBackendError'
'
'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800011 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5a9616c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f5a4441bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f5a4441d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f5a4441e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f5a965ee5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f5aac72bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f5aac7bd850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5a9616c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f5a440796fc in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f5a965ee5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f5aac72bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f5aac7bd850 in /lib/x86_64-linux-gnu/libc.so.6)
what():
[PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f15b276c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f1560a1bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f1560a1d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f1560a1e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f15b2eab5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f15c8ddcac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f15c8e6e850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f15b276c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f15606796fc in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f15b2eab5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f15c8ddcac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f15c8e6e850 in /lib/x86_64-linux-gnu/libc.so.6)
what(): what(): what(): [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f298896c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2936c1bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2936c1d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2936c1e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f29890765c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f299efa6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f299f038850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f298896c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f29368796fc in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f29890765c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f299efa6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f299f038850 in /lib/x86_64-linux-gnu/libc.so.6)

[PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb41016c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7fb3be41bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7fb3be41d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb3be41e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7fb4105ee5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7fb42672dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7fb4267bf850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb41016c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7fb3be0796fc in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7fb4105ee5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7fb42672dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7fb4267bf850 in /lib/x86_64-linux-gnu/libc.so.6)
[PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0c4756c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f0bf581bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f0bf581d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f0bf581e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f0c47c5d5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f0c5dba2ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f0c5dc34850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0c4756c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f0bf54796fc in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f0c47c5d5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f0c5dba2ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f0c5dc34850 in /lib/x86_64-linux-gnu/libc.so.6)

what():

[PG ID 0 PG GUID 0(default_pg) Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=BROADCAST, NumelIn=112, NumelOut=112, Timeout(ms)=1800000) ran for 1800011 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8bef96c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f8b9d81bc74 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f8b9d81d7d0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8b9d81e6ed in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8befd9b5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7f8c05cebac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f8c05d7d850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8bef96c1b6 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5c6fc (0x7f8b9d4796fc in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f8befd9b5c0 in /home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7f8c05cebac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f8c05d7d850 in /lib/x86_64-linux-gnu/libc.so.6)

W0424 01:17:26.433000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 541326 closing signal SIGTERM
W0424 01:17:26.442000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 541327 closing signal SIGTERM
W0424 01:17:26.445000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 541328 closing signal SIGTERM
W0424 01:17:26.447000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 541329 closing signal SIGTERM
W0424 01:17:26.447000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 541330 closing signal SIGTERM
W0424 01:17:26.448000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 541331 closing signal SIGTERM
E0424 01:17:26.952000 521279 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 6 (pid: 541332) of binary: /home/tanhuajie/miniconda3/envs/openr1/bin/python
Traceback (most recent call last):
  File "/home/tanhuajie/miniconda3/envs/openr1/bin/accelerate", line 8, in <module>
    sys.exit(main())
    ^^^^^^
  File "/home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1182, in launch_command
    deepspeed_launcher(args)
  File "/home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/accelerate/commands/launch.py", line 861, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tanhuajie/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/open_r1/grpo.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time : 2025-04-24_01:17:26
  host : p-phy-ctyun-gz-a800-node-prod-200-103
  rank : 6 (local_rank: 6)
  exitcode : -6 (pid: 541332)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 541332
=======================================================
```

Why am I encountering this error? It first occurred after training for about 20 steps, and now, upon restarting, it fails even within the first step.
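
In case it helps anyone debugging the same hang: the log itself suggests enabling the NCCL Flight Recorder via `TORCH_NCCL_TRACE_BUFFER_SIZE`. Below is a minimal sketch of a relaunch with extra NCCL/c10d diagnostics turned on; the accelerate config file and GRPO recipe paths are placeholders, not the actual command used for this run.

```bash
# Hedged example: enable extra NCCL/c10d diagnostics before relaunching.
# The accelerate config and GRPO recipe paths below are placeholders.
export NCCL_DEBUG=INFO                    # per-rank NCCL transport/peer logging
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000  # enable the Flight Recorder mentioned in the log
export TORCH_DISTRIBUTED_DEBUG=DETAIL     # extra c10d collective consistency checks

accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
  src/open_r1/grpo.py --config recipes/my_grpo_config.yaml
```

With these set, the next timeout should come with a trace of the failed broadcast (the log above says the trace is missing only because FlightRecorder was disabled), which makes it easier to see which rank never reached the collective.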

@YuanEric88

@JoeyXuquant11 Have you solved it?

@lewtun
Member

lewtun commented May 14, 2025

Hello @JoeyXuquant11, can you please provide the command you used to launch the training, along with the output from `trl env`?
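
For reference, a quick way to capture that report (assuming TRL is installed in the same `openr1` conda environment shown in the traceback):

```bash
# Dump the TRL environment report and keep a copy to paste into the issue.
conda activate openr1
trl env | tee trl_env.txt
```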
