Hi, I'm hitting this error: `Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, ...) ran for 600026 milliseconds before timing out`.
The code I'm running is a VQGAN training script; parallelism is handled with accelerate. The timeout happens at the end of each epoch. We suspect it's an environment-setup problem, because the same code runs on another machine without any issue.
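For context, the script follows the standard accelerate pattern; a minimal sketch of the loop structure is below (the model, dataset, and hyperparameters are stand-ins, not the real `magvit_imagenet.py`):

```python
# Minimal sketch of the training-loop structure, assuming a standard accelerate
# setup; the Conv2d "model" and random dataset are placeholders for the real
# VQGAN and ImageNet loader in magvit_imagenet.py.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

dataset = TensorDataset(torch.randn(64, 3, 64, 64), torch.zeros(64, dtype=torch.long))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # placeholder for the VQGAN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(2):
    # The hang happens here at the start of a new epoch, when accelerate's
    # DataLoader wrapper re-synchronizes RNG states across ranks
    # (see the traceback in the log below).
    for real, _ in dataloader:
        recon = model(real)
        loss = torch.nn.functional.mse_loss(recon, real)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```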
Environment: Ubuntu 20

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
```

Full log:
```
[rank7]:[E816 01:45:40.183009389 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
[rank7]:[E816 01:45:40.184373931 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80078, last completed NCCL work: 80077.
[rank3]:[E816 01:45:40.212150935 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank9]:[E816 01:45:40.212150385 ProcessGroupNCCL.cpp:607] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank9]:[E816 01:45:40.212958430 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 9] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank3]:[E816 01:45:40.212958690 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank8]:[E816 01:45:40.265372190 ProcessGroupNCCL.cpp:607] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank8]:[E816 01:45:40.266489823 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 8] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank1]:[E816 01:45:40.278487367 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank1]:[E816 01:45:40.279310302 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank5]:[E816 01:45:40.285038686 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank5]:[E816 01:45:40.286293758 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank2]:[E816 01:45:40.292277610 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600093 milliseconds before timing out.
[rank0]:[E816 01:45:40.292277580 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
[rank0]:[E816 01:45:40.293076425 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank2]:[E816 01:45:40.293076395 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
[rank6]:[E816 01:45:40.296233705 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
[rank6]:[E816 01:45:40.298295572 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077.
0%| | 0/8008 [10:00<?, ?it/s]
[rank7]: Traceback (most recent call last):
[rank7]: File "/home/duanwei/ML/magvit_imagenet.py", line 480, in <module>
[rank7]: for batch_idx, (real, _) in zip(pbar, dataloader):
[rank7]: File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/data_loader.py", line 447, in __iter__
[rank7]: synchronize_rng_states(self.rng_types, self.synchronized_generator)
[rank7]: File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/utils/random.py", line 132, in synchronize_rng_states
[rank7]: synchronize_rng_state(RNGType(rng_type), generator=generator)
[rank7]: File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/utils/random.py", line 127, in synchronize_rng_state
[rank7]: generator.set_state(rng_state)
[rank7]: RuntimeError: Invalid mt19937 state
[rank7]:[E816 01:45:41.635484056 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 7] Timeout at NCCL work: 80078, last enqueued NCCL work: 80078, last completed NCCL work: 80077.
[rank7]:[E816 01:45:41.635510825 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E816 01:45:41.635516125 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E816 01:45:41.636462549 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b9a40e0b2 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8b9a414af3 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b9a416edc in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b9a40e0b2 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8b9a414af3 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b9a416edc in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe3ec34 (0x7f8b9a096c34 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6)
W0816 01:45:41.681000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340257 closing signal SIGTERM
W0816 01:45:41.682000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340258 closing signal SIGTERM
W0816 01:45:41.682000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340259 closing signal SIGTERM
W0816 01:45:41.683000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340260 closing signal SIGTERM
W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340261 closing signal SIGTERM
W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340262 closing signal SIGTERM
W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340263 closing signal SIGTERM
W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340265 closing signal SIGTERM
W0816 01:45:41.685000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340266 closing signal SIGTERM
E0816 01:45:43.466000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 7 (pid: 340264) of binary: /root/miniconda3/envs/magvit/bin/python3.12
Traceback (most recent call last):
File "/root/miniconda3/envs/magvit/bin/accelerate", line 10, in <module>
sys.exit(main())
^^^^^^
File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
magvit_imagenet.py FAILED
---
Failures:
<NO_OTHER_FAILURES>
---
Root Cause (first observed failure):
[0]:
time : 2024-08-16_01:45:41
host : kansas3
rank : 7 (local_rank: 7)
exitcode : -6 (pid: 340264)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 340264
=======================================================
```
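One thing that stands out in the traceback: rank 7 doesn't just time out, it dies with `RuntimeError: Invalid mt19937 state` inside accelerate's `synchronize_rng_states`, which suggests the RNG-state broadcast at the epoch boundary is delivering a bad state on this machine. As a stopgap I'm considering raising the NCCL timeout through accelerate's `InitProcessGroupKwargs`, roughly as in the sketch below (the 2-hour value is arbitrary), but I suspect that would only mask whatever is corrupting the broadcast.

```python
# Possible stopgap I'm considering (untested): give the collective more headroom
# by raising the NCCL timeout through accelerate's InitProcessGroupKwargs.
# The 2-hour value below is arbitrary.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)
```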
Can anyone help me figure out what's going wrong with this environment?