Accelerator.save_state errors out due to timeout. Unable to increase timeout through kwargs_handlers

rranjan1 · April 15, 2024, 1:27am

I am using accelerate with deepspeed stage 1. I get the following error while saving the model:

NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(Seq
Num=913, OpType=ALLREDUCE, NumelIn=499133440, NumelOut=499133440, Timeout(ms)=600000) ran for 600489 milliseconds before timing out.

I am not sure why the timeout is 600000ms. I tried to set the timeout in the code using InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3600))
accelerator = Accelerator(
gradient_accumulation_steps=args.gradient_accumulation_steps,
mixed_precision=args.mixed_precision,
log_with=args.report_to,
project_config=accelerator_project_config,
kwargs_handlers=[kwargs]
)

But this did not change the timeout at all. How can I increase the current timeout from 600s to 3600s ?

accelerate verison: 0.29.2

marcsun13 · April 15, 2024, 11:58am

Hi @rranjan1, could you check if the following ?

import time
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from torch import tensor

kwargs = [InitProcessGroupKwargs(timeout=timedelta(seconds=10))]
accelerator = Accelerator(kwargs_handlers=kwargs)

if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()

print("All called!")

It should fail since the timeout is 4. If you change it to 10, it should works. I just want to see if the timeout is indeed changed as expected.

rranjan1 · April 15, 2024, 4:02pm

I checked this with different timeout=timedelta(seconds=TK) value ranging from 1 to 10. The code never fails for any of these values.

JiaxuZ · February 24, 2025, 1:47am

I meet the same issue. Have you solved it?

John6666 · February 24, 2025, 5:28pm

Maybe ongoing issue.

github.com/NVIDIA/nccl

NCCL timeout issue

opened 05:57PM - 19 Aug 24 UTC

wd255

Hi, I'm having this issue: Watchdog caught collective operation timeout: WorkNCC…L(SeqNum=80078...) ran for 600026 milliseconds before timing out The code I'm running is a VQGAN training script. Parallelism is done with accelerate. This issue happens at the end of each epoch. We believe it's the problem of the environment setup, as the code could run on another machine without any problem environment: Ubuntu 20 nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2024 NVIDIA Corporation Built on Fri_Jun_14_16:34:21_PDT_2024 Cuda compilation tools, release 12.6, V12.6.20 Build cuda_12.6.r12.6/compiler.34431801_0 full log ``` [rank7]:[E816 01:45:40.183009389 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out. [rank7]:[E816 01:45:40.184373931 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80078, last completed NCCL work: 80077. [rank3]:[E816 01:45:40.212150935 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [rank9]:[E816 01:45:40.212150385 ProcessGroupNCCL.cpp:607] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600013 milliseconds before timing out. [rank9]:[E816 01:45:40.212958430 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 9] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank3]:[E816 01:45:40.212958690 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank8]:[E816 01:45:40.265372190 ProcessGroupNCCL.cpp:607] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. [rank8]:[E816 01:45:40.266489823 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 8] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank1]:[E816 01:45:40.278487367 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600079 milliseconds before timing out. [rank1]:[E816 01:45:40.279310302 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank5]:[E816 01:45:40.285038686 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600085 milliseconds before timing out. [rank5]:[E816 01:45:40.286293758 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank2]:[E816 01:45:40.292277610 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. [rank0]:[E816 01:45:40.292277580 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600090 milliseconds before timing out. [rank0]:[E816 01:45:40.293076425 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank2]:[E816 01:45:40.293076395 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. [rank6]:[E816 01:45:40.296233705 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=ALLREDUCE, NumelIn=2106881, NumelOut=2106881, Timeout(ms)=600000) ran for 600097 milliseconds before timing out. [rank6]:[E816 01:45:40.298295572 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 80078, last enqueued NCCL work: 80087, last completed NCCL work: 80077. 0%| | 0/8008 [10:00<?, ?it/s] [rank7]: Traceback (most recent call last): [rank7]: File "/home/duanwei/ML/magvit_imagenet.py", line 480, in <module> [rank7]: for batch_idx, (real, _) in zip(pbar, dataloader): [rank7]: File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/data_loader.py", line 447, in __iter__ [rank7]: synchronize_rng_states(self.rng_types, self.synchronized_generator) [rank7]: File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/utils/random.py", line 132, in synchronize_rng_states [rank7]: synchronize_rng_state(RNGType(rng_type), generator=generator) [rank7]: File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/utils/random.py", line 127, in synchronize_rng_state [rank7]: generator.set_state(rng_state) [rank7]: RuntimeError: Invalid mt19937 state [rank7]:[E816 01:45:41.635484056 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 7] Timeout at NCCL work: 80078, last enqueued NCCL work: 80078, last completed NCCL work: 80077. [rank7]:[E816 01:45:41.635510825 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank7]:[E816 01:45:41.635516125 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down. [rank7]:[E816 01:45:41.636462549 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out. Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b9a40e0b2 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8b9a414af3 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b9a416edc in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #4: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6) frame #5: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 (default_pg) Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078, OpType=BROADCAST, NumelIn=5056, NumelOut=5056, Timeout(ms)=600000) ran for 600026 milliseconds before timing out. Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8b9a40e0b2 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8b9a414af3 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8b9a416edc in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #4: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6) frame #5: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538439675/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8b9912af86 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe3ec34 (0x7f8b9a096c34 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so) frame #2: <unknown function> + 0xd3b55 (0x7f8bea0f0b55 in /root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/lib/../../../.././libstdc++.so.6) frame #3: <unknown function> + 0x94ac3 (0x7f8bfaf3eac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x126850 (0x7f8bfafd0850 in /lib/x86_64-linux-gnu/libc.so.6) W0816 01:45:41.681000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340257 closing signal SIGTERM W0816 01:45:41.682000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340258 closing signal SIGTERM W0816 01:45:41.682000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340259 closing signal SIGTERM W0816 01:45:41.683000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340260 closing signal SIGTERM W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340261 closing signal SIGTERM W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340262 closing signal SIGTERM W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340263 closing signal SIGTERM W0816 01:45:41.684000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340265 closing signal SIGTERM W0816 01:45:41.685000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 340266 closing signal SIGTERM E0816 01:45:43.466000 140650667333440 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 7 (pid: 340264) of binary: /root/miniconda3/envs/magvit/bin/python3.12 Traceback (most recent call last): File "/root/miniconda3/envs/magvit/bin/accelerate", line 10, in <module> sys.exit(main()) ^^^^^^ File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main args.func(args) File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command multi_gpu_launcher(args) File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher distrib_run.run(args) File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run elastic_launch( File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/magvit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ======================================================= magvit_imagenet.py FAILED --- Failures: <NO_OTHER_FAILURES> --- Root Cause (first observed failure): [0]: time : 2024-08-16_01:45:41 host : kansas3 rank : 7 (local_rank: 7) exitcode : -6 (pid: 340264) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 340264 ======================================================= ``` Can anyone help me with it?

nasos10 · March 3, 2025, 10:40am

I am facing the same problem, trying to do accelerator.save_state() (i call it with all processes, as I have seen this is the way to do this), with FSDP wrapping. It timeouts. Have you guys found any workarounds?

Topic		Replies	Views
[E ProcessGroupNCCL.cpp:828] [Rank X] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3634, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800429 milliseconds before timing out 🤗Accelerate	5	6259	July 31, 2023
NCCL Timeout Accelerate Load From Checkpoint 🤗Accelerate	2	2368	June 20, 2025
NCCL timeout + corrupts checkpoint/latest DeepSpeed	1	2572	July 31, 2023
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels 🤗Accelerate	3	14502	February 9, 2023
Early stopping for eval loss causes timeout? 🤗Accelerate	10	1711	June 20, 2024

Accelerator.save_state errors out due to timeout. Unable to increase timeout through kwargs_handlers

Related topics