Hi all, quick question: I'm running into issues when I try to resume from a checkpoint while using an IterableDataset. Training reaches the first accelerator.backward call, then fails with an NCCL timeout like
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=559, OpType=REDUCE, Timeout(ms)=1800000) ran for 1800116 milliseconds before timing out.
I even added

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Double the default 30-minute NCCL timeout
timeout = InitProcessGroupKwargs(timeout=timedelta(seconds=1800 * 2))
accelerator = Accelerator(
    log_with="wandb",
    kwargs_handlers=[timeout],
)
```
and set `NCCL_ASYNC_ERROR_HANDLING=1`, but I still see the timeout fire after 1800 seconds, i.e. the default 30-minute limit rather than my doubled one.
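One thing I noticed (just a plain-Python sanity check, nothing Accelerate-specific): the watchdog message above reports Timeout(ms)=1800000, which matches the default, not the doubled value I pass through the kwargs handler:

```python
from datetime import timedelta

# What I pass to InitProcessGroupKwargs, converted to milliseconds
configured_ms = int(timedelta(seconds=1800 * 2).total_seconds() * 1000)

# What the NCCL watchdog actually reports in the error above
watchdog_ms = 1_800_000

print(configured_ms)                 # 3600000
print(configured_ms == watchdog_ms)  # False, so the longer timeout isn't in effect
```

So it looks like the longer timeout never reaches the NCCL process group.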
Roughly, this is how I load the checkpoint and then skip the already-consumed batches:
```python
if config["train_args"]["resume_from_checkpoint"]:
    # Load the DeepSpeed checkpoint from the specified path
    accelerator.print(f"Resumed from checkpoint: {config['train_args']['resume_from_checkpoint']}")
    accelerator.load_state(config["train_args"]["resume_from_checkpoint"])
    # Recover the step count from a checkpoint name like "step_<N>"
    path = os.path.basename(config["train_args"]["resume_from_checkpoint"])
    training_difference = os.path.splitext(path)[0]
    resume_step = int(training_difference.replace("step_", ""))
else:
    resume_step = -1

accelerator.wait_for_everyone()
progress_bar = tqdm(range(max_steps), disable=not accelerator.is_local_main_process)

if config["train_args"]["resume_from_checkpoint"] and resume_step is not None:
    # We need to skip steps until we reach the resumed step
    train_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    total_steps += resume_step
    progress_bar.update(resume_step)
    accelerator.print(f"Resuming training from step {resume_step}")

torch.distributed.barrier()
accelerator.print(f"Resumed training on rank {accelerator.state.process_index}")

for batch in train_dataloader:
    loss = model(**batch)
    accelerator.backward(loss)
```
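For context, my mental model of what skip_first_batches has to do with a streaming IterableDataset is below — a plain-Python sketch of my own, not the library's actual implementation. Each rank still has to produce and discard every skipped batch, which is why I suspect the skipping itself can take long enough on some ranks to trip the NCCL timeout:

```python
from itertools import islice

def skip_first_batches_sketch(dataloader, num_batches):
    """My approximation of skipping on a streaming dataset: the first
    num_batches items are still produced by the underlying iterator,
    just thrown away, before the rest are yielded."""
    it = iter(dataloader)
    for _ in islice(it, num_batches):
        pass  # each skipped batch still costs a full iteration
    yield from it

# Toy stream standing in for an IterableDataset's batches
stream = (f"batch_{i}" for i in range(10))
remaining = list(skip_first_batches_sketch(stream, 7))
print(remaining)  # ['batch_7', 'batch_8', 'batch_9']
```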
Do you have any suggestions on the best path forward? I’m scratching my head here and not sure exactly what to do.