NCCL Watchdog Timeout error while using Deepspeed and accelerate

Hi all! I have a 12B model that is distributed on 4 GPUs and a 2.8B model that is also distributed. I'm performing inference on the 12B model followed by training the 2.8B model. However, after the first few thousand steps of training, I'm getting the NCCL timeout error. I even tried increasing the default timeout to 90 minutes with os.environ["NCCL_TIMEOUT"] = "5400" and os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1", but it looks like the setting is not being applied and the default value is still 10 minutes. Here is my error:

[rank1]:[E715 17:27:51.148884976 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.

[rank2]:[E715 19:31:08.566950534 ProcessGroupNCCL.cpp:756] [Rank 2] Work WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) timed out in blocking wait.

[rank2]:[E715 19:31:09.873338639 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E715 19:31:09.873366470 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.

I would appreciate any pointers! I'm running the script with accelerate launch script.py and already have a DeepSpeed config file set up.

1 Like

I think there might be a bug…
Setting .deepspeed_env directly might be a quicker workaround.
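For reference, .deepspeed_env is just a plain file of VAR=VALUE lines, placed in the directory you launch from or in your home directory, that the DeepSpeed launcher reads and forwards to every rank. A minimal sketch with illustrative values:

NCCL_TIMEOUT=5400
NCCL_DEBUG=INFO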

thanks! let me try that and see if it helps.

1 Like

@John6666 do you mean only setting up the DS env directly but still using accelerate for training? Or do you mean completely switching to DeepSpeed for everything (moving away from HF)?

1 Like

DS env directly but still using accelerate for training?

this one.

1 Like

@John6666 it looks like export NCCL_P2P_LEVEL=NVL did the job!

1 Like

@John6666 looks like my previous solution didn’t work, so I’m back at it :frowning:

Do you mean creating a config.json myself and then passing it in the training args using deepspeed=config.json? And then do I run the script with the usual accelerate launch script.py?

1 Like

Yeah. I haven't actually used DeepSpeed myself, but I think the method above is the least disruptive way to work around it. If you handle the DeepSpeed config manually, it should be fine.
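For illustration, a minimal sketch of that flow, assuming the HF Trainer; the config contents, file name, and argument values are only examples:

# ds_config.json (hand-written; "auto" lets the HF integration fill in values):
# {
#   "zero_optimization": { "stage": 2 },
#   "train_micro_batch_size_per_gpu": "auto",
#   "gradient_accumulation_steps": "auto"
# }
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,      # illustrative values
    gradient_accumulation_steps=8,
    deepspeed="ds_config.json",         # your own config instead of the accelerate-generated one
)

Launching with the usual accelerate launch script.py (or the deepspeed launcher) should still work.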

As for the compatibility issue with Accelerate, that is probably what's going on, but fixing it with a patch would be difficult…

Or it might be possible to replace DeepSpeed with a different framework.

1 Like

Hi, this is pretty much still an issue.
DeepSpeed with accelerate just hangs at the end of an epoch, sitting there doing nothing. No errors are thrown until a timeout occurs.

I’ve tried:

  • NCCL_P2P_DISABLE
  • NCCL_IB_DISABLE=1
  • NCCL_CUMEM_ENABLE=0
  • NCCL_SHM_DISABLE=1

And various debug approaches, such as:

  • CUDA_LAUNCH_BLOCKING=1
  • TORCH_NCCL_ASYNC_ERROR_HANDLING=1

What really puzzles me is that DeepSpeed on a single GPU, i.e. when I disable the rest via CUDA_VISIBLE_DEVICES="0", works fine. As soon as I start enabling them again, it hangs at the end of the epoch. All 8 GPUs have memory allocated and show 100% utilization.

2 Likes

I still have the same issue, haven’t found any solution yet :frowning:

1 Like

os.environ["NCCL_TIMEOUT"] = "5400"

A bug that caused this environment variable to be overwritten and ignored by accelerate seems to have been fixed a few weeks ago. :sweat_smile:

pip install git+https://github.com/huggingface/accelerate
2 Likes

The NCCL timeout can be modified in 1.11.0; I set NCCL_TIMEOUT=1800000 (30 minutes), but the timeout still occurs.
If you have fixed it, please tell me the method!

1 Like

I think we need to set timeouts other than NCCL as well.


Your timeout still fires because you changed the wrong knob and you likely have a rank that is stalled. In PyTorch+DeepSpeed the watchdog that aborts jobs is owned by ProcessGroupNCCL, not by bare NCCL. Its default “Timeout(ms)” in logs is 600000 (10 minutes). DeepSpeed also has its own 30-minute default unless you override it. If Accelerate initializes the process group via DeepSpeed and ignores your override, you still hit 10 minutes. Logs from many reports show exactly this pattern. (GitHub)

Below is a detailed, practical map: background → common causes → concrete fixes.


Background in one page

  • There are multiple timeouts.

    1. PyTorch ProcessGroupNCCL watchdog controls the “Timeout(ms)=…” you see in crashes. You raise it via init_process_group(timeout=...) or through frameworks that pass it down (see the sketch after this list). If not set correctly, it stays at 10 minutes and aborts with messages like “Watchdog caught collective operation timeout: … Timeout(ms)=600000) ran for 6000xx ms.” (GitHub)
    2. DeepSpeed init_distributed wraps Torch init and has its own default of 30 minutes, overridable with DEEPSPEED_TIMEOUT (seconds) or by passing an explicit timeout=timedelta(...). (deepspeed.readthedocs.io)
    3. NCCL library envs tune transport and network behavior (NCCL_SOCKET_IFNAME, NCCL_P2P_LEVEL, IB timers, etc.). They do not by themselves raise the PyTorch watchdog limit. (NVIDIA Docs)
    4. Accelerate integration bug. Recent issues show InitProcessGroupKwargs(timeout=...) was ignored when DeepSpeed owned initialization. Users saw 10-minute aborts until they upgraded Accelerate. (GitHub)
  • Timeouts are usually symptoms of a stalled rank, not an inherently too-short limit: save/eval pauses on one rank, uneven dataloader lengths, an OOM or exception on a single worker, or network setup issues. The watchdog times out the collective after N minutes. (PyTorch Forums)
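To make items 1 and 2 concrete, a minimal sketch of raising the timeout at the two places that actually read it; the 5400-second value is only an example, and the script is assumed to run under a distributed launcher:

from datetime import timedelta

# Option 1: plain PyTorch. The watchdog honors the process-group timeout
# passed to init_process_group.
import torch.distributed as dist
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=5400))

# Option 2: DeepSpeed's wrapper. It applies its own 30-minute default unless
# you pass timeout= explicitly or set DEEPSPEED_TIMEOUT (seconds) in the env.
# import deepspeed
# deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(seconds=5400))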


Why your NCCL_TIMEOUT=1800000 did nothing

  • The abort you see is from ProcessGroupNCCL’s watchdog, which looks at the PG timeout, not a raw NCCL_TIMEOUT knob. Your logs show the watchdog honoring 600000 ms. Multiple users report that setting an env alone leaves the watchdog at 10 minutes. You must raise the PG timeout explicitly or via DeepSpeed. (GitHub)
  • If the Accelerate+DeepSpeed path ignores your kwargs, you still run with the default and hit the 10-minute abort. Recent Accelerate releases resolve this. (GitHub)

Root causes → fixes

A) Wrong timeout knob or it isn’t propagating

Cause. PG timeout wasn’t raised where DeepSpeed/Torch actually read it, or Accelerate dropped it.
Fix. Upgrade Accelerate. Set DEEPSPEED_TIMEOUT and pass an explicit PG timeout via kwargs.

# Shell: versions and DeepSpeed's own timeout.
pip install -U accelerate deepspeed  # keep these current
# DeepSpeed consumes this (seconds). Docs confirm the override behavior.
# https://deepspeed.readthedocs.io/en/stable/initialize.html
export DEEPSPEED_TIMEOUT=5400

# Python: set the ProcessGroupNCCL timeout explicitly via Accelerate.
# https://huggingface.co/docs/accelerate/en/package_reference/kwargs
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import DeepSpeedPlugin

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accel = Accelerator(deepspeed_plugin=DeepSpeedPlugin(), kwargs_handlers=[pg_kwargs])

DeepSpeed docs: default 30 min, overridable by DEEPSPEED_TIMEOUT. HF reports: older Accelerate builds ignored kwargs with DS; newer builds fix it. (deepspeed.readthedocs.io)

Propagate to all ranks. DeepSpeed forwards NCCL_*/PYTHON* environment variables automatically and lets you pin extra envs with a .deepspeed_env file. Do this for multi-node runs. (DeepSpeed)


B) Rank desync during save/eval or data iteration

Symptoms. Timeouts on ALLREDUCE or ALLGATHER right after eval or accelerator.save_state. One rank is slower or never reaches the collective. Many reports match this pattern. (GitHub)

Fixes.

  • Make loaders even per rank. Use drop_last=True or ensure len(dataset) % world_size == 0. Mismatched steps per rank cause hangs. (PyTorch Forums)
  • Guard checkpointing. Save from rank-0 only and surround checkpoint I/O with barriers (accelerator.wait_for_everyone()); see the sketch after this list. Users hit timeouts precisely on save. (GitHub)
  • Treat any OOM or exception on a single worker as a desync root cause. Recent issues propose better behavior because a single crash leaves peers waiting until the watchdog fires. (GitHub)
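A minimal sketch of the loader and checkpoint hardening with Accelerate; model, optimizer, and train_dataset are assumed to exist, and the checkpoint path is illustrative:

from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

# Even steps per rank: drop the ragged last batch so every rank runs the same
# number of iterations and nobody misses the final collective.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

# ... training loop ...

# Checkpointing: barrier so no rank is mid-step, gather the state dict on every
# rank (the gather is itself a collective under ZeRO-3), write from rank 0 only,
# then barrier again before any rank moves on to the next collective.
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)
if accelerator.is_main_process:
    accelerator.save(state_dict, "checkpoint.pt")
accelerator.wait_for_everyone()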

C) Networking and container friction

Causes. Wrong NIC, slow/blocked path, IB tuning, small shared memory in Docker, or multi-node rendezvous issues. These often manifest as “works for small scale or single node, times out at scale.” (GitHub)

Fixes.

  • Pick the NIC: export NCCL_SOCKET_IFNAME=eth0 (or your interface). Tune or disable IB as needed. See NCCL env reference. (NVIDIA Docs)
  • On NVLink boxes, try export NCCL_P2P_LEVEL=NVL. Several users report stability improvements. (Stack Overflow)
  • In containers, lift IPC/SHM/memlock: --ipc=host --shm-size=8g --ulimit memlock=-1. This resolves several “timeout” patterns that were actually IPC pressure. (Stack Overflow)
  • Validate the fabric with nccl-tests before training. If rings fail under load, fix network first. (GitHub)

D) PyTorch NCCL diagnostics

Turn on Torch’s NCCL diagnostics to find the offending collective and rank instead of guessing.

# PyTorch ProcessGroupNCCL diagnostics
# Docs: https://docs.pytorch.org/docs/stable/torch_nccl_environment_variables.html
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_TRACE_BUFFER_SIZE=1048576
export NCCL_DEBUG=INFO

These envs print the timed-out op, sequence numbers, and often the stuck rank. Then you fix the exact point of desync. (docs.pytorch.org)


End-to-end “works in practice” recipe

  1. Upgrade and set timeouts where they are read.
# Shell: upgrade and set DeepSpeed's own timeout (seconds).
pip install -U accelerate deepspeed
# https://deepspeed.readthedocs.io/en/stable/initialize.html
export DEEPSPEED_TIMEOUT=5400
# Optional: verbose NCCL logging for debugging. This alone won't raise the PG timeout.
export NCCL_DEBUG=INFO

# Python: set the ProcessGroupNCCL timeout via Accelerate.
# https://huggingface.co/docs/accelerate/en/package_reference/kwargs
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import DeepSpeedPlugin
accel = Accelerator(
    deepspeed_plugin=DeepSpeedPlugin(),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=5400))],
)
  2. Propagate envs across nodes with .deepspeed_env when launching multi-node. (DeepSpeed)

  3. Harden the loop:

  • drop_last=True on train/eval samplers.
  • Only rank-0 saves. Add accelerator.wait_for_everyone() around checkpoint I/O.
    Evidence: timeouts cluster around save/eval in reports. (GitHub)
  4. Harden the runtime:
  • Container flags: --ipc=host --shm-size=8g --ulimit memlock=-1.
  • Network hints: NCCL_SOCKET_IFNAME=..., NCCL_P2P_LEVEL=NVL on NVLink hosts.
  • Multi-node: verify rendezvous and that all ranks see identical envs. (Stack Overflow)
  5. Debug with signal, not guesses: enable the TORCH_NCCL_* variables, reproduce, read which collective and rank timed out, then fix that call-site. (docs.pytorch.org)

Quick sanity checks you can run

  • Confirm the PG timeout actually changed. Newer Accelerate honors InitProcessGroupKwargs with DeepSpeed; older builds did not. If your logs still show Timeout(ms)=600000), your override is not wired. (GitHub)
  • Repro on a single node with nccl-tests. If fabric is flaky, you will see it there too. (GitHub)

Curated references (grouped)

Raise the correct timeout

  • DeepSpeed timeout and DEEPSPEED_TIMEOUT documented. Shows 30-min default and env override. (deepspeed.readthedocs.io)
  • Accelerate kwargs and examples for InitProcessGroupKwargs(timeout=...). Use recent docs. (Hugging Face)
  • Accelerate bug threads where timeout was ignored with DeepSpeed. Upgrade to fix. (GitHub)

Evidence of the 10-minute watchdog and typical failure logs

  • GitHub issues and HF threads with Timeout(ms)=600000) in ProcessGroupNCCL logs. (GitHub)

Rank-desync during save/eval or load

  • HF issue: timeout “when try to save.”
  • HF forum: accelerator.save_state timeout.
  • HF forum: resume/load then timeout on first backward. (GitHub)

Diagnostics and NCCL envs

  • PyTorch ProcessGroupNCCL env reference: TORCH_NCCL_BLOCKING_WAIT, dump and tracing knobs. (docs.pytorch.org)
  • NCCL env catalogue for networking knobs. (NVIDIA Docs)

Networking and container hygiene

  • SHM/memlock and NCCL_P2P_LEVEL=NVL fixes reported by practitioners. (Stack Overflow)
  • DeepSpeed env propagation with .deepspeed_env. (DeepSpeed)
  • nccl-tests to validate topology. (GitHub)

Bottom line

Raise the PG timeout where it is consumed (DeepSpeed timeout or DEEPSPEED_TIMEOUT plus InitProcessGroupKwargs). Upgrade Accelerate so those settings apply. Then remove rank desyncs in save/eval and fix the network/container. This combination, not NCCL_TIMEOUT alone, stops the watchdog aborts. (deepspeed.readthedocs.io)

1 Like

I think it is OOM. Using another framework throws an OOM error after running for a while.

Thank you for providing such a great answer :slight_smile:

1 Like