Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels

I'm hitting this error with huggingface accelerate:

```
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out.
```


I also ran into the same problem! Any solution?

I am facing a similar problem. I am getting this error in the middle of tokenizing a large dataset.


I just found this GH Issue: huggingface/accelerate#223
It seems we can pass a timeout argument when constructing the Accelerator (the default is 1800 seconds, which matches the Timeout(ms)=1800000 in the error above).

When you load or tokenize a large dataset for the first time, the other ranks sit idle waiting on a collective and NCCL can time out. Hugging Face datasets caches the tokenized result, so when you train again with the same dataset and tokenizer, the work loads from cache and you shouldn't hit the issue.
