> Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

I just found this GH issue: huggingface/accelerate#223,
and it seems we can pass a timeout to the Accelerator constructor (the default is 1800 seconds).
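A minimal sketch of raising the timeout: in current accelerate versions the timeout isn't a bare keyword argument but is passed through an `InitProcessGroupKwargs` handler, and the 7200-second value here is just an illustration:

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the NCCL/process-group timeout from the 1800 s default to,
# e.g., 2 hours so slow first-time preprocessing doesn't kill the run.
ipg = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[ipg])
```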

When you load or tokenize a large dataset for the first time, the preprocessing can take longer than the NCCL timeout while the other processes sit idle. HuggingFace Datasets caches the tokenized output, so subsequent runs with the same dataset and tokenizer shouldn't hit the issue again.
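One way to lean on that cache in a multi-GPU run is `accelerator.main_process_first()`: rank 0 tokenizes and writes the cache while the other ranks wait, then they load the cached result instead of redoing the work. A sketch, assuming you already have a `datasets.Dataset` named `dataset` and a `tokenizer` defined:

```python
from accelerate import Accelerator

accelerator = Accelerator()

def tokenize(batch):
    # `tokenizer` is assumed to be a Hugging Face tokenizer you loaded earlier.
    return tokenizer(batch["text"], truncation=True)

# Rank 0 runs the map first and populates the on-disk cache; the other
# ranks enter afterwards and read the cached Arrow files.
with accelerator.main_process_first():
    dataset = dataset.map(tokenize, batched=True)
```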
