Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels

I'm hitting this error with huggingface accelerate:

```
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out.
```


I also ran into the same problem! Any solution?

I am facing a similar problem. I am getting this error in the middle of tokenizing a large dataset.


I just found this GH Issue: huggingface/accelerate#223
It seems we can pass a timeout argument when constructing the Accelerator (the default is 1800 seconds, which matches the Timeout(ms)=1800000 in the error above).

When you load or tokenize a large dataset for the first time, the other ranks sit idle waiting on a collective and NCCL can time out. Hugging Face datasets caches the tokenized result, so when you train again with the same dataset and tokenizer, the work loads from cache and you shouldn't hit the issue.
