Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels

huggingface accelerate

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out.

I ran into the same problem. Is there any solution?

I am facing a similar problem. I am getting this error in the middle of tokenizing a large dataset.
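
For anyone trying to read the log in the first post: the watchdog line reports which collective stalled (BROADCAST), the limit it was given (1800000 ms, i.e. 30 minutes), and how long it actually ran (1808493 ms). Below is a minimal sketch for pulling those numbers out of such a line. The regex and the function name `parse_watchdog_line` are my own for illustration and are not part of PyTorch or NCCL; the pattern is only matched against the message format quoted above and may need adjusting for other versions.

```python
import re
from datetime import timedelta

# Pattern written against the watchdog message quoted in this thread.
# It is an assumption that other PyTorch versions use the same wording.
WATCHDOG_RE = re.compile(
    r"WorkNCCL\(OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout>\d+)\) "
    r"ran for (?P<elapsed>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the collective type, timeout and elapsed time, or None if no match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    return {
        "op": match.group("op"),
        "timeout": timedelta(milliseconds=int(match.group("timeout"))),
        "elapsed": timedelta(milliseconds=int(match.group("elapsed"))),
    }


# The line from the original post.
LOG_LINE = (
    "[Rank 0] Watchdog caught collective operation timeout: "
    "WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) "
    "ran for 1808493 milliseconds before timing out."
)

info = parse_watchdog_line(LOG_LINE)
print(info["op"], info["timeout"], info["elapsed"])
```

If the timeout is hit during a long tokenization step, one possible explanation (an assumption, not something confirmed in this thread) is that some ranks are waiting in a collective while another is still preprocessing, so the wait exceeds the 30-minute limit shown in the log. As far as I know, accelerate lets you pass a longer limit through `InitProcessGroupKwargs(timeout=...)` in the `kwargs_handlers` argument of `Accelerator`, but please check that against the documentation for your installed version.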