huggingface accelerate
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out.
I also ran into the same problem! Any solution?
I am facing a similar problem. I am getting this error in the middle of tokenizing a large dataset.
I just found this GH Issue: huggingface/accelerate#223
It seems we can pass a longer timeout to the Accelerator constructor (the default is 1800 seconds, which matches the 1800000 ms in the error above).
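Roughly, that would look like the sketch below. I'm assuming a recent accelerate version where the timeout is forwarded through InitProcessGroupKwargs via the kwargs_handlers argument; the two-hour value is just an example, pick whatever covers your preprocessing time.

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective-ops timeout from the default 1800 s to two hours,
# so a long first-time preprocessing step on rank 0 doesn't trip the
# NCCL watchdog while the other ranks wait at the next collective.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```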
When you load or tokenize a large dataset for the first time, NCCL may time out. HuggingFace caches the tokenization, so when you train again on the same dataset and tokenizer, you shouldn't face the issue.
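One way to sidestep the first-run timeout entirely is to tokenize once in a single-process run, so the multi-GPU job only reads from the datasets cache. A minimal sketch, where the dataset name and tokenize_fn are placeholders for your own:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Run this once, single-process, before launching distributed training.
# datasets caches the result of .map(), so the multi-GPU job will load
# the tokenized arrow files from the cache instead of re-tokenizing.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw = load_dataset("imdb")  # placeholder dataset

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = raw.map(tokenize_fn, batched=True)
```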