huggingface accelerate
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out.
I ran into the same problem! Any solution?
I am facing a similar problem. I am getting this error in the middle of tokenizing a large dataset.
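When you load or tokenize a large dataset for the first time, NCCL may time out: the log above shows a collective operation running past its 1800000 ms (30-minute) limit. Hugging Face caches tokenization results, so when you train again with the same dataset and tokenizer, you shouldn’t face the issue again.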
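If the first tokenization pass really does need more than 30 minutes, one possible workaround is to raise the process-group timeout when creating the Accelerator. The snippet below is a minimal sketch, not a confirmed fix for this thread: the 3-hour value and the helper names build_accelerator and tokenize_once are assumptions made for illustration, and dataset is assumed to be a Hugging Face datasets object with a map method.

```python
from datetime import timedelta

# Timeout reported in the log above: 1,800,000 ms (30 minutes).
LOGGED_TIMEOUT = timedelta(milliseconds=1_800_000)

# Assumed value: choose something longer than your slowest first-time tokenization.
LONGER_TIMEOUT = timedelta(hours=3)


def build_accelerator(timeout: timedelta = LONGER_TIMEOUT):
    """Create an Accelerator whose process group uses a longer collective timeout."""
    # Imported lazily so this file can be loaded without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    return Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timeout)])


def tokenize_once(accelerator, dataset, tokenize_fn):
    """Let the main process tokenize first; the other ranks run the same map
    afterwards and should pick up the cached result instead of recomputing."""
    with accelerator.main_process_first():
        return dataset.map(tokenize_fn, batched=True)
```

Note that the other ranks still wait while the main process tokenizes, so that wait has to fit inside the timeout as well. A simpler alternative is to run the tokenization script once in a single process before launching distributed training, so the cache already exists when the multi-GPU job starts.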