> Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

I just found this GH issue: huggingface/accelerate#223,
and it seems we can pass a timeout to the Accelerator constructor (the default is 1800 seconds).
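A minimal sketch of raising the timeout: in current accelerate versions the timeout isn't a bare keyword argument but is passed through an `InitProcessGroupKwargs` handler, and the 7200-second value here is just an illustration:

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the NCCL/process-group timeout from the 1800 s default to,
# e.g., 2 hours so slow first-time preprocessing doesn't kill the run.
ipg = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[ipg])
```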

When you load or tokenize a large dataset for the first time, the preprocessing can take longer than the NCCL timeout while the other processes sit idle. HuggingFace Datasets caches the tokenized output, so subsequent runs with the same dataset and tokenizer shouldn't hit the issue again.
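One way to lean on that cache in a multi-GPU run is `accelerator.main_process_first()`: rank 0 tokenizes and writes the cache while the other ranks wait, then they load the cached result instead of redoing the work. A sketch, assuming you already have a `datasets.Dataset` named `dataset` and a `tokenizer` defined:

```python
from accelerate import Accelerator

accelerator = Accelerator()

def tokenize(batch):
    # `tokenizer` is assumed to be a Hugging Face tokenizer you loaded earlier.
    return tokenizer(batch["text"], truncation=True)

# Rank 0 runs the map first and populates the on-disk cache; the other
# ranks enter afterwards and read the cached Arrow files.
with accelerator.main_process_first():
    dataset = dataset.map(tokenize, batched=True)
```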
