I just found this GH Issue: huggingface/accelerate#223, and it seems that we can add a timeout argument to the Accelerator constructor (the default is 1800 seconds).
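For reference, here's a minimal sketch of what that looks like. Note that in current Accelerate releases the timeout is passed through `InitProcessGroupKwargs` via `kwargs_handlers` rather than as a bare `timeout=` argument on `Accelerator`; the 7200-second value below is just an illustrative choice:

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the process-group (NCCL) timeout from the 1800 s default to 2 hours
# so slow first-time dataset preparation doesn't kill the run.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```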
When you load or tokenize a large dataset for the first time, NCCL may time out. HuggingFace caches tokenization, so when you train again with the same dataset and tokenizer, you shouldn't face the issue.
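A minimal sketch of that caching behavior with the `datasets` library (the dataset and tokenizer names here are just placeholders): calling `.map()` writes a cache file keyed on the dataset and the mapping function, so later runs load the tokenized data instead of recomputing it.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

# The first run computes and caches the tokenized dataset; subsequent runs
# with the same function and inputs hit the cache (load_from_cache_file=True
# is the default), so the slow first-time tokenization doesn't repeat.
tokenized = dataset.map(tokenize, batched=True)
```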