NCCL Timeout Accelerate Load From Checkpoint

I have the same issue. Resuming a long training run from a checkpoint times out after 600000 ms (10 minutes). No matter what I try, I cannot avoid the “ProcessGroupNCCL watchdog hang”. How can I increase the timeout when running accelerate launch?
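For context, the approach usually suggested for this is to pass a longer timeout through `InitProcessGroupKwargs` inside the training script, rather than on the `accelerate launch` command line. A minimal sketch follows, assuming the script constructs its own `Accelerator`. The two-hour value and the `build_accelerator` helper are illustrative choices, not something from my setup, and I cannot confirm this resolves this particular watchdog hang.

```python
from datetime import timedelta

# The 600000 ms in the error message corresponds to 10 minutes.
OBSERVED_TIMEOUT = timedelta(milliseconds=600_000)

# Illustrative larger value (an assumption); size it to how long the
# checkpoint load actually takes on the slowest rank.
NEW_TIMEOUT = timedelta(hours=2)


def build_accelerator(timeout=NEW_TIMEOUT):
    """Hypothetical helper: create an Accelerator with a longer process-group timeout."""
    # Imported lazily so this sketch can be read without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    # The timeout is forwarded to torch.distributed when the process group is created.
    pg_kwargs = InitProcessGroupKwargs(timeout=timeout)
    return Accelerator(kwargs_handlers=[pg_kwargs])
```

The script would then call `accelerator = build_accelerator()` once at startup, before any checkpoint loading, and still be started with `accelerate launch` as usual.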
