NCCL timeout + corrupted checkpoint/latest

I am fine-tuning a pre-trained XLM-Roberta model for my task.
Training runs on 2 V100 GPUs using the Trainer API.
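
Roughly, the setup looks like the sketch below (illustrative only — the actual task head, dataset, and hyperparameters differ, and the names here are placeholders):

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Tiny dummy dataset, only to keep the sketch self-contained.
raw = Dataset.from_dict({"text": ["hello world", "bonjour le monde"], "label": [0, 1]})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)

args = TrainingArguments(
    output_dir="checkpoint",            # checkpoints (including latest) land here
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
```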

After 2 epochs are completed, I get the following error:

[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=101360, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809293 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=101360, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809293 milliseconds before timing out.

Furthermore, the checkpoint/latest file gets corrupted, so when I restart, training begins from scratch instead of resuming.
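
For context, the 1800000 ms in the trace corresponds to the default 30-minute NCCL collective timeout, i.e. rank 1 sat in the all-reduce for over half an hour waiting for the other rank. As far as I can tell the timeout itself is configurable; a minimal sketch, assuming a recent transformers version that exposes ddp_timeout (or a manually initialized process group):

```python
from datetime import timedelta

import torch.distributed as dist
from transformers import TrainingArguments

# Option 1: recent transformers versions expose ddp_timeout (in seconds) on
# TrainingArguments; the Trainer forwards it to process-group initialization.
args = TrainingArguments(
    output_dir="checkpoint",
    ddp_timeout=7200,  # raise the default 1800 s (30 min) to 2 h
)

# Option 2: if the process group is initialized by hand (assumes a distributed
# launcher such as torchrun has set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE).
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```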

Can you please help?

Hi, did you resolve the issue? I encountered a similar issue on multiple V100 GPUs, but the error couldn’t be reproduced on a dual-GPU RTX 3090 + 2080 setup using the same code.