NCCL timeout + corrupted checkpoint/latest

I am fine-tuning a pre-trained XLM-Roberta model for my task.
Training runs on 2 V100 GPUs using the Trainer API.
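
Roughly, the setup looks like the sketch below (illustrative only — the actual task head, dataset, and hyperparameters differ, and the names here are placeholders):

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Tiny dummy dataset, only to keep the sketch self-contained.
raw = Dataset.from_dict({"text": ["hello world", "bonjour le monde"], "label": [0, 1]})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)

args = TrainingArguments(
    output_dir="checkpoint",            # checkpoints (including latest) land here
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
```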

After 2 epochs are completed, I get the following error:

[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=101360, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809293 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=101360, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809293 milliseconds before timing out.

Furthermore, the checkpoint/latest file gets corrupted, so when I restart, training begins from scratch instead of resuming.
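
For context, the 1800000 ms in the trace corresponds to the default 30-minute NCCL collective timeout, i.e. rank 1 sat in the all-reduce for over half an hour waiting for the other rank. As far as I can tell the timeout itself is configurable; a minimal sketch, assuming a recent transformers version that exposes ddp_timeout (or a manually initialized process group):

```python
from datetime import timedelta

import torch.distributed as dist
from transformers import TrainingArguments

# Option 1: recent transformers versions expose ddp_timeout (in seconds) on
# TrainingArguments; the Trainer forwards it to process-group initialization.
args = TrainingArguments(
    output_dir="checkpoint",
    ddp_timeout=7200,  # raise the default 1800 s (30 min) to 2 h
)

# Option 2: if the process group is initialized by hand (assumes a distributed
# launcher such as torchrun has set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE).
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```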

Can you please help?

Hi, did you resolve the issue? I encountered a similar issue on multiple V100 GPUs, but the error couldn’t be reproduced on a dual-GPU RTX 3090 + 2080 setup using the same code.