NCCL Watchdog Timeout error while using Deepspeed and accelerate

Hi all! I have a 12B model distributed across 4 GPUs and a 2.8B model that is also distributed. I run inference with the 12B model followed by training of the 2.8B model. However, after the first few thousand training steps, I hit an NCCL watchdog timeout. I even tried raising the default timeout to 90 minutes with os.environ["NCCL_TIMEOUT"] = "5400" and os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1", but it looks like these are not being applied and the timeout is still the default 10 minutes (a sketch of the relevant part of my script is at the end of this post). Here is the error:

[rank1]:[E715 17:27:51.148884976 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.

[rank2]:[E715 19:31:08.566950534 ProcessGroupNCCL.cpp:756] [Rank 2] Work WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) timed out in blocking wait.

[rank2]:[E715 19:31:09.873338639 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E715 19:31:09.873366470 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.

I would appreciate any pointers! I'm launching with accelerate launch script.py and already have a DeepSpeed config file set up.
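For completeness, here is roughly what the relevant part of script.py looks like. The InitProcessGroupKwargs part is just an alternative to the environment variables that I'm wondering about; I haven't confirmed whether that timeout is honored when the DeepSpeed plugin initializes the process group.

```python
import os
from datetime import timedelta

# What I'm currently trying: raise the timeout to 90 minutes via env vars
# (these don't seem to be picked up -- the watchdog still fires at 600000 ms).
os.environ["NCCL_TIMEOUT"] = "5400"
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

from accelerate import Accelerator, InitProcessGroupKwargs

# Alternative I'm wondering about: pass the timeout explicitly to process
# group init. I'm not sure this takes effect when DeepSpeed sets up the group.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=5400))]
)
```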


I think there might be a bug with how those environment variables get applied…
Setting them in .deepspeed_env directly might be a quicker workaround.
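If I remember correctly, DeepSpeed's launcher reads a .deepspeed_env file (one VAR=value per line, in your home or working directory) and exports those variables on every rank. A minimal sketch, reusing the values from your post:

```python
from pathlib import Path

# Write the NCCL-related variables into ~/.deepspeed_env so the launcher
# exports them on every rank. One VAR=value per line; the values here are
# just the ones from the post above.
(Path.home() / ".deepspeed_env").write_text(
    "NCCL_TIMEOUT=5400\n"
    "TORCH_NCCL_BLOCKING_WAIT=1\n"
)
```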

Thanks! Let me try that and see if it helps.


@John6666 Do you mean only setting up the DS env directly but still using accelerate for training? Or do you mean switching completely to DeepSpeed for everything (moving away from HF)?


"DS env directly but still using accelerate for training?"

This one.


@John6666 It looks like export NCCL_P2P_LEVEL=NVL did the job!
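For anyone who lands here later: the whole fix was that single variable in the launch environment. A minimal sketch of the equivalent from inside the script, in case you can't touch the shell environment (it has to run before the first NCCL communicator is created, and I haven't verified it behaves identically to the shell export):

```python
import os

# Equivalent of `export NCCL_P2P_LEVEL=NVL`: restrict peer-to-peer transfers
# to GPU pairs connected via NVLink. Must run before the first NCCL init.
os.environ["NCCL_P2P_LEVEL"] = "NVL"
```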
