NCCL watchdog timeout error while using DeepSpeed and Accelerate

Hi all! I have a 12B model distributed across 4 GPUs and a 2.8B model that is also distributed. I perform inference with the 12B model and then train the 2.8B one. However, after the first few thousand training steps I'm getting an NCCL timeout error. I even tried increasing the default timeout to 90 minutes with

    os.environ["NCCL_TIMEOUT"] = "5400"
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

but it looks like this isn't being applied and the default of 10 minutes is still used. Here is my error:

[rank1]:[E715 17:27:51.148884976 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.

[rank2]:[E715 19:31:08.566950534 ProcessGroupNCCL.cpp:756] [Rank 2] Work WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) timed out in blocking wait.

[rank2]:[E715 19:31:09.873338639 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E715 19:31:09.873366470 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.

I would appreciate any pointers! I'm running the script with accelerate launch script.py and already have a DeepSpeed config file set up.
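
For completeness, this is roughly what I have at the top of the script (a minimal sketch; the Accelerator setup is simplified, and whether setting the variables here is early enough is exactly what I'm unsure about):

    import os

    # Try to raise the NCCL collective timeout to 90 minutes (5400 s);
    # these need to be visible before the process group is created.
    os.environ["NCCL_TIMEOUT"] = "5400"
    os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

    from accelerate import Accelerator  # imported only after setting the variables

    accelerator = Accelerator()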

1 Like

I think there might be a bug…
Setting .deepspeed_env directly might be a quicker workaround.
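
Something like this, I think (just a sketch, since I haven't used DeepSpeed myself; as far as I know the launcher reads a .deepspeed_env file of KEY=VALUE lines from the working or home directory and exports those variables for every rank):

    from pathlib import Path

    # Write the NCCL settings into ~/.deepspeed_env so the DeepSpeed launcher
    # exports them for every process (one KEY=VALUE per line).
    env_file = Path.home() / ".deepspeed_env"
    env_file.write_text(
        "NCCL_TIMEOUT=5400\n"
        "TORCH_NCCL_BLOCKING_WAIT=1\n"
    )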

thanks! let me try that and see if it helps.

1 Like

@John6666 Do you mean only setting up the DS env directly but still using accelerate for training? Or do you mean completely switching to DeepSpeed for everything (moving away from HF)?

1 Like

DS env directly but still using accelerate for training?

this one.

1 Like

@John6666 it looks like export NCCL_P2P_LEVEL=NVL did the job!

1 Like

@John6666 looks like my previous solution didn’t work, so I’m back at it :frowning:

Do you mean creating a config.json myself and then passing it in the training args using deepspeed=config.json? And then do I run the script with the usual accelerate launch script.py?

1 Like

Yeah. I haven't actually used DeepSpeed, but I think the method above is the least invasive way to work around it. If you handle DeepSpeed manually, it should be fine.

As for the compatibility issue with Accelerate, that's probably what's going on, but fixing it with a patch would be difficult…

Or it might be possible to replace DeepSpeed with a different framework.
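
If it helps, handling it manually would look roughly like this (a minimal sketch, not tested; the file name, the ZeRO stage, and the TrainingArguments values are placeholders and should match whatever your accelerate config currently does):

    import json

    from transformers import TrainingArguments

    # Hand-written DeepSpeed config; "auto" lets the HF Trainer fill the values
    # in from TrainingArguments at runtime.
    ds_config = {
        "zero_optimization": {"stage": 2},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": "auto"},
    }
    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)

    # Point the Trainer at the file instead of relying on the accelerate config.
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        deepspeed="ds_config.json",
    )

Then launch with accelerate launch script.py as before.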

1 Like

Hi, this is still very much an issue.
DeepSpeed with accelerate just hangs at the end of an epoch, sitting there doing nothing. No errors are thrown until a timeout occurs.

I’ve tried:

  • NCCL_P2P_DISABLE
  • NCCL_IB_DISABLE=1
  • NCCL_CUMEM_ENABLE=0
  • NCCL_SHM_DISABLE=1

And various debug approaches, such as:

  • CUDA_LAUNCH_BLOCKING=1
  • TORCH_NCCL_ASYNC_ERROR_HANDLING=1

What really puzzles me is that DeepSpeed on a single GPU, i.e. when I hide the other GPUs via CUDA_VISIBLE_DEVICES="0", works fine. As soon as I start enabling them again, it hangs at the end of the epoch, with all 8 GPUs holding memory and showing 100% utilization.
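
For reference, this is roughly how I'm setting those variables at the moment (a sketch; NCCL_DEBUG=INFO is an extra I added to get more logging, and whether setting them inside the launched script is early enough is part of what I'm unsure about):

    import os

    # NCCL / debugging switches tried so far; set before torch touches NCCL.
    os.environ["NCCL_P2P_DISABLE"] = "1"
    os.environ["NCCL_IB_DISABLE"] = "1"
    os.environ["NCCL_CUMEM_ENABLE"] = "0"
    os.environ["NCCL_SHM_DISABLE"] = "1"
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
    os.environ["NCCL_DEBUG"] = "INFO"  # extra logging only, not one of the items above

    from accelerate import Accelerator

    accelerator = Accelerator()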

2 Likes

I still have the same issue, haven’t found any solution yet :frowning:

1 Like

os.environ["NCCL_TIMEOUT"] = "5400"

A bug that caused this environment variable to be overwritten and ignored by Accelerate seems to have been fixed a few weeks ago. :sweat_smile:

pip install git+https://github.com/huggingface/accelerate
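
If upgrading isn't an option, another route that avoids the environment variable entirely is passing the timeout to the Accelerator at init time (a sketch, assuming a reasonably recent accelerate; 5400 s matches the 90 minutes you wanted):

    from datetime import timedelta

    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    # Raise the process-group (NCCL) timeout to 90 minutes when it is created,
    # instead of relying on NCCL_TIMEOUT being read from the environment.
    accelerator = Accelerator(
        kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=5400))]
    )
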
1 Like