Hi all! I have a 12B model distributed across 4 GPUs and a 2.8B model that is also distributed. I run inference on the 12B model followed by training of the 2.8B model. However, after the first few thousand training steps I get an NCCL timeout error. I even tried increasing the default timeout to 90 minutes with `os.environ["NCCL_TIMEOUT"] = "5400"` and `os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"`, but it looks like this isn't being applied and the timeout is still the default 10 minutes. Here is my error:
[rank1]:[E715 17:27:51.148884976 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank2]:[E715 19:31:08.566950534 ProcessGroupNCCL.cpp:756] [Rank 2] Work WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) timed out in blocking wait.
[rank2]:[E715 19:31:09.873338639 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E715 19:31:09.873366470 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
I would appreciate any pointers! I'm launching with `accelerate launch script.py` and already have a DeepSpeed config file set up.
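For completeness, the only other way I know of to raise the timeout is through Accelerate's kwargs handlers instead of environment variables. A minimal sketch of what I mean (untested in my setup; the 5400-second value just mirrors what I tried above):

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Ask torch.distributed/NCCL for a 90-minute collective timeout instead of
# the default 10 minutes. The handler is forwarded to init_process_group.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])
```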
@John6666 Do you mean setting up the DeepSpeed config directly but still using Accelerate for training, or completely switching to DeepSpeed for everything (moving away from HF)?
@John6666 looks like my previous solution didn’t work, so I’m back at it
Do you mean creating a config.json myself and then passing it in the training args via `deepspeed="config.json"`? And do I still run the script with the usual `accelerate launch script.py`?
Yeah. I haven't actually used DeepSpeed myself, but I think the method above is the least invasive way to work around it. If you configure DeepSpeed manually, it should be fine.
As for the compatibility issue with Accelerate, that is probably what is happening, but fixing it with a patch would be difficult…
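Something like this is what I had in mind, as a minimal sketch (I haven't run it myself, and the config values are only placeholders):

```python
from transformers import TrainingArguments

# Hypothetical minimal ZeRO-2 config; tune the values for your setup.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    deepspeed=ds_config,  # a dict or a path to a JSON file both work here
)
```

Launching would presumably still be `accelerate launch script.py` as you describe, but I'm not sure how the Accelerate config and the manual DeepSpeed config interact.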
Hi, this is pretty much still an issue.
DeepSpeed with Accelerate just hangs at the end of an epoch, sitting there doing nothing. No errors are thrown until a timeout occurs.
I’ve tried:
NCCL_P2P_DISABLE=1
NCCL_IB_DISABLE=1
NCCL_CUMEM_ENABLE=0
NCCL_SHM_DISABLE=1
And various debug approaches, such as:
CUDA_LAUNCH_BLOCKING=1
TORCH_NCCL_ASYNC_ERROR_HANDLING=1
What really puzzles me is that DeepSpeed on a single GPU, i.e. when I disable the rest via CUDA_VISIBLE_DEVICES="0", works fine. As soon as I enable the others again, it hangs at the end of the epoch, with all 8 GPUs holding memory and showing 100% utilization.
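The only thing that has given me more visibility into the hang so far is dumping the Python stack of each rank on demand. A minimal sketch of what I add near the top of the training script (plain `faulthandler`, nothing DeepSpeed-specific, Unix only):

```python
import faulthandler
import signal

# `kill -USR1 <pid>` against a hung rank prints the Python traceback of every
# thread to stderr, which shows which collective or dataloader call that rank
# is blocked in.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```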