NCCL Watchdog Timeout error while using Deepspeed and accelerate

Hi all! I have a 12B model that is distributed on 4 GPUs and a 2.8B model that is also distributed. I'm performing inference on the 12B model followed by training the 2.8B model. However, after the first few thousand steps of training, I'm getting the NCCL timeout error. I even tried increasing the default timeout to 90 minutes with os.environ["NCCL_TIMEOUT"] = "5400" and os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1", but it looks like the setting is not being applied and the default value is still 10 minutes. Here is my error:

[rank1]:[E715 17:27:51.148884976 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.

[rank2]:[E715 19:31:08.566950534 ProcessGroupNCCL.cpp:756] [Rank 2] Work WorkNCCL(SeqNum=1518933, OpType=ALLREDUCE, NumelIn=10490880, NumelOut=10490880, Timeout(ms)=600000) timed out in blocking wait.

[rank2]:[E715 19:31:09.873338639 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E715 19:31:09.873366470 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.

I would appreciate any pointers! I'm running the script with accelerate launch script.py and already have a DeepSpeed config file set up.

1 Like

I think there might be a bug…
Setting .deepspeed_env directly might be a quicker workaround.
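For reference, .deepspeed_env is just a plain file of VAR=VALUE lines, placed in the directory you launch from or in your home directory, that the DeepSpeed launcher reads and forwards to every rank. A minimal sketch with illustrative values:

NCCL_TIMEOUT=5400
NCCL_DEBUG=INFO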

thanks! let me try that and see if it helps.

1 Like

@John6666 do you mean only setting up the DS env directly but still using accelerate for training? Or do you mean completely switching to DeepSpeed for everything (moving away from HF)?

1 Like

DS env directly but still using accelerate for training?

this one.

1 Like

@John6666 it looks like export NCCL_P2P_LEVEL=NVL did the job!

1 Like

@John6666 looks like my previous solution didn’t work, so I’m back at it :frowning:

Do you mean creating a config.json myself and then passing it in the training args using deepspeed=config.json? And then do I run the script with the usual accelerate launch script.py?

1 Like

Yeah. I haven't actually used DeepSpeed myself, but I think the method above is the least disruptive way to work around it. If you handle the DeepSpeed config manually, it should be fine.
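For illustration, a minimal sketch of that flow, assuming the HF Trainer; the config contents, file name, and argument values are only examples:

# ds_config.json (hand-written; "auto" lets the HF integration fill in values):
# {
#   "zero_optimization": { "stage": 2 },
#   "train_micro_batch_size_per_gpu": "auto",
#   "gradient_accumulation_steps": "auto"
# }
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,      # illustrative values
    gradient_accumulation_steps=8,
    deepspeed="ds_config.json",         # your own config instead of the accelerate-generated one
)

Launching with the usual accelerate launch script.py (or the deepspeed launcher) should still work.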

As for the compatibility issue with Accelerate, that is probably what's going on, but fixing it with a patch would be difficult…

Or it might be possible to replace DeepSpeed with a different framework.

1 Like

Hi, this is pretty much still an issue.
DeepSpeed with accelerate just hangs at the end of an epoch, sitting there doing nothing. No errors are thrown until a timeout occurs.

I’ve tried:

  • NCCL_P2P_DISABLE
  • NCCL_IB_DISABLE=1
  • NCCL_CUMEM_ENABLE=0
  • NCCL_SHM_DISABLE=1

And various debug approaches, such as:

  • CUDA_LAUNCH_BLOCKING=1
  • TORCH_NCCL_ASYNC_ERROR_HANDLING=1

What really puzzles me is that DeepSpeed on a single GPU, i.e. when I disable the rest via CUDA_VISIBLE_DEVICES="0", works fine. As soon as I start enabling them again, it hangs at the end of the epoch. All 8 GPUs have memory allocated and show 100% utilization.

2 Likes

I still have the same issue, haven’t found any solution yet :frowning:

1 Like

os.environ["NCCL_TIMEOUT"] = "5400"

A bug that caused this environment variable to be overwritten and ignored by accelerate seems to have been fixed a few weeks ago. :sweat_smile:

pip install git+https://github.com/huggingface/accelerate
2 Likes

The NCCL timeout can be modified in 1.11.0; I set NCCL_TIMEOUT=1800000 (30 minutes), but the timeout still occurs.
If you have fixed it, please tell me the method!

1 Like

I think we need to set timeouts other than NCCL as well.


Your timeout still fires because you changed the wrong knob and you likely have a rank that is stalled. In PyTorch+DeepSpeed the watchdog that aborts jobs is owned by ProcessGroupNCCL, not by bare NCCL. Its default “Timeout(ms)” in logs is 600000 (10 minutes). DeepSpeed also has its own 30-minute default unless you override it. If Accelerate initializes the process group via DeepSpeed and ignores your override, you still hit 10 minutes. Logs from many reports show exactly this pattern. (GitHub)

Below is a detailed, practical map: background → common causes → concrete fixes.


Background in one page

  • There are multiple timeouts.

    1. PyTorch ProcessGroupNCCL watchdog controls the “Timeout(ms)=…” you see in crashes. You raise it via init_process_group(timeout=...) or through frameworks that pass it down (see the sketch after this list). If not set correctly, it stays at 10 minutes and aborts with messages like “Watchdog caught collective operation timeout: … Timeout(ms)=600000) ran for 6000xx ms.” (GitHub)
    2. DeepSpeed init_distributed wraps Torch init and has its own default of 30 minutes, overridable with DEEPSPEED_TIMEOUT (seconds) or by passing an explicit timeout=timedelta(...). (deepspeed.readthedocs.io)
    3. NCCL library envs tune transport and network behavior (NCCL_SOCKET_IFNAME, NCCL_P2P_LEVEL, IB timers, etc.). They do not by themselves raise the PyTorch watchdog limit. (NVIDIA Docs)
    4. Accelerate integration bug. Recent issues show InitProcessGroupKwargs(timeout=...) was ignored when DeepSpeed owned initialization. Users saw 10-minute aborts until they upgraded Accelerate. (GitHub)
  • Timeouts are usually symptoms of a stalled rank, not an inherently too-short limit: save/eval pauses on one rank, uneven dataloader lengths, an OOM or exception on a single worker, or network setup issues. The watchdog times out the collective after N minutes. (PyTorch Forums)
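To make items 1 and 2 concrete, a minimal sketch of raising the timeout at the two places that actually read it; the 5400-second value is only an example, and the script is assumed to run under a distributed launcher:

from datetime import timedelta

# Option 1: plain PyTorch. The watchdog honors the process-group timeout
# passed to init_process_group.
import torch.distributed as dist
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=5400))

# Option 2: DeepSpeed's wrapper. It applies its own 30-minute default unless
# you pass timeout= explicitly or set DEEPSPEED_TIMEOUT (seconds) in the env.
# import deepspeed
# deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(seconds=5400))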


Why your NCCL_TIMEOUT=1800000 did nothing

  • The abort you see is from ProcessGroupNCCL’s watchdog, which looks at the PG timeout, not a raw NCCL_TIMEOUT knob. Your logs show the watchdog honoring 600000 ms. Multiple users report that setting an env alone leaves the watchdog at 10 minutes. You must raise the PG timeout explicitly or via DeepSpeed. (GitHub)
  • If the Accelerate+DeepSpeed path ignores your kwargs, you still run with the default and hit the 10-minute abort. Recent Accelerate releases resolve this. (GitHub)

Root causes → fixes

A) Wrong timeout knob or it isn’t propagating

Cause. PG timeout wasn’t raised where DeepSpeed/Torch actually read it, or Accelerate dropped it.
Fix. Upgrade Accelerate. Set DEEPSPEED_TIMEOUT and pass an explicit PG timeout via kwargs.

# Shell: versions and DeepSpeed's own timeout.
pip install -U accelerate deepspeed  # keep these current
# DeepSpeed consumes this (seconds). Docs confirm the override behavior.
# https://deepspeed.readthedocs.io/en/stable/initialize.html
export DEEPSPEED_TIMEOUT=5400

# Python: set the ProcessGroupNCCL timeout explicitly via Accelerate.
# https://huggingface.co/docs/accelerate/en/package_reference/kwargs
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import DeepSpeedPlugin

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accel = Accelerator(deepspeed_plugin=DeepSpeedPlugin(), kwargs_handlers=[pg_kwargs])

DeepSpeed docs: default 30 min, overridable by DEEPSPEED_TIMEOUT. HF reports: older Accelerate builds ignored kwargs with DS; newer builds fix it. (deepspeed.readthedocs.io)

Propagate to all ranks. DeepSpeed forwards NCCL_*/PYTHON* environment variables automatically and lets you pin extra envs with a .deepspeed_env file. Do this for multi-node runs. (DeepSpeed)


B) Rank desync during save/eval or data iteration

Symptoms. Timeouts on ALLREDUCE or ALLGATHER right after eval or accelerator.save_state. One rank is slower or never reaches the collective. Many reports match this pattern. (GitHub)

Fixes.

  • Make loaders even per rank. Use drop_last=True or ensure len(dataset) % world_size == 0. Mismatched steps per rank cause hangs. (PyTorch Forums)
  • Guard checkpointing. Save from rank-0 only and surround checkpoint I/O with barriers (accelerator.wait_for_everyone()); see the sketch after this list. Users hit timeouts precisely on save. (GitHub)
  • Treat any OOM or exception on a single worker as a desync root cause. Recent issues propose better behavior because a single crash leaves peers waiting until the watchdog fires. (GitHub)
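A minimal sketch of the loader and checkpoint hardening with Accelerate; model, optimizer, and train_dataset are assumed to exist, and the checkpoint path is illustrative:

from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

# Even steps per rank: drop the ragged last batch so every rank runs the same
# number of iterations and nobody misses the final collective.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

# ... training loop ...

# Checkpointing: barrier so no rank is mid-step, gather the state dict on every
# rank (the gather is itself a collective under ZeRO-3), write from rank 0 only,
# then barrier again before any rank moves on to the next collective.
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)
if accelerator.is_main_process:
    accelerator.save(state_dict, "checkpoint.pt")
accelerator.wait_for_everyone()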

C) Networking and container friction

Causes. Wrong NIC, slow/blocked path, IB tuning, small shared memory in Docker, or multi-node rendezvous issues. These often manifest as “works for small scale or single node, times out at scale.” (GitHub)

Fixes.

  • Pick the NIC: export NCCL_SOCKET_IFNAME=eth0 (or your interface). Tune or disable IB as needed. See NCCL env reference. (NVIDIA Docs)
  • On NVLink boxes, try export NCCL_P2P_LEVEL=NVL. Several users report stability improvements. (Stack Overflow)
  • In containers, lift IPC/SHM/memlock: --ipc=host --shm-size=8g --ulimit memlock=-1. This resolves several “timeout” patterns that were actually IPC pressure. (Stack Overflow)
  • Validate the fabric with nccl-tests before training. If rings fail under load, fix network first. (GitHub)

D) PyTorch NCCL diagnostics

Turn on Torch’s NCCL diagnostics to find the offending collective and rank instead of guessing.

# PyTorch ProcessGroupNCCL diagnostics
# Docs: https://docs.pytorch.org/docs/stable/torch_nccl_environment_variables.html
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_TRACE_BUFFER_SIZE=1048576
export NCCL_DEBUG=INFO

These envs print the timed-out op, sequence numbers, and often the stuck rank. Then you fix the exact point of desync. (docs.pytorch.org)


End-to-end “works in practice” recipe

  1. Upgrade and set timeouts where they are read.
# Shell: upgrade and set DeepSpeed's own timeout (seconds).
pip install -U accelerate deepspeed
# https://deepspeed.readthedocs.io/en/stable/initialize.html
export DEEPSPEED_TIMEOUT=5400
# Optional: verbose NCCL logging for debugging. This alone won't raise the PG timeout.
export NCCL_DEBUG=INFO

# Python: set the ProcessGroupNCCL timeout via Accelerate.
# https://huggingface.co/docs/accelerate/en/package_reference/kwargs
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import DeepSpeedPlugin
accel = Accelerator(
    deepspeed_plugin=DeepSpeedPlugin(),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=5400))],
)
  2. Propagate envs across nodes with .deepspeed_env when launching multi-node. (DeepSpeed)

  3. Harden the loop:

  • drop_last=True on train/eval samplers.
  • Only rank-0 saves. Add accelerator.wait_for_everyone() around checkpoint I/O.
    Evidence: timeouts cluster around save/eval in reports. (GitHub)
  4. Harden the runtime:
  • Container flags: --ipc=host --shm-size=8g --ulimit memlock=-1.
  • Network hints: NCCL_SOCKET_IFNAME=..., NCCL_P2P_LEVEL=NVL on NVLink hosts.
  • Multi-node: verify rendezvous and that all ranks see identical envs. (Stack Overflow)
  5. Debug with signal, not guesses: enable the TORCH_NCCL_* variables, reproduce, read which collective and rank timed out, then fix that call-site. (docs.pytorch.org)

Quick sanity checks you can run

  • Confirm the PG timeout actually changed. Newer Accelerate honors InitProcessGroupKwargs with DeepSpeed; older builds did not. If your logs still show Timeout(ms)=600000), your override is not wired. (GitHub)
  • Repro on a single node with nccl-tests. If fabric is flaky, you will see it there too. (GitHub)

Curated references (grouped)

Raise the correct timeout

  • DeepSpeed timeout and DEEPSPEED_TIMEOUT documented. Shows 30-min default and env override. (deepspeed.readthedocs.io)
  • Accelerate kwargs and examples for InitProcessGroupKwargs(timeout=...). Use recent docs. (Hugging Face)
  • Accelerate bug threads where timeout was ignored with DeepSpeed. Upgrade to fix. (GitHub)

Evidence of the 10-minute watchdog and typical failure logs

  • GitHub issues and HF threads with Timeout(ms)=600000) in ProcessGroupNCCL logs. (GitHub)

Rank-desync during save/eval or load

  • HF issue: timeout “when try to save.”
  • HF forum: accelerator.save_state timeout.
  • HF forum: resume/load then timeout on first backward. (GitHub)

Diagnostics and NCCL envs

  • PyTorch ProcessGroupNCCL env reference: TORCH_NCCL_BLOCKING_WAIT, dump and tracing knobs. (docs.pytorch.org)
  • NCCL env catalogue for networking knobs. (NVIDIA Docs)

Networking and container hygiene

  • SHM/memlock and NCCL_P2P_LEVEL=NVL fixes reported by practitioners. (Stack Overflow)
  • DeepSpeed env propagation with .deepspeed_env. (DeepSpeed)
  • nccl-tests to validate topology. (GitHub)

Bottom line

Raise the PG timeout where it is consumed (DeepSpeed timeout or DEEPSPEED_TIMEOUT plus InitProcessGroupKwargs). Upgrade Accelerate so those settings apply. Then remove rank desyncs in save/eval and fix the network/container. This combination, not NCCL_TIMEOUT alone, stops the watchdog aborts. (deepspeed.readthedocs.io)

1 Like

I think it is OOM. Using another framework throws an OOM error after running for a while.

Thank you for providing such a great answer :slight_smile:

1 Like