I think we need to set timeouts other than the NCCL one as well.
Your timeout still fires because you changed the wrong knob and you likely have a rank that is stalled. In PyTorch+DeepSpeed the watchdog that aborts jobs is owned by ProcessGroupNCCL, not by bare NCCL. Its default “Timeout(ms)” in logs is 600000 (10 minutes). DeepSpeed also has its own 30-minute default unless you override it. If Accelerate initializes the process group via DeepSpeed and ignores your override, you still hit 10 minutes. Logs from many reports show exactly this pattern. (GitHub)
Below is a detailed, practical map: background → common causes → concrete fixes.
Background in one page
- There are multiple timeouts.
  - PyTorch ProcessGroupNCCL watchdog controls the “Timeout(ms)=…” you see in crashes. You raise it via init_process_group(timeout=...) or through frameworks that pass it down (see the minimal sketch right after this list). If not set correctly, it stays at roughly 10 minutes and aborts with messages like “Watchdog caught collective operation timeout: … Timeout(ms)=600000) ran for 6000xx ms.” (GitHub)
  - DeepSpeed init_distributed wraps the Torch init and has its own default of 30 minutes, overridable with DEEPSPEED_TIMEOUT (seconds) or by passing an explicit timeout=timedelta(...). (deepspeed.readthedocs.io)
  - NCCL library envs tune transport and network behavior (NCCL_SOCKET_IFNAME, NCCL_P2P_LEVEL, IB timers, etc.). They do not by themselves raise the PyTorch watchdog limit. (NVIDIA Docs)
  - Accelerate integration bug. Recent issues show InitProcessGroupKwargs(timeout=...) being ignored when DeepSpeed owned initialization. Users saw 10-minute aborts until they upgraded Accelerate. (GitHub)
- Timeouts are usually symptoms of a stalled rank, not of an inherently too-short limit: save/eval pauses on one rank, uneven dataloader lengths, an OOM or exception on a single worker, or network setup issues. The watchdog then times the collective out after N minutes. (PyTorch Forums)
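To make the first point concrete, here is a minimal sketch (not your training code) of raising the ProcessGroupNCCL timeout in bare PyTorch. It assumes a torchrun launch so RANK, WORLD_SIZE and MASTER_ADDR are already in the environment; the 90-minute value is only an example.
# Minimal sketch: the PG timeout is set at init time and enforced by the watchdog.
from datetime import timedelta
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(minutes=90))
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # collectives issued after init inherit the 90-minute limit
dist.destroy_process_group()
With DeepSpeed or Accelerate in the loop you do not call init_process_group yourself; the same value has to reach whichever layer performs the init, which is what section A and the recipe below address.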
Why your NCCL_TIMEOUT=1800000 did nothing
- The abort you see is from ProcessGroupNCCL’s watchdog, which looks at the PG timeout, not a raw NCCL_TIMEOUT knob. Your logs show the watchdog honoring 600000 ms. Multiple users report that setting an env alone leaves the watchdog at 10 minutes; you must raise the PG timeout explicitly or via DeepSpeed. (GitHub)
- If the Accelerate+DeepSpeed path ignores your kwargs, you still run with the default and hit the 10-minute abort. Recent Accelerate releases fix this. (GitHub)
Root causes → fixes
A) Wrong timeout knob or it isn’t propagating
Cause. PG timeout wasn’t raised where DeepSpeed/Torch actually read it, or Accelerate dropped it.
Fix. Upgrade Accelerate. Set DEEPSPEED_TIMEOUT and pass an explicit PG timeout via kwargs.
# Versions
pip install -U accelerate deepspeed # keep these current
# DeepSpeed consumes this (seconds). Docs confirm override behavior.
# https://deepspeed.readthedocs.io/en/stable/initialize.html
export DEEPSPEED_TIMEOUT=5400
# Set the ProcessGroupNCCL timeout explicitly via Accelerate.
# https://huggingface.co/docs/accelerate/en/package_reference/kwargs
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import DeepSpeedPlugin
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accel = Accelerator(deepspeed_plugin=DeepSpeedPlugin(), kwargs_handlers=[pg_kwargs])
DeepSpeed docs: default 30 min, overridable by DEEPSPEED_TIMEOUT. HF reports: older Accelerate builds ignored these kwargs when DeepSpeed owned the init; newer builds fix it. (deepspeed.readthedocs.io)
Propagate to all ranks. DeepSpeed will forward NCCL_*/PYTHON* automatically and lets you pin extra envs with a .deepspeed_env file. Do this for multi-node. (DeepSpeed)
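If you take the .deepspeed_env route, the file is plain VAR=VALUE lines that the DeepSpeed launcher exports on every rank before starting your script. A sketch of, e.g., ~/.deepspeed_env (values are placeholders; where DeepSpeed looks for the file, home vs. launch directory, can vary by version, so check the linked docs):
NCCL_SOCKET_IFNAME=eth0
NCCL_DEBUG=INFO
DEEPSPEED_TIMEOUT=5400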
B) Rank desync during save/eval or data iteration
Symptoms. Timeouts on ALLREDUCE or ALLGATHER right after eval or accelerator.save_state. One rank is slower or never reaches the collective. Many reports match this pattern. (GitHub)
Fixes.
- Make loaders even across ranks. Use drop_last=True or ensure len(dataset) % world_size == 0. Mismatched step counts per rank cause hangs. (PyTorch Forums)
- Guard checkpointing. Save from rank 0 only and surround checkpoint I/O with barriers (or accelerator.wait_for_everyone()); see the sketch after this list. Users hit timeouts precisely on save. (GitHub)
- Treat any OOM or exception on a single worker as a desync root cause. Recent issues propose better behavior because a single crash leaves peers waiting until the watchdog fires. (GitHub)
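A minimal sketch of both guards in a plain Accelerate script (no DeepSpeed config), using toy data so it runs as-is; out_dir and the extra file name are placeholders. Note the pattern: accelerator.save_state is called on every rank and the library coordinates who writes what, while ad-hoc rank-0-only extras sit inside an is_main_process check.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
ds = TensorDataset(torch.randn(1000, 16), torch.randn(1000, 1))  # toy data
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Even number of steps on every rank: drop the ragged final batch.
loader = DataLoader(ds, batch_size=8, shuffle=True, drop_last=True)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

out_dir = "checkpoints/final"                   # placeholder path
accelerator.wait_for_everyone()                 # all ranks reach the checkpoint together
accelerator.save_state(out_dir)                 # call on every rank; the library restricts the writes
if accelerator.is_main_process:
    accelerator.save(model.state_dict(), f"{out_dir}/model_extra.bin")  # rank-0-only extras
accelerator.wait_for_everyone()                 # nobody runs ahead while files are still being written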
C) Networking and container friction
Causes. Wrong NIC, a slow or blocked network path, InfiniBand misconfiguration, too little shared memory in Docker, or multi-node rendezvous issues. These often manifest as “works at small scale or on a single node, times out at scale.” (GitHub)
Fixes.
- Pick the NIC: export NCCL_SOCKET_IFNAME=eth0 (or your interface). Tune or disable IB as needed. See the NCCL env reference. (NVIDIA Docs)
- On NVLink boxes, try export NCCL_P2P_LEVEL=NVL. Several users report stability improvements. (Stack Overflow)
- In containers, lift IPC/SHM/memlock limits: --ipc=host --shm-size=8g --ulimit memlock=-1. This resolves several “timeout” patterns that were actually IPC pressure. (Stack Overflow)
- Validate the fabric with nccl-tests before training. If rings fail under load, fix network first. (GitHub)
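A sketch of that validation step, assuming CUDA is installed where nccl-tests’ Makefile expects it (or CUDA_HOME is set); -g is the local GPU count, and multi-node sweeps go through mpirun instead.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
# Sweep all_reduce from 8 bytes to 256 MB across 8 local GPUs; watch busbw and any hangs or errors.
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8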
D) PyTorch NCCL diagnostics
Turn on Torch’s NCCL diagnostics to find the offending collective and rank instead of guessing.
# PyTorch ProcessGroupNCCL diagnostics
# Docs: https://docs.pytorch.org/docs/stable/torch_nccl_environment_variables.html
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_TRACE_BUFFER_SIZE=1048576
export NCCL_DEBUG=INFO
These envs print the timed-out op, sequence numbers, and often the stuck rank. Then you fix the exact point of desync. (docs.pytorch.org)
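For example, a typical debug launch with these set inline (train.py and the GPU count are placeholders). Keep in mind that TORCH_NCCL_BLOCKING_WAIT makes collectives synchronous and adds overhead, so treat it as a debugging switch rather than a production setting.
TORCH_NCCL_BLOCKING_WAIT=1 TORCH_NCCL_DUMP_ON_TIMEOUT=1 NCCL_DEBUG=INFO \
  torchrun --nproc_per_node=8 train.py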
End-to-end “works in practice” recipe
- Upgrade and set timeouts where they are read.
pip install -U accelerate deepspeed
# https://deepspeed.readthedocs.io/en/stable/initialize.html
export DEEPSPEED_TIMEOUT=5400
# Optional: turn on NCCL logging for visibility. NCCL envs alone won't raise the PG timeout.
export NCCL_DEBUG=INFO
# https://huggingface.co/docs/accelerate/en/package_reference/kwargs
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import DeepSpeedPlugin
accel = Accelerator(
    deepspeed_plugin=DeepSpeedPlugin(),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=5400))],
)
- Propagate envs across nodes with .deepspeed_env when launching multi-node. (DeepSpeed)
- Harden the loop:
  - drop_last=True on train/eval samplers.
  - Only rank 0 saves. Add accelerator.wait_for_everyone() around checkpoint I/O.
  Evidence: timeouts cluster around save/eval in reports. (GitHub)
- Harden the runtime:
  - Container flags: --ipc=host --shm-size=8g --ulimit memlock=-1.
  - Network hints: NCCL_SOCKET_IFNAME=..., NCCL_P2P_LEVEL=NVL on NVLink hosts.
  - Multi-node: verify rendezvous and that all ranks see identical envs. (Stack Overflow)
- Debug with signal, not guesses: enable the TORCH_NCCL_* variables, reproduce, read which collective and rank timed out, then fix that call site. (docs.pytorch.org)
Quick sanity checks you can run
- Confirm the PG timeout actually changed. Newer Accelerate honors InitProcessGroupKwargs with DeepSpeed; older builds did not. If your logs still show Timeout(ms)=600000), your override is not wired in. A minimal check script follows this list. (GitHub)
- Reproduce on a single node with nccl-tests. If the fabric is flaky, you will see it there too. (GitHub)
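If you want a direct, if slow, end-to-end check of the timeout itself, here is a hedged sketch: rank 0 stalls for 11 minutes before a collective, so the default 10-minute watchdog would abort while a properly raised timeout survives. Launch with torchrun --nproc_per_node=2 on a 2-GPU node; to test your real wiring, swap the bare init_process_group call for your actual Accelerator/DeepSpeed initialization.
import time
from datetime import timedelta
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(seconds=5400))
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
if rank == 0:
    time.sleep(660)  # 11 min: past the 600 s default, well under the 5400 s override
t = torch.ones(1, device="cuda")
dist.all_reduce(t)   # aborts here after ~10 min if the override is not in effect
print(f"rank {rank}: all_reduce completed; the raised timeout is active")
dist.destroy_process_group()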
Curated references (grouped)
Raise the correct timeout
- DeepSpeed timeout and DEEPSPEED_TIMEOUT documented. Shows the 30-minute default and the env override. (deepspeed.readthedocs.io)
- Accelerate kwargs and examples for InitProcessGroupKwargs(timeout=...). Use recent docs. (Hugging Face)
- Accelerate bug threads where timeout was ignored with DeepSpeed. Upgrade to fix. (GitHub)
Evidence of the 10-minute watchdog and typical failure logs
- GitHub issues and HF threads with Timeout(ms)=600000) in ProcessGroupNCCL logs. (GitHub)
Rank-desync during save/eval or load
- HF issue: timeout “when try to save.”
- HF forum: accelerator.save_state timeout.
- HF forum: resume/load then timeout on first backward. (GitHub)
Diagnostics and NCCL envs
- PyTorch ProcessGroupNCCL env reference: TORCH_NCCL_BLOCKING_WAIT, dump and tracing knobs. (docs.pytorch.org)
- NCCL env catalogue for networking knobs. (NVIDIA Docs)
Networking and container hygiene
- SHM/memlock and NCCL_P2P_LEVEL=NVL fixes reported by practitioners. (Stack Overflow)
- DeepSpeed env propagation with .deepspeed_env. (DeepSpeed)
- nccl-tests to validate topology. (GitHub)
Bottom line
Raise the PG timeout where it is consumed (DeepSpeed timeout or DEEPSPEED_TIMEOUT plus InitProcessGroupKwargs). Upgrade Accelerate so those settings apply. Then remove rank desyncs in save/eval and fix the network/container. This combination, not NCCL_TIMEOUT alone, stops the watchdog aborts. (deepspeed.readthedocs.io)