Timeout Issue with DeepSpeed on Multiple GPUs

Hi everyone,

I’m using DeepSpeed (via Accelerate) to train my model and I’m running into an issue when scaling up the number of GPUs. Here’s the command I’m using:

accelerate launch --config_file CONFIG_FILE_PATH my_script.py  

Config file

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 10
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
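
For context, my understanding is that this config is roughly equivalent to building the DeepSpeed plugin in Python as sketched below (just how I read the YAML, in case I’m misconfiguring something):

# Sketch: roughly how I understand the YAML above maps onto Accelerate's
# DeepSpeed plugin if it were set up in Python instead of via the config file.
from accelerate import Accelerator, DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_init_flag=True,
    zero3_save_16bit_model=True,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)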

The Problem

  • When I run my code on 2 GPUs with num_processes: 2, everything works fine.
  • When I scale up to 4 or 10 GPUs, the script makes no training progress, logs nothing to W&B, and times out after roughly 15 minutes.

Here’s what I’ve noticed:

  1. The GPUs show 100% utilization, but VRAM usage stays minimal.
  2. No output or training progress is logged before the timeout (see the debugging sketch just below).
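
To narrow down where the hang happens, this is the debugging sketch I plan to drop at the very top of my_script.py. The environment variables are the standard PyTorch/NCCL ones; normally I’d export them in the shell, setting them in Python here is just for illustration:

# Sketch: verbose distributed logging plus a per-rank sanity print, placed
# before anything initializes the process group or NCCL communicators.
import os
import torch

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL init/collective logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d consistency checks

rank = os.environ.get("RANK", "0")
world_size = os.environ.get("WORLD_SIZE", "1")
print(f"rank={rank}/{world_size} visible_gpus={torch.cuda.device_count()}", flush=True)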

What I’ve Tried

  • Using zero_stage=1.
  • Using zero_stage=3.

Unfortunately, neither configuration resolved the issue.
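
Since the crash is an NCCL watchdog timeout, one more thing I’m considering (not tried yet, and it would only hide a hang rather than explain it) is raising the collective timeout through the Trainer arguments. A minimal sketch, assuming my TrainingArguments are built in train.py; the values here are placeholders, not my real settings:

# Sketch: raise the distributed collective timeout (default is 1800 seconds)
# via transformers.TrainingArguments. Placeholder values, not my real config.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder
    bf16=True,
    ddp_timeout=7200,      # seconds before the NCCL watchdog gives up
)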

Logs
[2024-12-20 14:04:18,013] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] 
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] *****************************************
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] *****************************************
[2024-12-20 14:04:29,429] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,159] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,354] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,399] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:30,433] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,568] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,568] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,575] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,600] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,659] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,700] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:31,226] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,365] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,490] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,539] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,575] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,575] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-12-20 14:04:31,582] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,692] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,717] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,739] [INFO] [comm.py:652:init_distributed] cdb=None
[rank6]:[E1220 14:14:37.438112374 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank6]:[E1220 14:14:37.438336003 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank2]:[E1220 14:14:37.442507458 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank2]:[E1220 14:14:37.442713701 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank1]:[E1220 14:14:37.457598340 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank1]:[E1220 14:14:37.457813862 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank7]:[E1220 14:14:37.457947898 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank7]:[E1220 14:14:37.458200735 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank8]:[E1220 14:14:37.460657147 ProcessGroupNCCL.cpp:616] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
[rank8]:[E1220 14:14:37.460864558 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 8] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank9]:[E1220 14:14:37.485040965 ProcessGroupNCCL.cpp:616] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
[rank9]:[E1220 14:14:37.485146880 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 9] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:14:37.490184113 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank0]:[E1220 14:14:37.490437745 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank3]:[E1220 14:14:37.500092023 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank3]:[E1220 14:14:37.500319630 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank4]:[E1220 14:14:37.501516335 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank4]:[E1220 14:14:37.501715737 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank5]:[E1220 14:14:37.505478470 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600074 milliseconds before timing out.
[rank5]:[E1220 14:14:37.505672530 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:20:17.814397241 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:20:17.814446641 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1220 14:20:17.814469986 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1220 14:20:17.819325164 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x787b95e19772 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x787b95e20bb3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x787b95e2261d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4:  + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5:  + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x787b95e19772 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x787b95e20bb3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x787b95e2261d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x787b95a8f71b in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank8]: Traceback (most recent call last):
[rank8]: File "/workspace/train.py", line 94, in <module>
[rank8]: fire.Fire(train)
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 135, in Fire
[rank8]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 468, in _Fire
[rank8]: component, remaining_args = _CallAndUpdateTrace(
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank8]: component = fn(*varargs, **kwargs)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/workspace/train.py", line 84, in train
[rank8]: trainer.train()
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2164, in train
[rank8]: return inner_training_loop(
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank8]: batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 5130, in get_batch_samples
[rank8]: batch_samples += [next(epoch_iterator)]
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/accelerate/data_loader.py", line 552, in __iter__
[rank8]: current_batch = next(dataloader_iter)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank8]: data = self._next_data()
[rank8]: ^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank8]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank8]: return self.collate_fn(data)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/data/data_collator.py", line 92, in default_data_collator
[rank8]: return torch_default_data_collator(features)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/data/data_collator.py", line 158, in torch_default_data_collator
[rank8]: batch[k] = torch.tensor([f[k] for f in features])
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: ValueError: expected sequence of length 530 at dim 1 (got 551)
W1220 14:20:39.722000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8217 closing signal SIGTERM
W1220 14:20:39.724000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8218 closing signal SIGTERM
W1220 14:20:39.725000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8219 closing signal SIGTERM
W1220 14:20:39.725000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8220 closing signal SIGTERM
W1220 14:20:39.726000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8221 closing signal SIGTERM
W1220 14:20:39.726000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8222 closing signal SIGTERM
W1220 14:20:39.727000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8223 closing signal SIGTERM
W1220 14:20:39.727000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8224 closing signal SIGTERM
W1220 14:20:39.728000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8225 closing signal SIGTERM
E1220 14:20:41.664000 8047 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 8216) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1153, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-20_14:20:39
host : 0f070c2247bf
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 8216)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 8216
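
One more observation: the very last traceback (rank 8 failing in torch_default_data_collator with "expected sequence of length 530 at dim 1 (got 551)") makes me suspect that some ranks hit variable-length examples the default collator can't stack, crash before the first broadcast completes, and leave the remaining ranks hanging until the NCCL timeout. If that's the cause, a padding collator should avoid it; here is a minimal, self-contained sketch (the model name is only a placeholder, not what I actually train):

# Sketch: the default collator calls torch.tensor() on raw feature lists and
# fails on ragged lengths; DataCollatorWithPadding pads each batch instead.
# "bert-base-uncased" is only a placeholder tokenizer for illustration.
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

features = [
    tokenizer("a short example"),
    tokenizer("a noticeably longer example that produces more tokens"),
]
batch = collator(features)
print(batch["input_ids"].shape)  # padded to the longest sequence in the batch

In the real script this would be passed to the Trainer via its data_collator argument.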
