Timeout Issue with DeepSpeed on Multiple GPUs

Hi everyone,

I’m using DeepSpeed (via Accelerate) to train my model and I’m running into an issue when scaling up the number of GPUs. Here’s the command I’m using:

accelerate launch --config_file CONFIG_FILE_PATH my_script.py  

Config file

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 10
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
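
For context, my understanding is that this config is roughly equivalent to building the DeepSpeed plugin in Python as sketched below (just how I read the YAML, in case I’m misconfiguring something):

# Sketch: roughly how I understand the YAML above maps onto Accelerate's
# DeepSpeed plugin if it were set up in Python instead of via the config file.
from accelerate import Accelerator, DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_init_flag=True,
    zero3_save_16bit_model=True,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)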

The Problem

  • When I run my code on 2 GPUs with num_processes: 2, everything works fine.
  • When I scale up to 4 or 10 GPUs, the script makes no training progress, logs nothing to W&B, and times out after roughly 15 minutes.

Here’s what I’ve noticed:

  1. The GPUs show 100% utilization, but VRAM usage stays minimal.
  2. No output or training progress is logged before the timeout (see the debugging sketch just below).
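
To narrow down where the hang happens, this is the debugging sketch I plan to drop at the very top of my_script.py. The environment variables are the standard PyTorch/NCCL ones; normally I’d export them in the shell, setting them in Python here is just for illustration:

# Sketch: verbose distributed logging plus a per-rank sanity print, placed
# before anything initializes the process group or NCCL communicators.
import os
import torch

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL init/collective logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d consistency checks

rank = os.environ.get("RANK", "0")
world_size = os.environ.get("WORLD_SIZE", "1")
print(f"rank={rank}/{world_size} visible_gpus={torch.cuda.device_count()}", flush=True)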

What I’ve Tried

  • Using zero_stage=1.
  • Using zero_stage=3.

Unfortunately, neither configuration resolved the issue.
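
Since the crash is an NCCL watchdog timeout, one more thing I’m considering (not tried yet, and it would only hide a hang rather than explain it) is raising the collective timeout through the Trainer arguments. A minimal sketch, assuming my TrainingArguments are built in train.py; the values here are placeholders, not my real settings:

# Sketch: raise the distributed collective timeout (default is 1800 seconds)
# via transformers.TrainingArguments. Placeholder values, not my real config.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",  # placeholder
    bf16=True,
    ddp_timeout=7200,      # seconds before the NCCL watchdog gives up
)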

Logs
[2024-12-20 14:04:18,013] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] 
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] *****************************************
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] *****************************************
[2024-12-20 14:04:29,429] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,159] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,354] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,399] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:30,433] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,568] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,568] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,575] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,600] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,659] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,700] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:31,226] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,365] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,490] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,539] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,575] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,575] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-12-20 14:04:31,582] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,692] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,717] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,739] [INFO] [comm.py:652:init_distributed] cdb=None
[rank6]:[E1220 14:14:37.438112374 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank6]:[E1220 14:14:37.438336003 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank2]:[E1220 14:14:37.442507458 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank2]:[E1220 14:14:37.442713701 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank1]:[E1220 14:14:37.457598340 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank1]:[E1220 14:14:37.457813862 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank7]:[E1220 14:14:37.457947898 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank7]:[E1220 14:14:37.458200735 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank8]:[E1220 14:14:37.460657147 ProcessGroupNCCL.cpp:616] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
[rank8]:[E1220 14:14:37.460864558 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 8] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank9]:[E1220 14:14:37.485040965 ProcessGroupNCCL.cpp:616] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
[rank9]:[E1220 14:14:37.485146880 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 9] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:14:37.490184113 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank0]:[E1220 14:14:37.490437745 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank3]:[E1220 14:14:37.500092023 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank3]:[E1220 14:14:37.500319630 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank4]:[E1220 14:14:37.501516335 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank4]:[E1220 14:14:37.501715737 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank5]:[E1220 14:14:37.505478470 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600074 milliseconds before timing out.
[rank5]:[E1220 14:14:37.505672530 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:20:17.814397241 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:20:17.814446641 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1220 14:20:17.814469986 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1220 14:20:17.819325164 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x787b95e19772 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x787b95e20bb3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x787b95e2261d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4:  + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5:  + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x787b95e19772 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x787b95e20bb3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x787b95e2261d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x787b95a8f71b in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank8]: Traceback (most recent call last):
[rank8]: File "/workspace/train.py", line 94, in <module>
[rank8]: fire.Fire(train)
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 135, in Fire
[rank8]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 468, in _Fire
[rank8]: component, remaining_args = _CallAndUpdateTrace(
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank8]: component = fn(*varargs, **kwargs)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/workspace/train.py", line 84, in train
[rank8]: trainer.train()
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2164, in train
[rank8]: return inner_training_loop(
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank8]: batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 5130, in get_batch_samples
[rank8]: batch_samples += [next(epoch_iterator)]
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/accelerate/data_loader.py", line 552, in __iter__
[rank8]: current_batch = next(dataloader_iter)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank8]: data = self._next_data()
[rank8]: ^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank8]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank8]: return self.collate_fn(data)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/data/data_collator.py", line 92, in default_data_collator
[rank8]: return torch_default_data_collator(features)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/data/data_collator.py", line 158, in torch_default_data_collator
[rank8]: batch[k] = torch.tensor([f[k] for f in features])
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: ValueError: expected sequence of length 530 at dim 1 (got 551)
W1220 14:20:39.722000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8217 closing signal SIGTERM
W1220 14:20:39.724000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8218 closing signal SIGTERM
W1220 14:20:39.725000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8219 closing signal SIGTERM
W1220 14:20:39.725000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8220 closing signal SIGTERM
W1220 14:20:39.726000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8221 closing signal SIGTERM
W1220 14:20:39.726000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8222 closing signal SIGTERM
W1220 14:20:39.727000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8223 closing signal SIGTERM
W1220 14:20:39.727000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8224 closing signal SIGTERM
W1220 14:20:39.728000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8225 closing signal SIGTERM
E1220 14:20:41.664000 8047 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 8216) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1153, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-20_14:20:39
host : 0f070c2247bf
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 8216)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 8216
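
One more observation: the very last traceback (rank 8 failing in torch_default_data_collator with "expected sequence of length 530 at dim 1 (got 551)") makes me suspect that some ranks hit variable-length examples the default collator can't stack, crash before the first broadcast completes, and leave the remaining ranks hanging until the NCCL timeout. If that's the cause, a padding collator should avoid it; here is a minimal, self-contained sketch (the model name is only a placeholder, not what I actually train):

# Sketch: the default collator calls torch.tensor() on raw feature lists and
# fails on ragged lengths; DataCollatorWithPadding pads each batch instead.
# "bert-base-uncased" is only a placeholder tokenizer for illustration.
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

features = [
    tokenizer("a short example"),
    tokenizer("a noticeably longer example that produces more tokens"),
]
batch = collator(features)
print(batch["input_ids"].shape)  # padded to the longest sequence in the batch

In the real script this would be passed to the Trainer via its data_collator argument.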
