Hi everyone,
I’m currently using DeepSpeed to train my model and encountering an issue when scaling up the number of GPUs. Here’s the command I’m using:
accelerate launch --config_file CONFIG_FILE_PATH my_script.py
Config file
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 10
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
The Problem
- When I run my code on 2 GPUs with num_processes: 2, everything works fine.
- When I scale up to 4 or 10 GPUs, the script times out after about 15 minutes with no training progress and no W&B logs (exact commands below).
Here’s what I’ve noticed:
- The GPUs show 100% usage, but there’s only minimal VRAM usage.
- No output or training progress is logged before the timeout.
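To be explicit, the only thing that changes between the working and the failing runs is the num_processes value. I edit it in the config file above; overriding it on the command line should be equivalent:

# works: 2 processes / 2 GPUs
accelerate launch --config_file CONFIG_FILE_PATH --num_processes 2 my_script.py

# hangs with no training output, then hits the NCCL timeout below (same with 4)
accelerate launch --config_file CONFIG_FILE_PATH --num_processes 10 my_script.py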
What I’ve Tried
- Using zero_stage=1.
- Using zero_stage=3.
Unfortunately, neither configuration resolved the issue.
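For both attempts the only line I touched was zero_stage inside the deepspeed_config block, leaving everything else in the config above unchanged:

deepspeed_config:
  # ... same keys as in the config above ...
  zero_stage: 1   # and, separately, 3 (the value shown above)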
Logs
[2024-12-20 14:04:18,013] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W1220 14:04:19.058000 8047 torch/distributed/run.py:793]
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] *****************************************
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1220 14:04:19.058000 8047 torch/distributed/run.py:793] *****************************************
[2024-12-20 14:04:29,429] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,159] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,354] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,399] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:30,433] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,568] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,568] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,575] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,600] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,659] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:30,700] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-20 14:04:31,226] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,365] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,490] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,539] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,575] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,575] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-12-20 14:04:31,582] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,692] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,717] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-12-20 14:04:31,739] [INFO] [comm.py:652:init_distributed] cdb=None
[rank6]:[E1220 14:14:37.438112374 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank6]:[E1220 14:14:37.438336003 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank2]:[E1220 14:14:37.442507458 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank2]:[E1220 14:14:37.442713701 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank1]:[E1220 14:14:37.457598340 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
[rank1]:[E1220 14:14:37.457813862 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank7]:[E1220 14:14:37.457947898 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank7]:[E1220 14:14:37.458200735 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank8]:[E1220 14:14:37.460657147 ProcessGroupNCCL.cpp:616] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
[rank8]:[E1220 14:14:37.460864558 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 8] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank9]:[E1220 14:14:37.485040965 ProcessGroupNCCL.cpp:616] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
[rank9]:[E1220 14:14:37.485146880 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 9] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:14:37.490184113 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank0]:[E1220 14:14:37.490437745 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank3]:[E1220 14:14:37.500092023 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank3]:[E1220 14:14:37.500319630 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank4]:[E1220 14:14:37.501516335 ProcessGroupNCCL.cpp:616] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600066 milliseconds before timing out.
[rank4]:[E1220 14:14:37.501715737 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank5]:[E1220 14:14:37.505478470 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600074 milliseconds before timing out.
[rank5]:[E1220 14:14:37.505672530 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:20:17.814397241 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 146, last completed NCCL work: -1.
[rank0]:[E1220 14:20:17.814446641 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1220 14:20:17.814469986 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1220 14:20:17.819325164 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x787b95e19772 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x787b95e20bb3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x787b95e2261d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=262668288, NumelOut=262668288, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x787b95e19772 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x787b95e20bb3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x787b95e2261d in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x787be076c446 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x787b95a8f71b in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x787be0bd75c0 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x787be2b3eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x787be2bcfa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank8]: Traceback (most recent call last):
[rank8]: File "/workspace/train.py", line 94, in <module>
[rank8]: fire.Fire(train)
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 135, in Fire
[rank8]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 468, in _Fire
[rank8]: component, remaining_args = _CallAndUpdateTrace(
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank8]: component = fn(*varargs, **kwargs)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/workspace/train.py", line 84, in train
[rank8]: trainer.train()
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2164, in train
[rank8]: return inner_training_loop(
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank8]: batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 5130, in get_batch_samples
[rank8]: batch_samples += [next(epoch_iterator)]
[rank8]: ^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/accelerate/data_loader.py", line 552, in __iter__
[rank8]: current_batch = next(dataloader_iter)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank8]: data = self._next_data()
[rank8]: ^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank8]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank8]: return self.collate_fn(data)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/data/data_collator.py", line 92, in default_data_collator
[rank8]: return torch_default_data_collator(features)
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: File "/usr/local/lib/python3.11/dist-packages/transformers/data/data_collator.py", line 158, in torch_default_data_collator
[rank8]: batch[k] = torch.tensor([f[k] for f in features])
[rank8]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank8]: ValueError: expected sequence of length 530 at dim 1 (got 551)
W1220 14:20:39.722000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8217 closing signal SIGTERM
W1220 14:20:39.724000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8218 closing signal SIGTERM
W1220 14:20:39.725000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8219 closing signal SIGTERM
W1220 14:20:39.725000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8220 closing signal SIGTERM
W1220 14:20:39.726000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8221 closing signal SIGTERM
W1220 14:20:39.726000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8222 closing signal SIGTERM
W1220 14:20:39.727000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8223 closing signal SIGTERM
W1220 14:20:39.727000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8224 closing signal SIGTERM
W1220 14:20:39.728000 8047 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 8225 closing signal SIGTERM
E1220 14:20:41.664000 8047 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 8216) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1153, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-12-20_14:20:39
host : 0f070c2247bf
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 8216)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 8216
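For context, train.py is a plain fire + transformers Trainer script. The sketch below keeps only the structure that is visible in the traceback above; the model name, dataset loading, and hyperparameter values are placeholders rather than my real ones:

# Simplified sketch of train.py -- only the structure visible in the traceback.
# Model name, dataset path, and hyperparameters are placeholders.
import fire
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments


def train(model_name="MODEL_NAME", data_path="DATA_PATH", output_dir="outputs"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Tokenize without padding, so sequences end up with different lengths.
    def tokenize(example):
        tokens = tokenizer(example["text"])
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    dataset = load_dataset("json", data_files=data_path, split="train")
    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            per_device_train_batch_size=8,
            bf16=True,
            report_to="wandb",
        ),
        train_dataset=dataset,
        # No data_collator is passed, so the default_data_collator from the
        # traceback is the one batching these variable-length sequences.
    )
    trainer.train()


if __name__ == "__main__":
    fire.Fire(train)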