Hi all,
I have been trying to run training on two GPUs on a single node (launched with accelerate, with FSDP enabled) and I'm hitting the issue below. Could someone please help me understand what's going wrong?
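For context, the relevant part of training_multi_GPU.py looks roughly like this — a simplified sketch, not the exact code; the model id, dataset source, and LoRA config are placeholders:

```python
# Simplified sketch of training_multi_GPU.py (placeholder model id and data paths).
# Launched with something like: accelerate launch training_multi_GPU.py
# FSDP comes from my accelerate config, as the traceback below shows
# accelerate wrapping the model in FSDP inside trainer.train().
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def main():
    tokenizer = AutoTokenizer.from_pretrained("base-model")            # placeholder id
    model = AutoModelForCausalLM.from_pretrained("base-model")         # 4 checkpoint shards
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))   # -> PeftModelForCausalLM

    dataset = load_dataset("parquet", data_files="train.parquet")["train"]  # placeholder
    dataset = dataset.map(lambda ex: tokenizer(ex["chunked_text"], truncation=True))

    args = TrainingArguments(output_dir="out",
                             per_device_train_batch_size=2,  # "batch size of: 2" in the log
                             report_to="wandb")
    trainer = Trainer(model=model, args=args, train_dataset=dataset,
                      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    print("starting the training")
    trainer.train()  # both ranks reach this point, then rank 1 times out in FSDP setup

if __name__ == "__main__":
    main()
```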
Error log:
Dataset({
    features: ['chunked_text', '__index_level_0__'],
    num_rows: 1608
})
Dataset({
    features: ['chunked_text', '__index_level_0__'],
    num_rows: 3443
})
Map: 100%|██████████| 1608/1608 [00:01<00:00, 1288.62 examples/s]
Map:  73%|███████▎  | 2504/3443 [00:01<00:00, 1343.47 examples/s]
[W815 19:01:18.589377736 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Map: 100%|██████████| 3443/3443 [00:02<00:00, 1360.10 examples/s]
[W815 19:01:19.031385478 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.87s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:38<00:00,  9.63s/it]
starting the training
starting the training
Currently training with a batch size of: 2
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `PeftModelForCausalLM.forward`, you can safely ignore this message.
local_host:1648840:1648840 [0] NCCL INFO Bootstrap : Using eth0:172.23.61.152<0>
local_host:1648840:1648840 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
local_host:1648840:1648840 [0] NCCL INFO cudaDriverVersion 12020
local_host:1648840:1648840 [0] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'
NCCL version 2.20.5+cuda12.4
local_host:1648840:1649517 [0] NCCL INFO Failed to open libibverbs.so[.1]
local_host:1648840:1649517 [0] NCCL INFO NET/Socket : Using [0]eth0:172.23.61.152<0>
local_host:1648840:1649517 [0] NCCL INFO Using non-device net plugin version 0
local_host:1648840:1649517 [0] NCCL INFO Using network Socket
local_host:1648841:1648841 [1] NCCL INFO cudaDriverVersion 12020
local_host:1648841:1648841 [1] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'
local_host:1648841:1648841 [1] NCCL INFO Bootstrap : Using eth0:172.23.61.152<0>
local_host:1648841:1648841 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
local_host:1648841:1649518 [1] NCCL INFO Failed to open libibverbs.so[.1]
local_host:1648841:1649518 [1] NCCL INFO NET/Socket : Using [0]eth0:172.23.61.152<0>
local_host:1648841:1649518 [1] NCCL INFO Using non-device net plugin version 0
local_host:1648841:1649518 [1] NCCL INFO Using network Socket
local_host:1648841:1649518 [1] NCCL INFO comm 0x55c8834a85c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 65000 commId 0xb28bd9d738f58bd1 - Init START
local_host:1648840:1649517 [0] NCCL INFO comm 0x55ee7350ba50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xb28bd9d738f58bd1 - Init START
local_host:1648841:1649518 [1] NCCL INFO comm 0x55c8834a85c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
local_host:1648840:1649517 [0] NCCL INFO comm 0x55ee7350ba50 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
local_host:1648841:1649518 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
local_host:1648840:1649517 [0] NCCL INFO Channel 00/02 : 0 1
local_host:1648841:1649518 [1] NCCL INFO P2P Chunksize set to 131072
local_host:1648840:1649517 [0] NCCL INFO Channel 01/02 : 0 1
local_host:1648840:1649517 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
local_host:1648840:1649517 [0] NCCL INFO P2P Chunksize set to 131072
local_host:1648840:1649517 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
local_host:1648840:1649517 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
local_host:1648841:1649518 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
local_host:1648841:1649518 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
local_host:1648841:1649518 [1] NCCL INFO Connected all rings
local_host:1648840:1649517 [0] NCCL INFO Connected all rings
local_host:1648840:1649517 [0] NCCL INFO Connected all trees
local_host:1648841:1649518 [1] NCCL INFO Connected all trees
local_host:1648841:1649518 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
local_host:1648841:1649518 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
local_host:1648840:1649517 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
local_host:1648840:1649517 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
local_host:1648840:1649517 [0] NCCL INFO comm 0x55ee7350ba50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xb28bd9d738f58bd1 - Init COMPLETE
local_host:1648841:1649518 [1] NCCL INFO comm 0x55c8834a85c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 65000 commId 0xb28bd9d738f58bd1 - Init COMPLETE
[rank1]:[W815 19:12:03.444115389 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 2 - No such file or directory).
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kamuhammad/AI_Projects/Demo_llm/training_multi_GPU.py", line 202, in <module>
[rank1]:     main()
[rank1]:   File "/home/kamuhammad/AI_Projects/Demo_llm/training_multi_GPU.py", line 199, in main
[rank1]:     trainer.train()
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1948, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2095, in _inner_training_loop
[rank1]:     self.model = self.accelerator.prepare(self.model)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1311, in prepare
[rank1]:     result = tuple(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1312, in <genexpr>
[rank1]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1188, in _prepare_one
[rank1]:     return self.prepare_model(obj, device_placement=device_placement)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1485, in prepare_model
[rank1]:     model = FSDP(model, **kwargs)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
[rank1]:     _auto_wrap(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
[rank1]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:   [Previous line repeated 2 more times]
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
[rank1]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
[rank1]:     return wrapper_cls(module, **kwargs)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank1]:     _init_param_handle_from_module(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 596, in _init_param_handle_from_module
[rank1]:     _sync_module_params_and_buffers(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1094, in _sync_module_params_and_buffers
[rank1]:     _sync_params_and_buffers(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/utils.py", line 326, in _sync_params_and_buffers
[rank1]:     dist._broadcast_coalesced(
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
[rank1]: Exception raised from doWait at …/torch/csrc/distributed/c10d/TCPStore.cpp:570 (most recent call first):
[rank1]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6ac1377f86 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank1]: frame #1: + 0x16583cb (0x7f6aa8c333cb in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6aad2e5b82 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6aad2e6d71 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId, bool, std::string const&, int) + 0xaf (0x7f6a73bd6f6f in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0x114c (0x7f6a73be2d4c in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x643 (0x7f6a73bef833 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #11: + 0x5cb44e6 (0x7f6aad28f4e6 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #12: + 0x5cbf796 (0x7f6aad29a796 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #13: + 0x52dfa0b (0x7f6aac8baa0b in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #14: + 0x52dd284 (0x7f6aac8b8284 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #15: + 0x1adf2b8 (0x7f6aa90ba2b8 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #16: + 0x5cc46aa (0x7f6aad29f6aa in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #17: + 0x5cd428c (0x7f6aad2af28c in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #18: c10d::broadcast_coalesced(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::ArrayRef<at::Tensor>, unsigned long, int) + 0x7a4 (0x7f6aad2fbbc4 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #19: + 0xd55e0e (0x7f6ac055ee0e in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #20: + 0x4b00e4 (0x7f6abfcb90e4 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #21: + 0x15adae (0x55c845dbbdae in /usr/bin/python3)
[rank1]: frame #22: _PyObject_MakeTpCall + 0x25b (0x55c845db252b in /usr/bin/python3)
[rank1]: frame #23: _PyEval_EvalFrameDefault + 0x6f0b (0x55c845dab16b in /usr/bin/python3)
[rank1]: frame #24: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #25: _PyEval_EvalFrameDefault + 0x19b6 (0x55c845da5c16 in /usr/bin/python3)
[rank1]: frame #26: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #27: _PyEval_EvalFrameDefault + 0x6d5 (0x55c845da4935 in /usr/bin/python3)
[rank1]: frame #28: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #29: _PyEval_EvalFrameDefault + 0x6d5 (0x55c845da4935 in /usr/bin/python3)
[rank1]: frame #30: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #31: _PyObject_FastCallDictTstate + 0x16d (0x55c845db176d in /usr/bin/python3)
[rank1]: frame #32: + 0x1657a4 (0x55c845dc67a4 in /usr/bin/python3)
[rank1]: frame #33: + 0x1518db (0x55c845db28db in /usr/bin/python3)
[rank1]: frame #34: PyObject_Call + 0xbb (0x55c845dcae9b in /usr/bin/python3)
[rank1]: frame #35: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #36: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #37: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #38: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #39: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #40: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #41: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #42: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #43: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #44: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #45: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #46: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #47: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #48: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #49: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #50: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #51: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #52: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #53: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #54: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #55: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #56: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #57: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #58: _PyEval_EvalFrameDefault + 0x6d5 (0x55c845da4935 in /usr/bin/python3)
[rank1]: frame #59: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #60: _PyObject_FastCallDictTstate + 0x16d (0x55c845db176d in /usr/bin/python3)
[rank1]: frame #61: + 0x1657a4 (0x55c845dc67a4 in /usr/bin/python3)
[rank1]: frame #62: + 0x1518db (0x55c845db28db in /usr/bin/python3)
[rank1]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[rank1]: [… the identical rank 1 traceback and C++ stack trace are printed a second time …]
wandb: View run fallen-fog-27 at:
wandb: View project at:
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at:
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See
Sending process 1648840 closing signal SIGTERM
E0815 19:12:12.350000 140204064797120 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 1648841) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/kamuhammad/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training_multi_GPU.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-15_19:12:10
  host      : local_host
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1648841)
  error_file: <N/A>
  traceback : To enable traceback see:
As you can see from the log, the failure happens straight after launch: training never starts, and rank 1 times out waiting on the c10d store while accelerate is wrapping the model in FSDP. Thanks in advance for any pointers.
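To help narrow it down, would running a bare-bones NCCL check like the one below be a sensible way to separate NCCL/CUDA initialization from the Trainer/FSDP stack? This is an untested sketch; the file name and the torchrun invocation in the comments are my assumptions, not something from the failing run.

```python
# nccl_check.py -- minimal NCCL sanity check (untested sketch).
# Intended launch: torchrun --nproc_per_node=2 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # same backend the failing run uses
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # exercises the same communicator setup that times out above
    print(f"rank {dist.get_rank()}: all_reduce OK, value = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```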