Hi all,
I have been trying to run training on two GPUs on a single node (launched with accelerate, with FSDP enabled) and I'm hitting the issue below. Could someone please help me understand what's going wrong?
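For context, the relevant part of training_multi_GPU.py looks roughly like this — a simplified sketch, not the exact code; the model id, dataset source, and LoRA config are placeholders:

```python
# Simplified sketch of training_multi_GPU.py (placeholder model id and data paths).
# Launched with something like: accelerate launch training_multi_GPU.py
# FSDP comes from my accelerate config, as the traceback below shows
# accelerate wrapping the model in FSDP inside trainer.train().
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def main():
    tokenizer = AutoTokenizer.from_pretrained("base-model")            # placeholder id
    model = AutoModelForCausalLM.from_pretrained("base-model")         # 4 checkpoint shards
    model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))   # -> PeftModelForCausalLM

    dataset = load_dataset("parquet", data_files="train.parquet")["train"]  # placeholder
    dataset = dataset.map(lambda ex: tokenizer(ex["chunked_text"], truncation=True))

    args = TrainingArguments(output_dir="out",
                             per_device_train_batch_size=2,  # "batch size of: 2" in the log
                             report_to="wandb")
    trainer = Trainer(model=model, args=args, train_dataset=dataset,
                      data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    print("starting the training")
    trainer.train()  # both ranks reach this point, then rank 1 times out in FSDP setup

if __name__ == "__main__":
    main()
```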
Error log:
Dataset({
    features: ['chunked_text', '__index_level_0__'],
    num_rows: 1608
})
Dataset({
    features: ['chunked_text', '__index_level_0__'],
    num_rows: 3443
})
Map: 100%|██████████| 1608/1608 [00:01<00:00, 1288.62 examples/s]
Map:  73%|███████▎  | 2504/3443 [00:01<00:00, 1343.47 examples/s]
[W815 19:01:18.589377736 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Map: 100%|██████████| 3443/3443 [00:02<00:00, 1360.10 examples/s]
[W815 19:01:19.031385478 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Loading checkpoint shards: 100%|██████████| 4/4 [00:39<00:00,  9.87s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:38<00:00,  9.63s/it]
starting the training
starting the training
Currently training with a batch size of: 2
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: __index_level_0__. If __index_level_0__ are not expected by `PeftModelForCausalLM.forward`, you can safely ignore this message.
local_host:1648840:1648840 [0] NCCL INFO Bootstrap : Using eth0:172.23.61.152<0>
local_host:1648840:1648840 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
local_host:1648840:1648840 [0] NCCL INFO cudaDriverVersion 12020
local_host:1648840:1648840 [0] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'
NCCL version 2.20.5+cuda12.4
local_host:1648840:1649517 [0] NCCL INFO Failed to open libibverbs.so[.1]
local_host:1648840:1649517 [0] NCCL INFO NET/Socket : Using [0]eth0:172.23.61.152<0>
local_host:1648840:1649517 [0] NCCL INFO Using non-device net plugin version 0
local_host:1648840:1649517 [0] NCCL INFO Using network Socket
local_host:1648841:1648841 [1] NCCL INFO cudaDriverVersion 12020
local_host:1648841:1648841 [1] misc/cudawrap.cc:36 NCCL WARN Cuda failure 3 'initialization error'
local_host:1648841:1648841 [1] NCCL INFO Bootstrap : Using eth0:172.23.61.152<0>
local_host:1648841:1648841 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
local_host:1648841:1649518 [1] NCCL INFO Failed to open libibverbs.so[.1]
local_host:1648841:1649518 [1] NCCL INFO NET/Socket : Using [0]eth0:172.23.61.152<0>
local_host:1648841:1649518 [1] NCCL INFO Using non-device net plugin version 0
local_host:1648841:1649518 [1] NCCL INFO Using network Socket
local_host:1648841:1649518 [1] NCCL INFO comm 0x55c8834a85c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 65000 commId 0xb28bd9d738f58bd1 - Init START
local_host:1648840:1649517 [0] NCCL INFO comm 0x55ee7350ba50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xb28bd9d738f58bd1 - Init START
local_host:1648841:1649518 [1] NCCL INFO comm 0x55c8834a85c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
local_host:1648840:1649517 [0] NCCL INFO comm 0x55ee7350ba50 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
local_host:1648841:1649518 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
local_host:1648840:1649517 [0] NCCL INFO Channel 00/02 : 0 1
local_host:1648841:1649518 [1] NCCL INFO P2P Chunksize set to 131072
local_host:1648840:1649517 [0] NCCL INFO Channel 01/02 : 0 1
local_host:1648840:1649517 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
local_host:1648840:1649517 [0] NCCL INFO P2P Chunksize set to 131072
local_host:1648840:1649517 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
local_host:1648840:1649517 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
local_host:1648841:1649518 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
local_host:1648841:1649518 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
local_host:1648841:1649518 [1] NCCL INFO Connected all rings
local_host:1648840:1649517 [0] NCCL INFO Connected all rings
local_host:1648840:1649517 [0] NCCL INFO Connected all trees
local_host:1648841:1649518 [1] NCCL INFO Connected all trees
local_host:1648841:1649518 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
local_host:1648841:1649518 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
local_host:1648840:1649517 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
local_host:1648840:1649517 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
local_host:1648840:1649517 [0] NCCL INFO comm 0x55ee7350ba50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 17000 commId 0xb28bd9d738f58bd1 - Init COMPLETE
local_host:1648841:1649518 [1] NCCL INFO comm 0x55c8834a85c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 65000 commId 0xb28bd9d738f58bd1 - Init COMPLETE
[rank1]:[W815 19:12:03.444115389 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 2 - No such file or directory).
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kamuhammad/AI_Projects/Demo_llm/training_multi_GPU.py", line 202, in <module>
[rank1]:     main()
[rank1]:   File "/home/kamuhammad/AI_Projects/Demo_llm/training_multi_GPU.py", line 199, in main
[rank1]:     trainer.train()
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1948, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2095, in _inner_training_loop
[rank1]:     self.model = self.accelerator.prepare(self.model)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1311, in prepare
[rank1]:     result = tuple(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1312, in <genexpr>
[rank1]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1188, in _prepare_one
[rank1]:     return self.prepare_model(obj, device_placement=device_placement)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1485, in prepare_model
[rank1]:     model = FSDP(model, **kwargs)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 483, in __init__
[rank1]:     _auto_wrap(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 102, in _auto_wrap
[rank1]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 544, in _recursive_wrap
[rank1]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank1]:   [Previous line repeated 2 more times]
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 562, in _recursive_wrap
[rank1]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 491, in _wrap
[rank1]:     return wrapper_cls(module, **kwargs)
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank1]:     _init_param_handle_from_module(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 596, in _init_param_handle_from_module
[rank1]:     _sync_module_params_and_buffers(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1094, in _sync_module_params_and_buffers
[rank1]:     _sync_params_and_buffers(
[rank1]:   File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/utils.py", line 326, in _sync_params_and_buffers
[rank1]:     dist._broadcast_coalesced(
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
[rank1]: Exception raised from doWait at …/torch/csrc/distributed/c10d/TCPStore.cpp:570 (most recent call first):
[rank1]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6ac1377f86 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank1]: frame #1: + 0x16583cb (0x7f6aa8c333cb in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6aad2e5b82 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6aad2e6d71 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6aad29b7c1 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId, bool, std::string const&, int) + 0xaf (0x7f6a73bd6f6f in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0x114c (0x7f6a73be2d4c in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x643 (0x7f6a73bef833 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #11: + 0x5cb44e6 (0x7f6aad28f4e6 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #12: + 0x5cbf796 (0x7f6aad29a796 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #13: + 0x52dfa0b (0x7f6aac8baa0b in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #14: + 0x52dd284 (0x7f6aac8b8284 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #15: + 0x1adf2b8 (0x7f6aa90ba2b8 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #16: + 0x5cc46aa (0x7f6aad29f6aa in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #17: + 0x5cd428c (0x7f6aad2af28c in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #18: c10d::broadcast_coalesced(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::ArrayRef<at::Tensor>, unsigned long, int) + 0x7a4 (0x7f6aad2fbbc4 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #19: + 0xd55e0e (0x7f6ac055ee0e in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #20: + 0x4b00e4 (0x7f6abfcb90e4 in /home/kamuhammad/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #21: + 0x15adae (0x55c845dbbdae in /usr/bin/python3)
[rank1]: frame #22: _PyObject_MakeTpCall + 0x25b (0x55c845db252b in /usr/bin/python3)
[rank1]: frame #23: _PyEval_EvalFrameDefault + 0x6f0b (0x55c845dab16b in /usr/bin/python3)
[rank1]: frame #24: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #25: _PyEval_EvalFrameDefault + 0x19b6 (0x55c845da5c16 in /usr/bin/python3)
[rank1]: frame #26: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #27: _PyEval_EvalFrameDefault + 0x6d5 (0x55c845da4935 in /usr/bin/python3)
[rank1]: frame #28: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #29: _PyEval_EvalFrameDefault + 0x6d5 (0x55c845da4935 in /usr/bin/python3)
[rank1]: frame #30: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #31: _PyObject_FastCallDictTstate + 0x16d (0x55c845db176d in /usr/bin/python3)
[rank1]: frame #32: + 0x1657a4 (0x55c845dc67a4 in /usr/bin/python3)
[rank1]: frame #33: + 0x1518db (0x55c845db28db in /usr/bin/python3)
[rank1]: frame #34: PyObject_Call + 0xbb (0x55c845dcae9b in /usr/bin/python3)
[rank1]: frame #35: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #36: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #37: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #38: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #39: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #40: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #41: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #42: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #43: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #44: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #45: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #46: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #47: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #48: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #49: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #50: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #51: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #52: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #53: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #54: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #55: PyObject_Call + 0x122 (0x55c845dcaf02 in /usr/bin/python3)
[rank1]: frame #56: _PyEval_EvalFrameDefault + 0x2a49 (0x55c845da6ca9 in /usr/bin/python3)
[rank1]: frame #57: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #58: _PyEval_EvalFrameDefault + 0x6d5 (0x55c845da4935 in /usr/bin/python3)
[rank1]: frame #59: _PyFunction_Vectorcall + 0x7c (0x55c845dbc6ac in /usr/bin/python3)
[rank1]: frame #60: _PyObject_FastCallDictTstate + 0x16d (0x55c845db176d in /usr/bin/python3)
[rank1]: frame #61: + 0x1657a4 (0x55c845dc67a4 in /usr/bin/python3)
[rank1]: frame #62: + 0x1518db (0x55c845db28db in /usr/bin/python3)
[rank1]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[rank1]: [… the identical rank 1 traceback and C++ stack trace are printed a second time …]
wandb: View run fallen-fog-27 at:
wandb: View project at:
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at:
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See
Sending process 1648840 closing signal SIGTERM
E0815 19:12:12.350000 140204064797120 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 1648841) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/kamuhammad/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1093, in launch_command
    multi_gpu_launcher(args)
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kamuhammad/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training_multi_GPU.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-15_19:12:10
  host      : local_host
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1648841)
  error_file: <N/A>
  traceback : To enable traceback see:
As you can see from the log, the failure happens straight after launch: training never starts, and rank 1 times out waiting on the c10d store while accelerate is wrapping the model in FSDP. Thanks in advance for any pointers.
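To help narrow it down, would running a bare-bones NCCL check like the one below be a sensible way to separate NCCL/CUDA initialization from the Trainer/FSDP stack? This is an untested sketch; the file name and the torchrun invocation in the comments are my assumptions, not something from the failing run.

```python
# nccl_check.py -- minimal NCCL sanity check (untested sketch).
# Intended launch: torchrun --nproc_per_node=2 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # same backend the failing run uses
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # exercises the same communicator setup that times out above
    print(f"rank {dist.get_rank()}: all_reduce OK, value = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```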