I’m encountering the following error:
Traceback (most recent call last):
File "/data/miranebr-sandbox/AI4Lean/py_src/train/sft/sft_train.py", line 244, in <module>
fire.Fire(main)
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/AI4Lean/py_src/train/sft/sft_train.py", line 222, in main
trainer.train()
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 3318, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 3363, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 190, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/comm.py", line 58, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
2%|██▏ | 750/30030 [1:26:07<56:02:25, 6.89s/it]
I can rerun the code with NCCL_DEBUG=INFO, but it takes about 1.5 hours to hit the same error, and so far no web search has helped. Has anyone already seen this bug and can help the community?
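To avoid changing the launch command, a minimal sketch of how I'd set the flag is to export the NCCL debug variables at the very top of sft_train.py, before torch initializes NCCL (NCCL_DEBUG and NCCL_DEBUG_FILE are standard NCCL environment variables; the log path below is just an example):

```python
import os

# Must run before torch creates any NCCL communicators.
os.environ["NCCL_DEBUG"] = "INFO"                      # print NCCL init/error details
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl.%h.%p.log"  # optional: per-host/per-pid log files
```

That way the debug output from the eventual crash is captured in a file instead of being lost in the console scrollback.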
For now I will restart or kill my Kubernetes pods / Docker containers and hope that fixes it (a quicker sanity check I'm considering is sketched below).
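The quicker check I have in mind (an assumption on my side, not a confirmed fix): the traceback dies in torch.nn.DataParallel's broadcast during replicate(), so pinning the run to a single visible GPU should skip that path entirely and tell me whether the CUDA error is specific to the multi-GPU broadcast:

```python
import os

# Must be set before torch initializes CUDA; with only one visible GPU the
# Hugging Face Trainer no longer wraps the model in DataParallel, so the
# broadcast_coalesced call from the traceback is never reached.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```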
Does anyone have a more efficient solution?