Help solving RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details) in Kubernetes pods

I’m encountering the following error:

Traceback (most recent call last):
  File "/data/miranebr-sandbox/AI4Lean/py_src/train/sft/sft_train.py", line 244, in <module>
    fire.Fire(main)
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/AI4Lean/py_src/train/sft/sft_train.py", line 222, in main
    trainer.train()
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 3363, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 190, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/comm.py", line 58, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
  2%|██▏                                                                                     | 750/30030 [1:26:07<56:02:25,  6.89s/it]

I can rerun the code with that flag, but it takes about 1.5 hours to hit the same error again, and so far no web search has helped. Perhaps someone has already seen this bug and can help the community?
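
In case it helps, this is roughly how I plan to set the flag before the next run, so the extra NCCL logs are already there when the error reappears (a minimal sketch; where exactly it goes at the top of sft_train.py is up to you):

```python
import os

# NCCL reads these environment variables when its communicators are created,
# so they need to be set in the process before the first broadcast/collective
# happens (i.e. before training starts).
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")  # optional: more verbose per-subsystem output

import torch  # import after setting the env vars, just to be safe
```

Alternatively, the variable can be set in the pod spec (or the Docker run command) so every process inherits it.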

For now I will restart or kill my Kubernetes pods / Docker containers and hope that fixes it.

Does anyone have a more efficient solution?

NCCL errors are a bit tricky to deal with. I usually got them because my PyTorch build was not compatible with my CUDA drivers.

You could double-check those versions, and set the NCCL_DEBUG=INFO flag so the next run leaves more clues in your logs.
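
For example, a quick check like this (just a sketch) prints what your PyTorch wheel was built against, which you can compare with the driver version that nvidia-smi reports on the node:

```python
import torch

print("torch:", torch.__version__)                  # e.g. 2.4.0+cu121
print("built for CUDA:", torch.version.cuda)        # CUDA toolkit the wheel targets
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
print("NCCL bundled:", torch.cuda.nccl.version())   # NCCL version shipped with this build
```

If the CUDA version the wheel targets is newer than what the node's driver supports, a mismatch like that is a plausible cause of these "unhandled cuda error" messages.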