Help solving RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details) in Kubernetes pods

I’m encountering the following error:

Traceback (most recent call last):
  File "/data/miranebr-sandbox/AI4Lean/py_src/train/sft/sft_train.py", line 244, in <module>
    fire.Fire(main)
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/AI4Lean/py_src/train/sft/sft_train.py", line 222, in main
    trainer.train()
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/transformers/trainer.py", line 3363, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 190, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miranebr-sandbox/.virtualenvs/AI4Lean/lib/python3.11/site-packages/torch/nn/parallel/comm.py", line 58, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
  2%|██▏                                                                                     | 750/30030 [1:26:07<56:02:25,  6.89s/it]

I can rerun the code with that flag, but it takes about 1.5 hours to hit the same error again, and so far no web search has helped. Perhaps someone has already seen this bug and can help the community?
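
In case it helps, this is roughly how I plan to set the flag before the next run, so the extra NCCL logs are already there when the error reappears (a minimal sketch; where exactly it goes at the top of sft_train.py is up to you):

```python
import os

# NCCL reads these environment variables when its communicators are created,
# so they need to be set in the process before the first broadcast/collective
# happens (i.e. before training starts).
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")  # optional: more verbose per-subsystem output

import torch  # import after setting the env vars, just to be safe
```

Alternatively, the variable can be set in the pod spec (or the Docker run command) so every process inherits it.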

For now I will restart or kill my Kubernetes pods / Docker containers and hope that fixes it.

Does anyone have a more efficient solution?

NCCL errors are a bit tricky to deal with. I usually got them because my PyTorch build was not compatible with my CUDA drivers.

You could double-check those versions, and set the NCCL_DEBUG=INFO flag so the next run leaves more clues in your logs.
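
For example, a quick check like this (just a sketch) prints what your PyTorch wheel was built against, which you can compare with the driver version that nvidia-smi reports on the node:

```python
import torch

print("torch:", torch.__version__)                  # e.g. 2.4.0+cu121
print("built for CUDA:", torch.version.cuda)        # CUDA toolkit the wheel targets
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
print("NCCL bundled:", torch.cuda.nccl.version())   # NCCL version shipped with this build
```

If the CUDA version the wheel targets is newer than what the node's driver supports, a mismatch like that is a plausible cause of these "unhandled cuda error" messages.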