[SOLVED] accelerate.Accelerator(): CUDA error: invalid device ordinal

Howdy. At the rather innocuous line of code,

    accelerator = accelerate.Accelerator()

I get the following errors when running on my cluster:

Traceback (most recent call last):
  File "/fsx/shawley/code/RAVE/train_rave_accel.py", line 43, in <module>
    accelerator = accelerate.Accelerator()
  File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/accelerator.py", line 232, in __init__
    self.state = AcceleratorState(
  File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/state.py", line 144, in __init__
    torch.cuda.set_device(self.device)
  File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

When I re-run with export CUDA_LAUNCH_BLOCKING=1, I get… exactly the same set of errors, except without the message about setting CUDA_LAUNCH_BLOCKING=1 for debugging :rofl:.

Any ideas what might be causing this? Typically, with other training scripts, I don't have a problem with the initial accelerate.Accelerator() call. Not sure what I broke. :person_shrugging:

(Perhaps there's an error elsewhere and this set of messages is not indicative of the true error? Perhaps it's a SLURM job/allocation error? Perhaps I keep getting the same bad pod/GPU set?)
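
In case it's useful, here's a quick diagnostic I've been running under the same launcher (my own throwaway script; the filename check_devices.py is just a placeholder) to see what each process thinks it has:

    # check_devices.py -- my own throwaway diagnostic, run via the same launcher,
    # e.g. `accelerate launch check_devices.py`, so it sees the same environment
    import os
    import socket
    import torch

    print(
        f"host={socket.gethostname()} "
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
        f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
        f"device_count={torch.cuda.device_count()}"
    )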

Thanks!

So what it was: I was running this as a batch job, but I was submitting the job from a 'compute' node instead of a 'control' node. Apparently the former had not been set up to provide the right set of information.

10 months later and I'm seeing this error again.

The problem is intermittent. I'm now hitting this error inside an 8-GPU srun session running bash: sometimes accelerate launch ... works fine, but other times, without changing my code, I get the error in the title of this thread.

The error occurs at the line accelerator = Accelerator():

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/accelerate/accelerator.py:361 in __init__

   358 │   │   │   │   │   │   self.fp8_recipe_handler = handler
   359 │   │
   360 │   │   kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {}
❱  361 │   │   self.state = AcceleratorState(
   362 │   │   │   mixed_precision=mixed_precision,
   363 │   │   │   cpu=cpu,
   364 │   │   │   dynamo_plugin=dynamo_plugin,

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/accelerate/state.py:549 in __init__

  546 │   │   if parse_flag_from_env("ACCELERATE_USE_CPU"):
  547 │   │   │   cpu = True
  548 │   │   if PartialState._shared_state == {}:
❱ 549 │   │   │   PartialState(cpu, **kwargs)
  550 │   │   self.__dict__.update(PartialState._shared_state)
  551 │   │   self._check_initialized(mixed_precision, cpu)
  552 │   │   if not self.initialized:

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/accelerate/state.py:149 in __init__

  146 │   │   │   │   self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
  147 │   │   │   │   if self.device is None:
  148 │   │   │   │   │   self.device = torch.device("cuda", self.local_process_index)
❱ 149 │   │   │   │   torch.cuda.set_device(self.device)
  150 │   │   │   elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_S
  151 │   │   │   │   self.distributed_type = DistributedType.MULTI_CPU
  152 │   │   │   │   if is_ccl_available() and get_int_from_env(["CCL_WORKER_COUNT"], 0) > 0:

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/torch/cuda/__init__.py:326 in set_device

  323 │   """
  324 │   device = _get_device_index(device)
  325 │   if device >= 0:
❱ 326 │   │   torch._C._cuda_setDevice(device)
  327
  328
  329 def get_device_name(device: Optional[_device_t] = None) -> str:
RuntimeError: CUDA error: invalid device ordinal
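
Reading the highlighted frames, the failing call boils down to this (my paraphrase of the traceback above, not verbatim Accelerate source):

    # my paraphrase of accelerate/state.py lines 146-149 shown in the traceback above
    import os
    import torch

    local_rank = int(os.environ.get("LOCAL_RANK", -1))  # set per process by the launcher
    device = torch.device("cuda", local_rank)            # e.g. cuda:7 for the 8th process on a node
    torch.cuda.set_device(device)                        # "invalid device ordinal" if local_rank >= torch.cuda.device_count()

So, as far as I can tell, the error means some process ends up with a LOCAL_RANK that points at a GPU index that doesn't exist on (or isn't visible to) that node.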

Running accelerate config doesn't seem to fix it. My '[SOLVED]' title from last year is a misnomer, as I now see it wasn't really a solution. It was more a case of "keep trying nodes until you get one that works".

Would love to have some help understanding WHY this error occurs when using Accelerate, and how to actually fix it.

This would be better as a GitHub issue on the Accelerate repository.

Please fill in the issue form completely and we'll help answer it as best we can. (There's not enough information presented in this thread.)

Thanks @muellerzr. Will do. I was finally able to get a "good node" that would run without this error by adding to the SLURM --exclude list; after about 6 tries it worked.

I'll open the issue to see if we can maybe figure out what distinguishes a "good node" from a "bad node".

This happened to me when running training jobs through SLURM too. My code works with srun (interactive) but not when running a long-running job with sbatch. What's the difference?

[For future reference, as the error message is not very informative by itself.]

In my case, it was a badly configured accelerate config: I had a configuration for 8 GPUs but was launching a job on 4 GPUs.

Specifying the correct number of GPUs in the accelerate launch --num_processes=$NUM_GPUS command fixed it.
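
If it helps, here is a rough guard (my own snippet, not part of Accelerate) that I now put near the top of the training script; it turns the mismatch into a readable error before Accelerator() is created:

    # my own sanity check, not part of Accelerate: fail with a clear message when the
    # launcher starts more processes per node than there are visible GPUs
    import os
    import torch

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    n_visible = torch.cuda.device_count()
    if local_rank >= n_visible:
        raise RuntimeError(
            f"LOCAL_RANK={local_rank} but only {n_visible} GPU(s) visible "
            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r}); "
            "make --num_processes match the GPUs actually allocated per node."
        )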

Hi, I'm hitting a very similar failure with SLURM…
Any solution here? What's your GitHub issue?

Hi jxm, I'm hitting a very similar failure with SLURM…
Any solution here? What's your GitHub issue?

I think in my case I was configuring my job to run on 4 GPUs and it was getting scheduled to run on a 2-GPU machine and then failing.

In my case this was caused by distributed_type: 'MULTI_GPU'. Changing it to distributed_type: 'NO' fixed it for me.