[SOLVED] accelerate.Accelerator(): CUDA error: invalid device ordinal

Howdy. At the rather innocuous line of code,

    accelerator = accelerate.Accelerator()

I get the following errors when running on my cluster:

Traceback (most recent call last):
  File "/fsx/shawley/code/RAVE/train_rave_accel.py", line 43, in <module>
    accelerator = accelerate.Accelerator()
  File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/accelerator.py", line 232, in __init__
    self.state = AcceleratorState(
  File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/state.py", line 144, in __init__
    torch.cuda.set_device(self.device)
  File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

When I re-run with export CUDA_LAUNCH_BLOCKING=1, I get… exactly the same set of errors, except without the message about setting CUDA_LAUNCH_BLOCKING=1 for debugging :rofl:.

Any ideas what might be causing this? Typically, with other training scripts, I don't have a problem with the initial accelerate.Accelerator() call. Not sure what I broke. :person_shrugging:

(Perhaps there's an error elsewhere and this set of messages is not indicative of the true error? Perhaps it's a SLURM job/allocation error? Perhaps I keep getting the same bad pod/GPU set?)
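
In case it's useful, here's a quick diagnostic I've been running under the same launcher (my own throwaway script; the filename check_devices.py is just a placeholder) to see what each process thinks it has:

    # check_devices.py -- my own throwaway diagnostic, run via the same launcher,
    # e.g. `accelerate launch check_devices.py`, so it sees the same environment
    import os
    import socket
    import torch

    print(
        f"host={socket.gethostname()} "
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
        f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
        f"device_count={torch.cuda.device_count()}"
    )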

Thanks!

So what it was: I was running this as a batch job, but I was submitting the job from a 'compute' node instead of a 'control' node. Apparently the former had not been set up to provide the right set of information.

10 months later and I'm seeing this error again.

The problem is intermittent. I'm now hitting this error inside an 8-GPU srun session running bash: sometimes accelerate launch ... works fine, but other times, without changing my code, I get the error in the title of this thread.

The error occurs at the line accelerator = Accelerator():

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/accelerate/accelerator.py:361 in __init__

   358 │   │   │   │   │   │   self.fp8_recipe_handler = handler
   359 │   │
   360 │   │   kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {}
❱  361 │   │   self.state = AcceleratorState(
   362 │   │   │   mixed_precision=mixed_precision,
   363 │   │   │   cpu=cpu,
   364 │   │   │   dynamo_plugin=dynamo_plugin,

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/accelerate/state.py:549 in __init__

  546 │   │   if parse_flag_from_env("ACCELERATE_USE_CPU"):
  547 │   │   │   cpu = True
  548 │   │   if PartialState._shared_state == {}:
❱ 549 │   │   │   PartialState(cpu, **kwargs)
  550 │   │   self.__dict__.update(PartialState._shared_state)
  551 │   │   self._check_initialized(mixed_precision, cpu)
  552 │   │   if not self.initialized:

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/accelerate/state.py:149 in __init__

  146 │   │   │   │   self.local_process_index = int(os.environ.get("LOCAL_RANK", -1))
  147 │   │   │   │   if self.device is None:
  148 │   │   │   │   │   self.device = torch.device("cuda", self.local_process_index)
❱ 149 │   │   │   │   torch.cuda.set_device(self.device)
  150 │   │   │   elif get_int_from_env(["PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "MV2_COMM_WORLD_S
  151 │   │   │   │   self.distributed_type = DistributedType.MULTI_CPU
  152 │   │   │   │   if is_ccl_available() and get_int_from_env(["CCL_WORKER_COUNT"], 0) > 0:

/admin/home-shawley/envs/aa/lib/python3.9/site-packages/torch/cuda/__init__.py:326 in set_device

  323 │   """
  324 │   device = _get_device_index(device)
  325 │   if device >= 0:
❱ 326 │   │   torch._C._cuda_setDevice(device)
  327
  328
  329 def get_device_name(device: Optional[_device_t] = None) -> str:
RuntimeError: CUDA error: invalid device ordinal
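
Reading the highlighted frames, the failing call boils down to this (my paraphrase of the traceback above, not verbatim Accelerate source):

    # my paraphrase of accelerate/state.py lines 146-149 shown in the traceback above
    import os
    import torch

    local_rank = int(os.environ.get("LOCAL_RANK", -1))  # set per process by the launcher
    device = torch.device("cuda", local_rank)            # e.g. cuda:7 for the 8th process on a node
    torch.cuda.set_device(device)                        # "invalid device ordinal" if local_rank >= torch.cuda.device_count()

So, as far as I can tell, the error means some process ends up with a LOCAL_RANK that points at a GPU index that doesn't exist on (or isn't visible to) that node.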

Running accelerate config doesn't seem to fix it. My '[SOLVED]' title from last year is a misnomer, as I now see it wasn't really a solution. It was more a case of "keep trying nodes until you get one that works".

Would love to have some help understanding WHY this error occurs when using Accelerate, and how to actually fix it.

This would be better as a GitHub issue on the Accelerate repository.

Please fill in the issue form completely and we'll help answer it as best we can. (There's not enough information presented in this thread.)

Thanks @muellerzr. Will do. I was finally able to get a "good node" that would run without this error by adding to the SLURM --exclude list; after about 6 tries it worked.

I'll open the issue to see if we can maybe figure out what distinguishes a "good node" from a "bad node".

This happened to me when running training jobs through SLURM too. My code works with srun (interactive) but not when running a long-running job with sbatch. What's the difference?

[For future reference, as the error message is not very informative by itself.]

In my case, it was a badly configured accelerate config: I had a configuration for 8 GPUs but was launching a job on 4 GPUs.

Specifying the correct number of GPUs in the accelerate launch --num_processes=$NUM_GPUS command fixed it.
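
If it helps, here is a rough guard (my own snippet, not part of Accelerate) that I now put near the top of the training script; it turns the mismatch into a readable error before Accelerator() is created:

    # my own sanity check, not part of Accelerate: fail with a clear message when the
    # launcher starts more processes per node than there are visible GPUs
    import os
    import torch

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    n_visible = torch.cuda.device_count()
    if local_rank >= n_visible:
        raise RuntimeError(
            f"LOCAL_RANK={local_rank} but only {n_visible} GPU(s) visible "
            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r}); "
            "make --num_processes match the GPUs actually allocated per node."
        )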

Hi, I'm hitting a very similar failure with SLURM…
Any solution here? What's your GitHub issue?

Hi jxm, I'm hitting a very similar failure with SLURM…
Any solution here? What's your GitHub issue?

I think in my case I was configuring my job to run on 4 GPUs and it was getting scheduled to run on a 2-GPU machine and then failing.

In my case this was caused by distributed_type: 'MULTI_GPU'. Changing it to distributed_type: 'NO' fixed it for me.