[SOLVED] accelerate.Accelerator(): CUDA error: invalid device ordinal

Howdy. At the rather innocuous line of code,

    accelerator = accelerate.Accelerator()

I get the following errors when running on my cluster:

    Traceback (most recent call last):
      File "/fsx/shawley/code/RAVE/train_rave_accel.py", line 43, in <module>
        accelerator = accelerate.Accelerator()
      File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/accelerator.py", line 232, in __init__
        self.state = AcceleratorState(
      File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/state.py", line 144, in __init__
        torch.cuda.set_device(self.device)
      File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(Each launched process prints the same traceback, so the raw output is interleaved; the above is a single copy.)

When I re-run with export CUDA_LAUNCH_BLOCKING=1, I get… exactly the same set of errors except without the message about setting CUDA_LAUNCH_BLOCKING=1 for debugging :rofl: .

Any ideas what might be causing this? Typically, with other training scripts, I don’t have a problem with the initial accelerate.Accelerator() call. Not sure what I broke. :person_shrugging:

(Perhaps there’s an error elsewhere and this set of messages is not indicative of the true error? Perhaps it’s a SLURM job/allocation error? Perhaps I keep getting the same bad pod/GPU set?)
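For reference, "invalid device ordinal" means torch.cuda.set_device() was asked for a GPU index that doesn't exist on that node, which usually means the launcher handed a process a local rank larger than the number of GPUs actually visible to it. Here is a minimal sketch of that mismatch; it assumes a standard torchrun/SLURM-style launch where LOCAL_RANK and CUDA_VISIBLE_DEVICES select the GPU (the helper names are hypothetical, not from the original script):

```python
def visible_gpu_count(env: dict) -> int:
    """Count GPUs exposed via CUDA_VISIBLE_DEVICES in the given environment."""
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in visible.split(",") if d])

def rank_is_valid(env: dict) -> bool:
    """True if this process's LOCAL_RANK maps to a GPU that actually exists."""
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return local_rank < visible_gpu_count(env)

# A healthy 4-GPU allocation: rank 3 gets GPU 3.
print(rank_is_valid({"LOCAL_RANK": "3", "CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # True
# A broken allocation: rank 3 launched, but only one GPU exposed, so
# torch.cuda.set_device(3) inside Accelerator() raises "invalid device ordinal".
print(rank_is_valid({"LOCAL_RANK": "3", "CUDA_VISIBLE_DEVICES": "0"}))        # False
```

In a real run you would compare os.environ and torch.cuda.device_count() on each rank; printing both at startup is a quick way to see whether the allocation matches what the launcher expects.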

Thanks!

It turned out that I was running this as a batch job, but submitting it from a ‘compute’ node instead of a ‘control’ node. Apparently the compute nodes had not been set up to provide the right set of environment information to the job.