Howdy. At the rather innocuous line of code,
accelerator = accelerate.Accelerator()
I get the following errors when running on my cluster:
Traceback (most recent call last):
File "/fsx/shawley/code/RAVE/train_rave_accel.py", line 43, in <module>
accelerator = accelerate.Accelerator()
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/accelerator.py", line 232, in __init__
accelerator = accelerate.Accelerator()
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/accelerator.py", line 232, in __init__
accelerator = accelerate.Accelerator()
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/accelerator.py", line 232, in __init__
self.state = AcceleratorState(
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/state.py", line 144, in __init__
self.state = AcceleratorState(
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/state.py", line 144, in __init__
self.state = AcceleratorState(
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/accelerate/state.py", line 144, in __init__
torch.cuda.set_device(self.device)
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
Traceback (most recent call last):
File "/fsx/shawley/code/RAVE/train_rave_accel.py", line 43, in <module>
torch.cuda.set_device(self.device)
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
torch.cuda.set_device(self.device)
File "/home/shawley/envs/shazbot/lib64/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
accelerator = accelerate.Accelerator()torch._C._cuda_setDevice(device)
When I re-run with export CUDA_LAUNCH_BLOCKING=1, I get... exactly the same set of errors, except without the message about setting CUDA_LAUNCH_BLOCKING=1 for debugging.
Any ideas what might be causing this? Typically, with other training codes, I don't have a problem with the initial accelerate.Accelerator() call. Not sure what I broke.
(Perhaps there's an error elsewhere and this set of messages is not indicative of the true error? Perhaps it's a SLURM job/allocation error? Perhaps I keep getting the same bad pod/GPU set?)
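For what it's worth, "invalid device ordinal" generally means a process asked for a GPU index that is not among the devices visible to it, so one way to test the SLURM/allocation guess is to print what each process actually sees just before the Accelerator() call. Below is a minimal sketch, not something from the failing run: the helper names are made up for illustration, it assumes the launcher exports LOCAL_RANK, and the SLURM variables listed may or may not be set on a given cluster.

```python
import os


def ordinal_is_valid(local_rank, visible_gpus):
    """True if local_rank is a usable CUDA device index given visible_gpus devices."""
    return 0 <= local_rank < visible_gpus


def describe_environment(env=os.environ):
    """Collect the variables that usually decide which GPU a process tries to claim.

    Any variable that is not set comes back as None.
    """
    keys = (
        "CUDA_VISIBLE_DEVICES",  # restricts which GPUs this process can see
        "LOCAL_RANK",            # per-node process index set by the launcher (assumed)
        "SLURM_LOCALID",         # SLURM's per-node task index
        "SLURM_JOB_GPUS",        # GPUs SLURM allocated to the job, if exported
    )
    return {key: env.get(key) for key in keys}


if __name__ == "__main__":
    try:
        import torch
        visible = torch.cuda.device_count()
    except ImportError:
        visible = 0  # torch not installed here; treat as no visible GPUs

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    print(describe_environment())
    print(f"local_rank={local_rank} visible_gpus={visible} "
          f"valid={ordinal_is_valid(local_rank, visible)}")
```

If some processes report a local rank greater than or equal to the number of visible GPUs, the mismatch is probably between the number of processes launched per node and the GPUs the job was actually allocated, rather than anything in the training code itself.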
Thanks!