Unexpected error from cudaGetDeviceCount()

Hello, I encountered this error while trying to fine-tune a model on RunPod. The pod description is as follows:
1x RTX 4090 24 GB VRAM, 8 max, 61 GB RAM, 16 vCPU

The full error is as follows:

RuntimeError                              Traceback (most recent call last)
Cell In[3], line 4
      1 from accelerate import FullyShardedDataParallelPlugin, Accelerator
      2 from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
----> 4 fsdp_plugin = FullyShardedDataParallelPlugin(
      5     state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
      6     optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
      7 )
      9 accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
File <string>:18, in __init__(self, sharding_strategy, backward_prefetch, mixed_precision_policy, auto_wrap_policy, cpu_offload, ignored_modules, state_dict_type, state_dict_config, optim_state_dict_config, limit_all_gathers, use_orig_params, param_init_fn, sync_module_states, forward_prefetch, activation_checkpointing)
File /usr/local/lib/python3.10/dist-packages/accelerate/utils/dataclasses.py:1016, in FullyShardedDataParallelPlugin.__post_init__(self)
   1014     device = torch.npu.current_device()
   1015 elif is_cuda_available():
-> 1016     device = torch.cuda.current_device()
   1017 elif is_xpu_available():
   1018     device = torch.xpu.current_device()
File /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:769, in current_device()
    767 def current_device() -> int:
    768     r"""Returns the index of a currently selected device."""
--> 769     _lazy_init()
    770     return torch._C._cuda_getDevice()
File /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:298, in _lazy_init()
    296 if "CUDA_MODULE_LOADING" not in os.environ:
    297     os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 298 torch._C._cuda_init()
    299 # Some of the queued calls may reentrantly call _lazy_init();
    300 # we need to just return without initializing in that case.
    301 # However, we must not let any other threads in!
    302 _tls.is_initializing = True
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

After some research, I still don’t understand the error. Here is some command output that may help:

root@30bf346bbbdf:/# nvidia-smi
Tue Jan 30 15:36:56 2024       
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  Off |
|  0%   30C    P8    28W / 450W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |
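
The table above truncates the GPU name and only shows what the driver reports. As a rough sketch (not part of the original post), the snippet below collects the untruncated GPU name and driver version from nvidia-smi, plus the CUDA version the installed PyTorch wheel was built against, so both can be compared with the "CUDA Version" in the header. The helper names `driver_info` and `torch_cuda_build` are hypothetical, and the code assumes `nvidia-smi` is on the PATH inside the container.

```python
import shutil
import subprocess


def driver_info():
    """Return 'name, driver_version' as reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def torch_cuda_build():
    """Return the CUDA version PyTorch was built with, or None (CPU build or no PyTorch)."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


print("driver:", driver_info())
print("torch CUDA build:", torch_cuda_build())
```

Neither value on its own proves anything about Error 804; they are only extra context to include when reporting the problem to the provider.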

The code I am trying to run from a Jupyter notebook is as follows:

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets matplotlib

from datasets import load_dataset

train_dataset = load_dataset('json', data_files='dataset.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation.jsonl', split='train') 

from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

Any help would be much appreciated, thank you.

I found the solution with the provider's help, and it is unrelated to the accelerate framework. The provider doesn’t guarantee that the GPU will be free even when the container is up and running.
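
Since a running container does not imply a usable GPU, a quick check at the top of the notebook can surface the problem before any accelerate code runs. This is a minimal sketch, not something from the original post: the function name `cuda_status` is hypothetical, and it simply wraps `torch.cuda.current_device()`, the same call that fails in the traceback, to force CUDA initialization.

```python
def cuda_status():
    """Return (ok, detail) describing whether CUDA can actually be initialized."""
    try:
        import torch
    except ImportError:
        return False, "PyTorch is not installed"
    try:
        # Forces CUDA runtime initialization, like the failing call in the traceback.
        index = torch.cuda.current_device()
    except (RuntimeError, AssertionError) as exc:
        # RuntimeError covers driver/runtime failures such as Error 804;
        # AssertionError is raised by CPU-only PyTorch builds.
        return False, f"CUDA initialization failed: {exc}"
    return True, f"CUDA ready on device {index}: {torch.cuda.get_device_name(index)}"


ok, detail = cuda_status()
print(detail)
```

If `ok` is False on a pod that is supposed to have a GPU, the issue is on the infrastructure side, so restarting or redeploying the pod, or contacting the provider, is likely a better next step than changing the training code.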

