Hello, I encountered this error while trying to fine-tune a model on RunPod. The pod description is as follows:
1x RTX 4090, 24 GB VRAM, 8 max, 61 GB RAM, 16 vCPU
The full error is as follows:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[3], line 4
1 from accelerate import FullyShardedDataParallelPlugin, Accelerator
2 from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
----> 4 fsdp_plugin = FullyShardedDataParallelPlugin(
5 state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
6 optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
7 )
9 accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
File <string>:18, in __init__(self, sharding_strategy, backward_prefetch, mixed_precision_policy, auto_wrap_policy, cpu_offload, ignored_modules, state_dict_type, state_dict_config, optim_state_dict_config, limit_all_gathers, use_orig_params, param_init_fn, sync_module_states, forward_prefetch, activation_checkpointing)
File /usr/local/lib/python3.10/dist-packages/accelerate/utils/dataclasses.py:1016, in FullyShardedDataParallelPlugin.__post_init__(self)
1014 device = torch.npu.current_device()
1015 elif is_cuda_available():
-> 1016 device = torch.cuda.current_device()
1017 elif is_xpu_available():
1018 device = torch.xpu.current_device()
File /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:769, in current_device()
767 def current_device() -> int:
768 r"""Returns the index of a currently selected device."""
--> 769 _lazy_init()
770 return torch._C._cuda_getDevice()
File /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:298, in _lazy_init()
296 if "CUDA_MODULE_LOADING" not in os.environ:
297 os.environ["CUDA_MODULE_LOADING"] = "LAZY"
--> 298 torch._C._cuda_init()
299 # Some of the queued calls may reentrantly call _lazy_init();
300 # we need to just return without initializing in that case.
301 # However, we must not let any other threads in!
302 _tls.is_initializing = True
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
From my research I still don't understand the error. Here is some command output that may help:
root@30bf346bbbdf:/# nvidia-smi
Tue Jan 30 15:36:56 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:81:00.0 Off | Off |
| 0% 30C P8 28W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
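For reference, here is a minimal check (PyTorch only, no accelerate) that, as far as I understand, should show whether CUDA can be initialized at all inside the container, since the traceback above fails inside torch._C._cuda_init():

import torch

# Compare what PyTorch was built with against what nvidia-smi reports above
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())  # returns False instead of raising

try:
    torch.cuda.init()  # forces the same lazy CUDA initialization that fails in the traceback
    print("device:", torch.cuda.get_device_name(0))
except RuntimeError as e:
    print("CUDA init failed:", e)

If the problem is the container/driver combination rather than my code, I would expect this to fail with the same Error 804 message.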
The code I am trying to run from a Jupyter notebook is as follows:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets matplotlib

# Load the training and validation sets from local JSONL files
from datasets import load_dataset
train_dataset = load_dataset('json', data_files='dataset.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation.jsonl', split='train')

from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# Offload full model/optimizer state dicts to CPU on every rank
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
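For context, once the Accelerator is created the plan is the usual accelerator.prepare call, roughly like this (the objects below are toy placeholders, not my actual model or data):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins only to illustrate the prepare() call; the real notebook uses a
# transformers model and the tokenized datasets loaded above
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(8, 16), torch.randint(0, 2, (8,))), batch_size=4)

# accelerator.prepare wraps the model, optimizer and dataloader for (FSDP) training
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

The error happens before any of that, at the FullyShardedDataParallelPlugin cell.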
Any help is appreciated, thank you!