Runtime Error: CUDA Initialization

Hi all,

I am trying to run a simple script to analyze the attention scores of the Llama-2-7B-32K model on the Jetstream2 HPC cluster. Until yesterday I was able to run my scripts, but then I suddenly started seeing the following error:

Traceback (most recent call last):
  File "/home/exouser/Squeezed-Attention/offline_clustering.py", line 8, in <module>
    from utils.model_parse import (
  File "/home/exouser/Squeezed-Attention/utils/model_parse.py", line 1, in <module>
    from transformers import AutoModelForCausalLM, LlamaForCausalLM, OPTForCausalLM
  File "<frozen importlib._bootstrap>", line 1055, in _handle_fromlist
  File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1500, in __getattr__
    value = getattr(module, name)
  File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1499, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1511, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
cudaErrorInitializationError: initialization error

I am quite new to this area, so I am not sure of the root cause of this error, and I could not find anything relevant online.

Does anyone have any suggestions on how to tackle this?

1 Like

It seems like there’s a CUDA initialization error. Are you sure you have access to a GPU on the node you’re running on? Try running:

nvidia-smi

If that doesn’t work, you’re likely on a node without GPU or the environment isn’t set up correctly. Make sure to run on a GPU node and check that CUDA and drivers are properly loaded.
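
If nvidia-smi does work, a quick follow-up check from Python can tell you whether PyTorch itself sees the device. A minimal sketch, assuming PyTorch is installed in your environment:

import torch

# Does PyTorch itself see a CUDA device?
print(torch.cuda.is_available())       # expect True on a working GPU node
print(torch.cuda.device_count())       # expect 1 or more
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the A100 on this node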

Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai

1 Like

Hi Adrian,

Thank you for your response. Here is the output of the nvidia-smi command:

(fixedprompt) (base) exouser@possibly-right-crawdad:~/Squeezed-Attention/LongBench$ nvidia-smi
Thu Mar 20 16:58:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:04:00.0 Off |                    0 |
| N/A   25C    P0             52W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

It does seem like I am running on a GPU node. nvcc --version returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

If I understand correctly, there is no incompatibility between the CUDA version and the driver version. However, the command below indicates that CUDA is not available, even though nvidia-smi shows that a GPU is present:

(fixedprompt) (base) exouser@possibly-right-crawdad:~/Squeezed-Attention/LongBench$ python
Python 3.9.21 | packaged by conda-forge | (main, Dec  5 2024, 13:51:40) 
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/home/exouser/miniconda3/envs/fixedprompt/lib/python3.9/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
2 Likes

Mmm, it looks like PyTorch in your environment might not have GPU support. To check, run:

import torch
print(torch.version.cuda)

If it prints None, it means you installed the CPU-only version of PyTorch. In that case, reinstall with GPU support using:

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai

1 Like

Below is the output:

>>> import torch
>>> print(torch.version.cuda)
12.4
>>> print(torch.cuda.is_available())
/home/exouser/miniconda3/envs/fixedprompt/lib/python3.9/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False

It does print the CUDA version, and it is the right version for my environment. However, torch.cuda.is_available() still reports that CUDA is not available.

2 Likes

Mmm, maybe the environment isn’t picking up the right driver. Try this just to be sure:

module purge
module load cuda/12.4
conda activate fixedprompt
python -c "import torch; print(torch.cuda.is_available())"

Sometimes that resets things and makes PyTorch see the GPU.
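
If it still returns False, it can help to dump the version and environment information in one place. A small diagnostic sketch; which environment variables matter on Jetstream2 specifically is an assumption on my part:

import os
import torch

# What PyTorch was built against vs. what the runtime reports.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# Environment variables that commonly hide or break GPU visibility.
for var in ("CUDA_VISIBLE_DEVICES", "CUDA_HOME", "LD_LIBRARY_PATH"):
    print(var, "=", os.environ.get(var))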


Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai

1 Like

I ran those commands, but I still see the same error. I also tried restarting from scratch: creating a new conda environment, installing the packages, and then running the script. The first couple of times it actually ran fine without the initialization error, but at a certain point the error started appearing again. The only thing I noticed was that the RAM usage on the cluster's root disk spiked, but that is all. The same thing happened when I created a new Python script and shell script to perform the attention analysis. I do not know if this is related, however.

1 Like

The most common reasons for this are that the default version of a library on the server side has changed, or that a new, mismatched version of a library was installed when the server was restarted.

The most suspicious thing is always PyTorch, but you might want to try downgrading Transformers, etc. as well.

pip install transformers==4.48.3
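
Before downgrading, it may be worth taking a snapshot of what is currently installed so you can compare later. A quick sketch; the package list here is just an example:

from importlib.metadata import PackageNotFoundError, version

# Record the versions of the usual suspects before changing anything.
for pkg in ("torch", "transformers", "tokenizers", "accelerate", "numpy"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
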
2 Likes

I am currently using transformers 4.40.dev. Could this version be the issue?

1 Like

No. If you’ve already pinned that version, the more likely culprit is some other library whose version isn’t pinned. If it’s one of the libraries around Hugging Face’s Transformers, try:

pip install -U transformers peft accelerate huggingface_hub "numpy<2" sentencepiece torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124

Also, this is probably not related to anything outside of Hugging Face, but there have been cases recently where re-specifying the HF_HOME environment variable made things work. That is because libraries such as Transformers and huggingface_hub read it.
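
If you set it from Python, it has to happen before the first transformers / huggingface_hub import, since they read HF_HOME at import time. A minimal sketch; the path just mirrors the default location:

import os

# Must be set before transformers / huggingface_hub are imported,
# because they resolve HF_HOME when they are first loaded.
os.environ["HF_HOME"] = os.path.expanduser("~/.cache/huggingface")

from transformers import AutoModelForCausalLM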

Oh, but the .dev version may be a bit suspicious, since there can be multiple different builds carrying the same version number.
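
Since your traceback shows transformers imported from a local checkout, one way to pin down exactly which .dev build is in use is to print the commit of that checkout. A sketch; it assumes the checkout is a git repository and that git is on PATH:

import os
import subprocess
import transformers

# A .dev version string alone (e.g. 4.40.0.dev0) does not identify the
# build; the commit of the checkout does.
repo_dir = os.path.dirname(transformers.__file__)
print("version:", transformers.__version__)
result = subprocess.run(
    ["git", "-C", repo_dir, "rev-parse", "HEAD"],
    capture_output=True, text=True,
)
print("commit:", result.stdout.strip())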

The error still persists after re-installing the packages and re-setting HF_HOME to ~/.cache/huggingface.

1 Like

Hmmm… perhaps it’s a Jetstream2 HPC cluster issue…?

I am beginning to think it is a cluster issue, but I am not sure. When I create a new instance, I can run the code without this issue for a while, and then at some point it just starts running into the initialization error.

1 Like

Hmm…