Runtime Error: CUDA Initialization

Hi all,

I am trying to run a simple script to analyze the attention scores of the Llama-2-7B-32K model on the Jetstream2 HPC cluster. Until yesterday I was able to run my scripts, but then I suddenly started seeing the following error:

Traceback (most recent call last):
  File "/home/exouser/Squeezed-Attention/offline_clustering.py", line 8, in <module>
    from utils.model_parse import (
  File "/home/exouser/Squeezed-Attention/utils/model_parse.py", line 1, in <module>
    from transformers import AutoModelForCausalLM, LlamaForCausalLM, OPTForCausalLM
  File "<frozen importlib._bootstrap>", line 1055, in _handle_fromlist
  File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1500, in __getattr__
    value = getattr(module, name)
  File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1499, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1511, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
cudaErrorInitializationError: initialization error

I am quite new to this area, so I am not sure of the root cause of this error, and I could not find anything relevant online.

Does anyone have any suggestions on how to tackle this?

1 Like

It seems like there’s a CUDA initialization error. Are you sure you have access to a GPU on the node you’re running on? Try running:

nvidia-smi

If that doesn’t work, you’re likely on a node without GPU or the environment isn’t set up correctly. Make sure to run on a GPU node and check that CUDA and drivers are properly loaded.
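
If nvidia-smi does work, a quick follow-up check from Python can tell you whether PyTorch itself sees the device. A minimal sketch, assuming PyTorch is installed in your environment:

import torch

# Does PyTorch itself see a CUDA device?
print(torch.cuda.is_available())       # expect True on a working GPU node
print(torch.cuda.device_count())       # expect 1 or more
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the A100 on this node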

Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai

1 Like

Hi Adrian,

Thank you for your response. Here is the output of the nvidia-smi command:

(fixedprompt) (base) exouser@possibly-right-crawdad:~/Squeezed-Attention/LongBench$ nvidia-smi
Thu Mar 20 16:58:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:04:00.0 Off |                    0 |
| N/A   25C    P0             52W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

It does seem like I am running on a GPU node. nvcc --version returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

If I understand correctly, there is no incompatibility between the CUDA version and the driver version. However, the command below indicates that CUDA is not available, even though nvidia-smi shows that a GPU is present:

(fixedprompt) (base) exouser@possibly-right-crawdad:~/Squeezed-Attention/LongBench$ python
Python 3.9.21 | packaged by conda-forge | (main, Dec  5 2024, 13:51:40) 
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/home/exouser/miniconda3/envs/fixedprompt/lib/python3.9/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
2 Likes

Mmm, it looks like PyTorch in your environment might not have GPU support. To check, run:

import torch
print(torch.version.cuda)

If it prints None, it means you installed the CPU-only version of PyTorch. In that case, reinstall with GPU support using:

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai

1 Like

Below is the output:

>>> import torch
>>> print(torch.version.cuda)
12.4
>>> print(torch.cuda.is_available())
/home/exouser/miniconda3/envs/fixedprompt/lib/python3.9/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False

It does print the CUDA version, and it is the right version for my environment. However, torch.cuda.is_available() still reports that CUDA is not available.

2 Likes

Mmm, maybe the environment isn’t picking up the right driver. Try this just to be sure:

module purge
module load cuda/12.4
conda activate fixedprompt
python -c "import torch; print(torch.cuda.is_available())"

Sometimes that resets things and makes PyTorch see the GPU.
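
If it still returns False, it can help to dump the version and environment information in one place. A small diagnostic sketch; which environment variables matter on Jetstream2 specifically is an assumption on my part:

import os
import torch

# What PyTorch was built against vs. what the runtime reports.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# Environment variables that commonly hide or break GPU visibility.
for var in ("CUDA_VISIBLE_DEVICES", "CUDA_HOME", "LD_LIBRARY_PATH"):
    print(var, "=", os.environ.get(var))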


Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai

1 Like

I ran those commands, but I still see the same error. I also tried restarting from scratch: creating a new conda environment, installing the packages, and then running the script. The first couple of times it actually ran fine without the initialization error, but at a certain point the error started appearing again. The only thing I noticed was that the RAM usage on the cluster's root disk spiked, but that is all. The same thing happened when I created a new Python script and shell script to perform the attention analysis. I do not know if this is related, however.

1 Like

The most common reasons for this are that the default version of a library on the server side has changed, or that a new, mismatched version of a library was installed when the server was restarted.

The most suspicious thing is always PyTorch, but you might want to try downgrading Transformers, etc. as well.

pip install transformers==4.48.3
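
Before downgrading, it may be worth taking a snapshot of what is currently installed so you can compare later. A quick sketch; the package list here is just an example:

from importlib.metadata import PackageNotFoundError, version

# Record the versions of the usual suspects before changing anything.
for pkg in ("torch", "transformers", "tokenizers", "accelerate", "numpy"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
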
2 Likes

I am currently using transformers 4.40.dev. Could this version be the issue?

1 Like

No. If you’ve already pinned that version, the more likely culprit is some other library whose version isn’t pinned. If it’s one of the libraries around Hugging Face’s Transformers, try:

pip install -U transformers peft accelerate huggingface_hub "numpy<2" sentencepiece torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124

Also, this is probably not related to anything outside of Hugging Face, but there have been cases recently where re-specifying the HF_HOME environment variable made things work. That is because libraries such as Transformers and huggingface_hub read it.
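
If you set it from Python, it has to happen before the first transformers / huggingface_hub import, since they read HF_HOME at import time. A minimal sketch; the path just mirrors the default location:

import os

# Must be set before transformers / huggingface_hub are imported,
# because they resolve HF_HOME when they are first loaded.
os.environ["HF_HOME"] = os.path.expanduser("~/.cache/huggingface")

from transformers import AutoModelForCausalLM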

Oh, but the .dev version may be a bit suspicious, since there can be multiple different builds carrying the same version number.
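
Since your traceback shows transformers imported from a local checkout, one way to pin down exactly which .dev build is in use is to print the commit of that checkout. A sketch; it assumes the checkout is a git repository and that git is on PATH:

import os
import subprocess
import transformers

# A .dev version string alone (e.g. 4.40.0.dev0) does not identify the
# build; the commit of the checkout does.
repo_dir = os.path.dirname(transformers.__file__)
print("version:", transformers.__version__)
result = subprocess.run(
    ["git", "-C", repo_dir, "rev-parse", "HEAD"],
    capture_output=True, text=True,
)
print("commit:", result.stdout.strip())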

The error still persists after re-installing the packages and re-setting HF_HOME to ~/.cache/huggingface.

1 Like

Hmmm… perhaps it’s a Jetstream2 HPC cluster issue…?

I am beginning to think it is a cluster issue, but I am not sure. When I create a new instance, I can run the code without this issue for a while, and then at some point it just starts running into the initialization error.

1 Like

Hmm…