I am trying to run a simple script to analyze attention scores of the Llama-2-7B-32K model on the Jetstream2 HPC cluster. Until yesterday I was able to run my scripts, but then I suddenly started seeing the following error:
Traceback (most recent call last):
File "/home/exouser/Squeezed-Attention/offline_clustering.py", line 8, in <module>
from utils.model_parse import (
File "/home/exouser/Squeezed-Attention/utils/model_parse.py", line 1, in <module>
from transformers import AutoModelForCausalLM, LlamaForCausalLM, OPTForCausalLM
File "<frozen importlib._bootstrap>", line 1055, in _handle_fromlist
File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1500, in __getattr__
value = getattr(module, name)
File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1499, in __getattr__
module = self._get_module(self._class_to_module[name])
File "/home/exouser/Squeezed-Attention/transformers/src/transformers/utils/import_utils.py", line 1511, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
cudaErrorInitializationError: initialization error
I am quite new to this area, so I am not sure of the root cause of this error, and I could not find anything relevant online.
Does anyone have any suggestions on how to tackle this?
It seems like there’s a CUDA initialization error. Are you sure you have access to a GPU on the node you’re running on? Try running:
nvidia-smi
If that doesn’t work, you’re likely on a node without a GPU, or the environment isn’t set up correctly. Make sure you run on a GPU node and check that CUDA and the drivers are properly loaded.
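You can also do a quick sanity check from Python; a minimal sketch, assuming PyTorch is installed in the active environment:

# Minimal sanity check, assuming PyTorch is installed in the active environment
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))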
Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai
Thank you for your response. Here is the output of the nvidia-smi command:
(fixedprompt) (base) exouser@possibly-right-crawdad:~/Squeezed-Attention/LongBench$ nvidia-smi
Thu Mar 20 16:58:13 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:04:00.0 Off | 0 |
| N/A 25C P0 52W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
It does seem like I am running this on a GPU node. nvcc --version returns
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
If I understand correctly, there is no incompatibility between the CUDA version and the driver version. However, the command below indicates that CUDA is not available, even though nvidia-smi shows that a GPU is available:
(fixedprompt) (base) exouser@possibly-right-crawdad:~/Squeezed-Attention/LongBench$ python
Python 3.9.21 | packaged by conda-forge | (main, Dec 5 2024, 13:51:40)
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/home/exouser/miniconda3/envs/fixedprompt/lib/python3.9/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
>>> import torch
>>> print(torch.version.cuda)
12.4
>>> print(torch.cuda.is_available())
/home/exouser/miniconda3/envs/fixedprompt/lib/python3.9/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
It does print the CUDA version installed in my environment, and it is the right version. However, torch.cuda.is_available() still returns False.
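For reference, forcing CUDA initialization directly surfaces the underlying driver error instead of just returning False; a minimal sketch:

# Force CUDA context creation so the real driver error is raised instead of a silent False
import torch

try:
    torch.cuda.init()  # raises RuntimeError if the driver cannot initialize
    print("CUDA initialized, devices:", torch.cuda.device_count())
except RuntimeError as e:
    print("CUDA init failed:", e)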
I ran the suggested command but I still see the same error. I also tried restarting from scratch: creating a new conda environment, installing the packages, and then running the script. For the first couple of runs it actually worked fine without the initialization error, but at a certain point I started seeing it again. The only thing I noticed was that RAM usage on the cluster's root disk spiked, but that is the only change. The same thing also happened when I created a new Python script and shell script to perform the attention analysis. I do not know if this is related, however.
The most common reasons for this are that the default version of a library on the server side has changed, or that a mismatched newer version of a library was installed when the server was restarted.
The most suspicious candidate is always PyTorch, but you might want to try downgrading Transformers, etc. as well.
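If it helps, here is a minimal sketch for recording the versions that matter for this kind of mismatch, so you can compare the environment before and after the problem appears (assuming both torch and transformers import cleanly):

# Record the library versions relevant to this kind of mismatch
import torch
import transformers

print("torch:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)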
No, if you’ve already pinned the version, then the more suspicious thing is some other library that isn’t pinned getting updated. If it’s a library from around Hugging Face’s Transformers…
Also, this probably doesn’t apply to anything outside the Hugging Face stack, but there have been cases recently where things work again after re-specifying the HF_HOME environment variable, because libraries such as Transformers and huggingface_hub refer to it.
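A minimal sketch of what I mean, with a placeholder cache path (set it before anything from transformers is imported):

# Point HF_HOME at a known-writable cache directory before importing transformers
import os

os.environ["HF_HOME"] = "/home/exouser/hf_cache"  # placeholder path; adjust to your setup

from transformers import AutoModelForCausalLM  # transformers / huggingface_hub pick up HF_HOME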
Oh, but the .dev version may be a bit suspect, since multiple builds can exist even under the same version number.
I am beginning to think that it is a cluster issue, but I am not sure. When I create a new instance, I can run the code without hitting this issue for a while, and then at some point it just starts running into the initialization error.
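In case it helps narrow this down, here is a rough sketch of a probe I plan to leave running on the instance to log when CUDA initialization starts failing (the log path and interval are just placeholders):

# Periodically probe CUDA init in a fresh subprocess (CUDA state is cached per process)
import datetime
import subprocess
import time

while True:
    stamp = datetime.datetime.now().isoformat()
    probe = subprocess.run(
        ["python", "-c", "import torch; torch.cuda.init(); print(torch.cuda.device_count())"],
        capture_output=True,
        text=True,
    )
    if probe.returncode == 0:
        status = "ok, devices=" + probe.stdout.strip()
    else:
        err_lines = probe.stderr.strip().splitlines()
        status = "failed: " + (err_lines[-1] if err_lines else "no stderr")
    with open("cuda_probe.log", "a") as log:
        log.write(stamp + " " + status + "\n")
    time.sleep(300)  # placeholder interval: every 5 minutes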