RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I am using the following code to fine-tune Llama-7B with LoRA:

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

ds = Dataset.load_from_disk("../data/alpaca_data_zh/")
tokenizer = AutoTokenizer.from_pretrained("../model/Llama-2-7b-ms")
def process_func(example):
    ... # process data
tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)

model = AutoModelForCausalLM.from_pretrained("../model/Llama-2-7b-ms", low_cpu_mem_usage=True, 
                                             torch_dtype=torch.half, device_map="auto")
config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, config)
model.enable_input_require_grads()
args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1,
    gradient_checkpointing=True
)
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=tokenized_ds.select(range(6000)),
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()

Everything goes well until trainer.train(), which reports the following warnings:

/home/wtx/miniconda3/envs/llm/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libstdc++.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libm.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)

I have tried adding /usr/lib/x86_64-linux-gnu, which contains libstdc++.so.6 and libm.so.6, to $LD_LIBRARY_PATH, but it still can't find them, and the training reports the following error:

Traceback (most recent call last):
  File "/home/wtx/workspace/python_project/LLM/Transformers/train.py", line 154, in <module>
    trainer.train()
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
    loss.backward(**kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

I'd appreciate it if someone could give me some advice.

Here are my library versions; please tell me if you need more information:

  • OS: Ubuntu 22.04
  • PyTorch: 2.1.0
  • CUDA: 11.8
  • accelerate: 0.34.2
  • transformers: 4.44.2

The full log is here:

/home/wtx/miniconda3/envs/llm/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libstdc++.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libm.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::runtime_error::~runtime_error()@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `__gxx_personality_v0@CXXABI_1.3'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::ostream::tellp()@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::string::substr(unsigned long, unsigned long) const@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::string::_M_replace_aux(unsigned long, unsigned long, unsigned long, char)@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `dlopen'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `typeinfo for bool@CXXABI_1.3'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::__throw_logic_error(char const*)@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `VTT for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@GLIBCXX_3.4'
... # similar output
collect2: error: ld returned 1 exit status

Traceback (most recent call last):
  File "/home/wtx/workspace/python_project/LLM/Transformers/train.py", line 154, in <module>
    trainer.train()
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
    loss.backward(**kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
... # similar output


This issue sounds tricky. The workaround, if there is one, is to change the CUDA version or to reduce VRAM usage.
If there is no workaround, it may be an unresolved bug.

In your case, you have already specified device_map="auto", so as long as the accelerate library is properly installed with pip, the model should be offloaded as much as possible.
The only thing left to do is to somehow reduce the amount of data being passed through.
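
For reference, a minimal sketch of what capping GPU memory and allowing CPU offload looks like with device_map="auto" (the memory figures and offload folder below are placeholders, not values from this thread):

import torch
from transformers import AutoModelForCausalLM

# Sketch: let accelerate place layers automatically, but cap GPU memory and
# spill the remaining weights to CPU RAM (and to disk if even that overflows).
model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    torch_dtype=torch.half,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},  # placeholder limits
    offload_folder="./offload",               # only used if weights spill past CPU RAM
)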

Hi,

Would recommend the following: Training Model on CPU instead of GPU - #2 by sgugger.
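
If you go the CPU route from that link, here is a minimal sketch, assuming a recent TrainingArguments that accepts use_cpu (fine-tuning a 7B model on CPU will be very slow, though):

from transformers import TrainingArguments

# Sketch: same arguments as the original script, but forcing CPU training.
args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1,
    use_cpu=True,  # run training on CPU instead of GPU (very slow for 7B)
)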


Magically, after changing device_map='auto' to device_map='cuda', everything works fine.
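
For anyone hitting the same error, that is this one change in the loading call:

model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    low_cpu_mem_usage=True,
    torch_dtype=torch.half,
    device_map="cuda",  # was device_map="auto"
)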


Is it a bug in the accelerate library…?
Buggy behavior around accelerate often goes unreported to the developers, because basically no one knows whether it's really a bug in accelerate or not…

Thanks for your help! BTW, after I changed to device_map='cuda', it only uses one GPU to train. Can you tell me how to use multiple GPUs in this situation? :slight_smile:

Actually I don't know; I am a newbie to transformers and I copied this code from someone else. In his video, everything works fine.

device_map (Dict[str, Union[str, int, torch.device]]) — A dictionary mapping module names in the model's state_dict to the device they should go to. Note that "disk" is accepted even if it's not a proper value for torch.device.

In other words, device_map=[0, 1] (maybe I'm wrong; see the manual above) or something like that should work. In this case, it means CUDA:0 and CUDA:1 are used. (I don't have a multi-GPU PC, so maybe.)
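
Not from this thread, just a hedged sketch: with Trainer, the usual way to use several GPUs for this kind of fine-tuning is data parallelism, i.e. keep one full copy of the model per GPU and launch the script with torchrun (or accelerate launch) rather than sharding the model through device_map. Roughly:

# Sketch: load the model without a device_map and let Trainer/DDP place it.
# Launch with:   torchrun --nproc_per_node=2 train.py
# or:            accelerate launch --num_processes 2 train.py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    low_cpu_mem_usage=True,
    torch_dtype=torch.half,  # fp16 weights, as in the original script
)
# ... same LoRA / TrainingArguments / Trainer setup as above ...
# Trainer detects the distributed launch and runs one process per GPU, so
# per_device_train_batch_size applies to each GPU separately.

As for the docs quoted above: an explicit device_map is a dict of module names, not a list, but splitting one model across GPUs that way is model parallelism, which is closer to the device_map="auto" setup that was causing trouble earlier, so DDP is probably the simpler route here.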


If the code was working for its author, then the environment is what's suspicious… Is the version of each library roughly the same as the author's? Or is it simply your environment?
If a library version differs even by 0.1, there can be quite a few version-specific bugs. I don't remember every single one of them…
Also, I don't think Linux is so bad here, but CUDA installations are often busted, especially in Windows environments.

It's simply my environment.


Yeah, that's probably it.
I don't think I can reproduce the bugs and normal operation properly outside of a virtual environment either.
