RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I am using the following code to fine-tune Llama-7B with LoRA:

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

ds = Dataset.load_from_disk("../data/alpaca_data_zh/")
tokenizer = AutoTokenizer.from_pretrained("../model/Llama-2-7b-ms")
def process_func(example):
    ... # process data
tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)

model = AutoModelForCausalLM.from_pretrained("../model/Llama-2-7b-ms", low_cpu_mem_usage=True, 
                                             torch_dtype=torch.half, device_map="auto")
config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, config)
model.enable_input_require_grads()
args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1,
    gradient_checkpointing=True
)
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=tokenized_ds.select(range(6000)),
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()

Everything goes well until trainer.train(), which reports the following warnings:

/home/wtx/miniconda3/envs/llm/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libstdc++.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libm.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)

I have tried adding /usr/lib/x86_64-linux-gnu, which contains libstdc++.so.6 and libm.so.6, to $LD_LIBRARY_PATH, but it still can't find them, and the training reports the following error:

Traceback (most recent call last):
  File "/home/wtx/workspace/python_project/LLM/Transformers/train.py", line 154, in <module>
    trainer.train()
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
    loss.backward(**kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

I'd appreciate it if someone could give me some advice.

Here are my library versions; please tell me if you need more information:

  • OS: Ubuntu 22.04
  • PyTorch: 2.1.0
  • CUDA: 11.8
  • accelerate: 0.34.2
  • transformers: 4.44.2

The full log is here:

/home/wtx/miniconda3/envs/llm/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libstdc++.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: warning: libm.so.6, needed by /home/wtx/.local/cuda-11.8/lib64/libcufile.so, not found (try using -rpath or -rpath-link)
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::runtime_error::~runtime_error()@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `__gxx_personality_v0@CXXABI_1.3'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::ostream::tellp()@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::string::substr(unsigned long, unsigned long) const@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::string::_M_replace_aux(unsigned long, unsigned long, unsigned long, char)@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `dlopen'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `typeinfo for bool@CXXABI_1.3'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `std::__throw_logic_error(char const*)@GLIBCXX_3.4'
/home/wtx/miniconda3/envs/llm/compiler_compat/ld: /home/wtx/.local/cuda-11.8/lib64/libcufile.so: undefined reference to `VTT for std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@GLIBCXX_3.4'
... # similar output
collect2: error: ld returned 1 exit status

Traceback (most recent call last):
  File "/home/wtx/workspace/python_project/LLM/Transformers/train.py", line 154, in <module>
    trainer.train()
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
    loss.backward(**kwargs)
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/wtx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
... # similar output


This issue sounds tricky. The workaround, if there is one, is to change the CUDA version or to reduce VRAM usage.
If there is no workaround, it may be an unresolved bug.

In your case, you have already specified device_map="auto", so as long as the accelerate library is properly installed with pip, the model should be offloaded as much as possible.
The only thing left to do is to somehow reduce the amount of data being passed through.
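
For reference, a minimal sketch of what capping GPU memory and allowing CPU offload looks like with device_map="auto" (the memory figures and offload folder below are placeholders, not values from this thread):

import torch
from transformers import AutoModelForCausalLM

# Sketch: let accelerate place layers automatically, but cap GPU memory and
# spill the remaining weights to CPU RAM (and to disk if even that overflows).
model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    torch_dtype=torch.half,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},  # placeholder limits
    offload_folder="./offload",               # only used if weights spill past CPU RAM
)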

Hi,

Would recommend the following: Training Model on CPU instead of GPU - #2 by sgugger.
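
If you go the CPU route from that link, here is a minimal sketch, assuming a recent TrainingArguments that accepts use_cpu (fine-tuning a 7B model on CPU will be very slow, though):

from transformers import TrainingArguments

# Sketch: same arguments as the original script, but forcing CPU training.
args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1,
    use_cpu=True,  # run training on CPU instead of GPU (very slow for 7B)
)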


Magically, after changing device_map='auto' to device_map='cuda', everything works fine.
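
For anyone hitting the same error, that is this one change in the loading call:

model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    low_cpu_mem_usage=True,
    torch_dtype=torch.half,
    device_map="cuda",  # was device_map="auto"
)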


Is it a bug in the accelerate library…?
Buggy behavior around accelerate often goes unreported to the developers, because basically no one knows whether it's really a bug in accelerate or not…

Thanks for your help! BTW, after I changed to device_map='cuda', it only uses one GPU to train. Can you tell me how to use multiple GPUs in this situation? :slight_smile:

Actually I don't know; I am a newbie to transformers and I copied this code from someone else. In his video, everything works fine.

device_map (Dict[str, Union[str, int, torch.device]]) — A dictionary mapping module names in the model's state_dict to the device they should go to. Note that "disk" is accepted even if it's not a proper value for torch.device.

In other words, device_map=[0, 1] (maybe I'm wrong; see the manual above) or something like that should work. In this case, it means CUDA:0 and CUDA:1 are used. (I don't have a multi-GPU PC, so maybe.)
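
Not from this thread, just a hedged sketch: with Trainer, the usual way to use several GPUs for this kind of fine-tuning is data parallelism, i.e. keep one full copy of the model per GPU and launch the script with torchrun (or accelerate launch) rather than sharding the model through device_map. Roughly:

# Sketch: load the model without a device_map and let Trainer/DDP place it.
# Launch with:   torchrun --nproc_per_node=2 train.py
# or:            accelerate launch --num_processes 2 train.py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "../model/Llama-2-7b-ms",
    low_cpu_mem_usage=True,
    torch_dtype=torch.half,  # fp16 weights, as in the original script
)
# ... same LoRA / TrainingArguments / Trainer setup as above ...
# Trainer detects the distributed launch and runs one process per GPU, so
# per_device_train_batch_size applies to each GPU separately.

As for the docs quoted above: an explicit device_map is a dict of module names, not a list, but splitting one model across GPUs that way is model parallelism, which is closer to the device_map="auto" setup that was causing trouble earlier, so DDP is probably the simpler route here.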


If the code was working for its author, then the environment is what's suspicious… Is the version of each library roughly the same as the author's? Or is it simply your environment?
If a library version differs even by 0.1, there can be quite a few version-specific bugs. I don't remember every single one of them…
Also, I don't think Linux is so bad here, but CUDA installations are often busted, especially in Windows environments.

It's simply my environment.


Yeah, that's probably it.
I don't think I can reproduce the bugs and normal operation properly outside of a virtual environment either.
