Bitsandbytes and CUBLAS_STATUS_NOT_INITIALIZED error

I am trying to run the microsoft/deberta-v3-xsmall model with bitsandbytes and end up with `CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)`.

Everything runs fine without bitsandbytes or with just LoRA, but bitsandbytes causes the error. I searched the internet, and one possible cause is mismatched matrix dimensions (an incorrect number of labels). Is it possible to manually step through the model and see the dimensions of each layer's output?
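One quick way to rule out the label-count explanation is to check that every label id lies in the range [0, num_labels). A minimal sketch in plain Python; `check_label_range` is a hypothetical helper, not part of transformers, and the label values below are only illustrative:

```python
def check_label_range(labels, num_labels):
    """Return the label ids that fall outside [0, num_labels)."""
    return sorted({int(y) for y in labels if not 0 <= int(y) < num_labels})


# Illustrative label ids checked against two different classifier head sizes.
print(check_label_range([30, 21], num_labels=65))  # []
print(check_label_range([30, 21], num_labels=2))   # [21, 30]
```

An empty list means the labels fit the classification head, so the cuBLAS error most likely has another cause.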

How else can I troubleshoot?
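For stepping through the model, PyTorch forward hooks are one option: each submodule reports the shape of its output during a normal forward pass. The sketch below only relies on `named_modules()` and `register_forward_hook()`; the helper name `record_output_shapes` is made up for illustration, and it assumes `model` and a tokenized `batch` already exist.

```python
def record_output_shapes(model):
    """Attach forward hooks that record (name, layer type, output shape) per submodule."""
    records, handles = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            # Some layers return tuples (e.g. hidden states plus attention weights).
            first = output[0] if isinstance(output, (tuple, list)) else output
            shape = tuple(getattr(first, "shape", ()))
            records.append((name, type(module).__name__, shape))
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return records, handles


# Usage sketch (assumes `model` and a tokenized `batch` already exist):
# records, handles = record_output_shapes(model)
# model(**batch)
# for name, layer, shape in records:
#     print(name, layer, shape)
# for h in handles:
#     h.remove()  # detach the hooks afterwards
```

Running a single small batch this way shows the last layer that completed before the failure.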

Here is a link to the notebook:
https://www.kaggle.com/code/bridgeport/bitsandbytes-and-cublas-status-not-initialized

Here is the full error message:

/tmp/ipykernel_36/2123443610.py:2: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Trainer.__init__. Use processing_class instead.
  trainer = Trainer(
No label_names provided for model class PeftModelForSequenceClassification. Since PeftModel hides base models input arguments, if label_names is not given, label_names can't be set automatically within Trainer. Note that empty label_names list will be used instead.
/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py:745: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_36/2123443610.py in <cell line: 0>()
      9 )
     10
---> 11 trainer.train()

/usr/local/lib/python3.11/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2238                 hf_hub_utils.enable_progress_bars()
   2239         else:
-> 2240             return inner_training_loop(
   2241                 args=args,
   2242                 resume_from_checkpoint=resume_from_checkpoint,

/usr/local/lib/python3.11/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2553                     )
   2554                     with context():
-> 2555                         tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
   2556
   2557                     if (

/usr/local/lib/python3.11/dist-packages/transformers/trainer.py in training_step(self, model, inputs, num_items_in_batch)
   3743
   3744         with self.compute_loss_context_manager():
-> 3745             loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
   3746
   3747         del inputs

/usr/local/lib/python3.11/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs, num_items_in_batch)
   3808                 loss_kwargs["num_items_in_batch"] = num_items_in_batch
   3809             inputs = {**inputs, **loss_kwargs}
-> 3810         outputs = model(**inputs)
   3811         # Save past state if it exists
   3812         # TODO: this needs to be fixed and made cleaner later.

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1737             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1738         else:
-> 1739             return self._call_impl(*args, **kwargs)
   1740
   1741     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1748                 or _global_backward_pre_hooks or _global_backward_hooks
   1749                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750             return forward_call(*args, **kwargs)
   1751
   1752         result = None

/usr/local/lib/python3.11/dist-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    191                 return self.module(*inputs[0], **module_kwargs[0])
    192             replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
--> 193             outputs = self.parallel_apply(replicas, inputs, module_kwargs)
    194             return self.gather(outputs, self.output_device)
    195

/usr/local/lib/python3.11/dist-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    210         self, replicas: Sequence[T], inputs: Sequence[Any], kwargs: Any
    211     ) -> List[Any]:
--> 212         return parallel_apply(
    213             replicas, inputs, kwargs, self.device_ids[: len(replicas)]
    214         )

/usr/local/lib/python3.11/dist-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
    124         output = results[i]
    125         if isinstance(output, ExceptionWrapper):
--> 126             output.reraise()
    127         outputs.append(output)
    128     return outputs

/usr/local/lib/python3.11/dist-packages/torch/_utils.py in reraise(self)
    731             # instantiate since we don't know how to
    732             raise RuntimeError(msg) from None
--> 733         raise exception
    734
    735
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/parallel/parallel_apply.py", line 96, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/peft/peft_model.py", line 1559, in forward
    return self.base_model(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/peft/tuners/tuners_utils.py", line 193, in forward
    return self.model.forward(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 1089, in forward
    outputs = self.deberta(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 796, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 659, in forward
    output_states, attn_weights = self._gradient_checkpointing_func(
  File "/usr/local/lib/python3.11/dist-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.11/dist-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/checkpoint.py", line 264, in forward
    outputs = run_function(*args)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 437, in forward
    attention_output, att_matrix = self.attention(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 370, in forward
    self_output, att_matrix = self.self(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 235, in forward
    query_layer = self.transpose_for_scores(self.query_proj(query_states), self.num_attention_heads)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/nn/modules.py", line 565, in forward
    return bnb.matmul_4bit(x, weight, bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/autograd/_functions.py", line 466, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.11/dist-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/autograd/_functions.py", line 380, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)


If possible, the best fix is to update bitsandbytes to a current release. Note that each release supports only a limited set of CUDA versions.

pip uninstall -y torch torchvision torchaudio bitsandbytes
pip install -U bitsandbytes
# pick one CUDA line that Kaggle supports; 12.1 and 12.6 both work if your driver allows
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
python - <<'PY'
import torch, bitsandbytes as bnb
print("torch:", torch.__version__, "cuda:", torch.version.cuda, "available:", torch.cuda.is_available())
import subprocess, sys; subprocess.run([sys.executable, "-m", "bitsandbytes"])
PY

Thank you for your answer!!!

I first ran the following lines (on Kaggle):

!nvcc --version

Output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

import platform
import sys
import torch
print(f"platform: {platform.system()}")
print(f"release: {platform.release()}")
print(f"version: {platform.version()}")
print(f"machine: {platform.machine()}")
print(f"compiler: {sys.version}")
print(f"GPU/TPU: {torch.cuda.get_device_name()}")

Output:

platform: Linux
release: 6.6.56+
version: #1 SMP PREEMPT_DYNAMIC Sun Nov 10 10:07:59 UTC 2024
machine: x86_64
compiler: 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
GPU/TPU: Tesla T4

Everything matches the installation guide's requirements.

Then I ran your code:

!pip uninstall -y torch torchvision torchaudio bitsandbytes
!pip install -U bitsandbytes
# pick one CUDA line that Kaggle supports; 12.1 and 12.6 both work if your driver allows
!pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!python - <<'PY'

and finally I tried:

import torch, bitsandbytes as bnb

It generated the following error (short version):

AssertionError: DeviceInterface member Event should be inherit from _EventBase

I also tried a different version:

!pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125

It generated a slightly different error.

I did all of this on a Kaggle T4x2 GPU.

Is there anything else I can try to tweak?

P.S. I could not get the formatting of the post to work.


It seems the errors come from the newly installed PyTorch colliding with the older version already present on Kaggle…

Option A: current stable CUDA 12.5 wheels + latest bnb

pip uninstall -y torch torchvision torchaudio bitsandbytes triton
pip cache purge
pip install --no-cache-dir --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu125
pip install --no-cache-dir -U bitsandbytes

Option B: known-good pair (often simplest on Kaggle)

pip uninstall -y torch torchvision torchaudio bitsandbytes triton
pip cache purge
pip install --no-cache-dir --force-reinstall torch==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U bitsandbytes==0.43.3

Then

import os; os.kill(os.getpid(), 9)  # forces Kaggle runtime restart

# verify (run in a new cell after the restart)
import torch, bitsandbytes as bnb
print("torch", torch.__version__, "CUDA", torch.version.cuda, "avail", torch.cuda.is_available())
import subprocess, sys; subprocess.run([sys.executable, "-m", "bitsandbytes"])  # prints detected CUDA backend

Thank you so much!

I tried option A and still got the same error: CUBLAS_STATUS_NOT_INITIALIZED.

Here are some details of #verify:

`torch 2.8.0+cu128 CUDA 12.8 avail True
=================== bitsandbytes v0.47.0 ===================
Platform: Linux-6.6.56+-x86_64-with-glibc2.35
libc: glibc-2.35
Python: 3.11.13
PyTorch: 2.8.0+cu128
CUDA: 12.8
HIP: N/A
XPU: N/A
Related packages:
accelerate: 1.8.1
diffusers: 0.34.0
numpy: 1.26.4
pip: 24.1.2
peft: 0.15.2
safetensors: 0.5.3
transformers: 4.52.4
triton: 3.4.0
trl: not found

PyTorch settings found: CUDA_VERSION=128, Highest Compute Capability: (7, 5).
Checking that the library is importable and CUDA is callable...
SUCCESS!
CompletedProcess(args=['/usr/bin/python3', '-m', 'bitsandbytes'], returncode=0)`

I also tried option B. It failed much sooner, right after these lines:

`from transformers import DebertaV2ForSequenceClassification
n_classes = 65
model = DebertaV2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=n_classes,
    quantization_config=q_config,
    ignore_mismatched_sizes=True,
)`

with error:

ValueError: Due to a serious vulnerability issue in torch.load, even with weights_only=True, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434
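The message implies a simple version gate: loading a checkpoint that is not in safetensors format needs torch 2.6 or newer. A rough sketch of such a comparison (not the actual transformers check; `version_tuple` and `meets_minimum` are illustrative names), using the two torch builds that appear in this thread:

```python
import re


def version_tuple(version_string):
    """Turn '2.4.1+cu121' into (2, 4, 1), ignoring the local build suffix."""
    public = version_string.split("+", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", public)[:3])


def meets_minimum(version_string, minimum=(2, 6)):
    """True if the version is at least `minimum` (major, minor)."""
    return version_tuple(version_string) >= minimum


print(meets_minimum("2.4.1+cu121"))  # False
print(meets_minimum("2.8.0+cu128"))  # True
```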

Here are details of #verify:

`torch 2.4.1+cu121 CUDA 12.1 avail True
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
CUDA specs: CUDASpecs(highest_compute_capability=(7, 5), cuda_version_string='121', cuda_version_tuple=(12, 1))
PyTorch settings found: CUDA_VERSION=121, Highest Compute Capability: (7, 5).
To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/docs/source/nonpytorchcuda.mdx
CUDA SETUP: WARNING! CUDA runtime files not found in any environmental path.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Checking that the library is importable and CUDA is callable...
SUCCESS!
Installation was successful!
CompletedProcess(args=['/usr/bin/python3', '-m', 'bitsandbytes'], returncode=0)`

I also tried to update option B to `torch==2.4.1`, but that also produced an error at the #verify stage:

ModuleNotFoundError: No module named 'triton.ops'


The latest bitsandbytes seems to support a fairly wide range of CUDA versions now.

CUBLAS_STATUS_NOT_INITIALIZED

Hmm… It might be caused by the T4x2 (dual-GPU) setup. Also, the T4 does not support bfloat16, so if you are using bfloat16, switch to float16.
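As a rough illustration of the dtype point: native bfloat16 needs compute capability 8.0 or higher (Ampere and newer), while the T4 reports (7, 5). A minimal sketch that picks the compute dtype from the capability tuple; `pick_compute_dtype` is a hypothetical helper, not a bitsandbytes or PyTorch function:

```python
def pick_compute_dtype(capability):
    """Return 'bfloat16' for compute capability 8.0+ (Ampere or newer), else 'float16'."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


print(pick_compute_dtype((7, 5)))  # float16  (Tesla T4)
print(pick_compute_dtype((8, 0)))  # bfloat16

# With PyTorch available, the capability can be read from the device:
# import torch
# dtype = getattr(torch, pick_compute_dtype(torch.cuda.get_device_capability(0)))
```

The resulting dtype would then go into `bnb_4bit_compute_dtype`.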

CVE-2025-32434

To load anything other than safetensors with an older version of PyTorch, you will need to downgrade Transformers. An older torch also requires an older triton.

Option B (Fixed)

pip uninstall -y torch torchvision torchaudio bitsandbytes triton
pip cache purge
pip install --no-cache-dir --force-reinstall torch==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U bitsandbytes==0.43.3 triton==3.1.0 transformers==4.48.3
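Since pip can pull in a different torch build as a dependency, it may help to confirm what actually ended up installed after the restart. A small sketch using only the standard library; the package list is just the set of packages discussed in this thread:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "triton", "bitsandbytes",
                                 "transformers", "accelerate", "peft")):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

If the printed torch version differs from the pinned one, a later install step replaced it.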

Thank you so much for not giving up on this issue :)

I updated the bnb setup to use torch.float16 instead of torch.bfloat16.

I tried the new Option B (Fixed).

Here is a link to the installation details (they are very long):

Installation details

Here is the output of the verify stage:

torch 2.5.1+cu124 CUDA 12.4 avail True
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
CUDA specs: CUDASpecs(highest_compute_capability=(7, 5), cuda_version_string='124', cuda_version_tuple=(12, 4))
PyTorch settings found: CUDA_VERSION=124, Highest Compute Capability: (7, 5).
To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/docs/source/nonpytorchcuda.mdx
CUDA SETUP: WARNING! CUDA runtime files not found in any environmental path.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Checking that the library is importable and CUDA is callable...
SUCCESS!
Installation was successful!
CompletedProcess(args=['/usr/bin/python3', '-m', 'bitsandbytes'], returncode=0)

Here is the new error: AttributeError: 'Tensor' object has no attribute 'quant_state'

Here is the full trace: trace


I think I found the same phenomenon.

I updated my code:

from transformers import DebertaV2ForSequenceClassification
n_classes = 65
model = DebertaV2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=n_classes,
    quantization_config=q_config,
    device_map="auto",
)

But got the following error:

ValueError: DebertaV2ForSequenceClassification does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

This is very puzzling. I tried the same thing in Colab and it worked. In Colab I used only a single line:

!pip install -U bitsandbytes

but Colab raises an AssertionError after the trainer.train() line.


You also need Accelerate installed.

pip uninstall -y torch torchvision torchaudio bitsandbytes triton
pip cache purge
pip install --no-cache-dir --force-reinstall torch==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U bitsandbytes==0.43.3 triton==3.1.0 transformers==4.48.3 accelerate

Thank you.

I ran the newly suggested code.

With device_map='auto' I am still getting the ValueError shown above.

Without device_map='auto', I am getting:

AttributeError: 'Tensor' object has no attribute 'quant_state'

AttributeError: 'Tensor' object has no attribute 'quant_state'

This error tends to occur when calling .to("cuda"), .cuda(), or .to("cpu") on a model loaded via bitsandbytes.
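One way to see whether the quantized weights were affected is to walk the model and list 4-bit layers whose weight no longer carries a `quant_state`. This is only a sketch: matching layers by the class name `Linear4bit` is an assumption about how the quantized layers are named, and `find_unquantized_4bit_layers` is a hypothetical helper.

```python
def find_unquantized_4bit_layers(model):
    """Return names of Linear4bit layers whose weight has no usable quant_state."""
    broken = []
    for name, module in model.named_modules():
        # Assumption: quantized layers are instances of a class named Linear4bit.
        if type(module).__name__ != "Linear4bit":
            continue
        weight = getattr(module, "weight", None)
        if getattr(weight, "quant_state", None) is None:
            broken.append(name)
    return broken


# Usage sketch (assumes `model` was loaded with a 4-bit quantization config):
# print(find_unquantized_4bit_layers(model))
```

A non-empty result right after loading would suggest the weights were moved or replaced before they were used.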

Thank you, I will need to read and understand the link that you shared.

My model, microsoft/deberta-v3-xsmall, is fairly small, and I am not moving it between CPU and GPU. I am simply trying to get bnb to work, nothing fancy.


Hmm… Try device_map="cuda". For now, this works on Colab Free (GPU).

!pip install -U torch==2.4.1 torchvision --index-url https://download.pytorch.org/whl/cu121
!pip install -U bitsandbytes==0.43.3 transformers==4.48.3 accelerate triton==3.1.0
import os, torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig

assert torch.cuda.is_available(), "CUDA required."

model_id = "microsoft/deberta-v3-xsmall"

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, # T4 lacks BF16.
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    revision="refs/pr/4",
    num_labels=2,
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,
    device_map="cuda",
).eval()

tok = AutoTokenizer.from_pretrained(model_id)
x = tok(["this works"], padding=True, truncation=True, return_tensors="pt").to("cuda")

with torch.inference_mode():
    out = model(**x)
print(out) # SequenceClassifierOutput(loss=None, logits=tensor([[-0.0417, -0.0558]], device='cuda:0'), hidden_states=None, attentions=None)

Thank you so much! I tried to adapt your example to the training case:

!pip install -U torch==2.4.1 torchvision --index-url https://download.pytorch.org/whl/cu121
!pip install -U bitsandbytes==0.43.3 transformers==4.48.3 accelerate triton==3.1.0
import os; os.kill(os.getpid(), 9)  # restart the kernel, then run the rest in a new cell

import os
import pandas as pd
from datasets import Dataset
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, BitsAndBytesConfig
from transformers import AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

my_df = pd.DataFrame({'text': ['This is our text to train', 'Even more training is here'], 'label': [30, 21] })
my_df_ds =  Dataset.from_pandas(my_df)
model_id = "microsoft/deberta-v3-xsmall"

tokenizer = AutoTokenizer.from_pretrained(model_id)
MAX_LEN = 256

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=MAX_LEN)
my_df_ds = my_df_ds.map(tokenize, batched=True)
assert torch.cuda.is_available(), "CUDA required."

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, # T4 lacks BF16.
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    revision="refs/pr/4",
    num_labels=65,  # must exceed the largest label id above (30); 65 matches n_classes in the notebook
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,
    device_map="cuda",
)

from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules=["value_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)

from peft import get_peft_model

model = get_peft_model(model, config)

training_args = TrainingArguments(
    output_dir = "Temp",
    do_train=True,
    save_strategy="steps",  # use "no" to disable saving
    num_train_epochs=1,
    per_device_train_batch_size=16*2,
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_steps=50,
    save_steps=200,
    save_total_limit=1,
    greater_is_better=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_df_ds,
    tokenizer=tokenizer
)

trainer.train()

It generates the following error (on Kaggle):

TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
 * (torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!)
 * (str type, int index = -1)

Are you using DataParallel…?
How about trying device_map="cuda:0" instead of device_map="cuda", or setting os.environ["CUDA_VISIBLE_DEVICES"] = "0" beforehand?

I can try it. I am running my code on Kaggle T4x2, so I added this line: os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

Was I supposed to use something else with this line?


on Kaggle T4x2

To avoid the errors, make PyTorch see only a single GPU in that environment:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
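A minimal sketch of that setting, assuming it runs as the first cell of a fresh kernel before anything initializes CUDA (the variable is read when CUDA starts up, so changing it later in the same process has no effect). The original traceback passes through torch/nn/parallel/data_parallel.py, which indicates the model was being replicated across both visible GPUs. `restrict_visible_gpus` is an illustrative helper name:

```python
import os


def restrict_visible_gpus(indices=(0,)):
    """Set CUDA_VISIBLE_DEVICES; only effective if run before CUDA is initialized."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_visible_gpus((0,))  # expose only the first T4

# Afterwards, in the same fresh kernel:
# import torch
# print(torch.cuda.device_count())
```

If the device count still shows two GPUs, the kernel had most likely initialized CUDA before the variable was set and needs a restart.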

If you instead want to use all GPUs, it is preferable to explicitly launch the script through a multi-GPU backend such as Accelerate. That approach handles the distributed setup internally, which makes it more complex. I believe Kaggle has documentation on this…

Ok, I changed to

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

but still got the same error.


Hmm… Try device_map="cuda:0", too.

Edit:
In a multi-GPU environment, you have to run the script like this to get the multi-GPU setup working properly, but I don’t know how that’s handled on Kaggle…