RuntimeError: CUDA error: named symbol not found when using TorchAoConfig with Qwen2.5-VL-7B-Instruct model

I’m trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to the approach described in the documentation here), but I’m getting a CUDA-related runtime error.

Code:

from transformers import Qwen2_5_VLForConditionalGeneration, TorchAoConfig, AutoProcessor
import torch

torch.cuda.empty_cache()

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

I got the following error:

RuntimeError                              Traceback (most recent call last)
/tmp/ipython-input-9-2218636408.py in <cell line: 0>()
     13 
     14 quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
---> 15 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
     16     "Qwen/Qwen2.5-VL-7B-Instruct",
     17     torch_dtype=torch.bfloat16,

12 frames
/usr/local/lib/python3.11/dist-packages/torchao/quantization/utils.py in pack_tinygemm_scales_and_zeros(scales, zeros, dtype)
    356     guard_dtype_size(zeros, "zeros", dtype=dtype)
    357     return (
--> 358         torch.cat(
    359             [
    360                 scales.reshape(scales.size(0), scales.size(1), 1),

RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I’m new to this and am probably missing something simple. Any help or insights would be appreciated!


I’m not sure if this is the cause, but it seems to happen reliably when the GPU does not natively support bfloat16. Among GeForce cards, bfloat16 is supported on the RTX 30x0 (Ampere) series and later.
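You can check what your GPU supports before loading the model. A minimal sketch using standard torch.cuda APIs:

import torch

# Ampere (compute capability 8.0, e.g. RTX 30x0 / A100) and newer GPUs
# support bfloat16 natively; older cards such as Colab's T4 (7.5) do not.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print(f"Native bf16 support: {torch.cuda.is_bf16_supported()}")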


I’ve seen this before, and the following worked for me. I also noticed you’re not using a venv; make sure you use one every time.

python3 -m venv hfenv
source hfenv/bin/activate

pip install --upgrade pip
pip install --upgrade torch torchao --extra-index-url https://download.pytorch.org/whl/cu121

rm -rf ~/.cache/torch_extensions/

If that doesn’t work, you can rerun with debugging enabled so the stack trace points at the actual failure:

CUDA_LAUNCH_BLOCKING=1 python3 your_script.py


Umm, I think I forgot to mention, my bad, but I’m using Google Colab.


If on Colab:

!pip install --upgrade torch torchao --extra-index-url https://download.pytorch.org/whl/cu121
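
On Colab you can also enable the debug flag from a cell instead of the command line. A minimal sketch; it has to run before torch touches CUDA:

# Set before the first CUDA call so kernel launches are synchronous
# and the traceback points at the kernel that actually failed.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"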

As a “workaround”, simply using BitsAndBytesConfig worked for me. But let me also try what you suggested, because it’s honestly confusing why the approach mentioned in the documentation doesn’t work. Thanks!!!
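
For reference, the bitsandbytes version I used looks roughly like this (a sketch; the float16 compute dtype is my assumption so it also runs on pre-Ampere GPUs like Colab’s T4):

from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization via bitsandbytes instead of torchao.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    quantization_config=bnb_config,
)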
