I’m trying to load the Qwen2.5-VL-7B-Instruct model from Hugging Face with 4-bit weight-only quantization using TorchAoConfig (similar to the example in the documentation here), but I’m getting a runtime error related to CUDA.
Code:
from transformers import Qwen2_5_VLForConditionalGeneration, TorchAoConfig, AutoProcessor
import torch
torch.cuda.empty_cache()
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
I got the following error:
RuntimeError Traceback (most recent call last)
/tmp/ipython-input-9-2218636408.py in <cell line: 0>()
13
14 quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
---> 15 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
16 "Qwen/Qwen2.5-VL-7B-Instruct",
17 torch_dtype=torch.bfloat16,
(12 frames hidden)
/usr/local/lib/python3.11/dist-packages/torchao/quantization/utils.py in pack_tinygemm_scales_and_zeros(scales, zeros, dtype)
356 guard_dtype_size(zeros, "zeros", dtype=dtype)
357 return (
--> 358 torch.cat(
359 [
360 scales.reshape(scales.size(0), scales.size(1), 1),
RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I’m new to this and am probably missing something simple. Any help or insights would be appreciated!
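For reference, here is a quick check of my environment (PyTorch/torchao versions and the GPU's compute capability), in case the int4 tinygemm kernels depend on the GPU architecture or bf16 support — I'm not sure whether that's the cause:

import torch
import torchao

# Report library versions and GPU details, which may be relevant
# if the int4 weight-only kernels require a specific GPU architecture
print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("CUDA build:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())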