Qwen-VL Parallel GPU run not able to solve

Velan · September 16, 2023, 7:43am

I'm not able to pass a URL after setting device_map to "auto".

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("bibimbap/Qwen-VL-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
#model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("bibimbap/Qwen-VL-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or a URL
    {'text': 'What is this?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The picture shows a woman playing with a dog on a beach; next to her is a Labrador, and they are on the sand.

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Hello!', history=history)
print(response)
# <ref>high five</ref><box>(536,509),(588,602)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('1.jpg')
else:
  print("no box")

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!

I keep getting this error.
1. Running the model in fp32 works, but it is very slow, so that is not an option I want to use.
2. Loading all tensors onto one GPU gives a CUDA out of memory error.
3. When running in parallel, I also get the errors RuntimeError: “addmm_impl_cpu_” not implemented for ‘Half’ and slow_conv2d_cpu not implemented for ‘half’.
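
The "not implemented for 'Half'" messages typically appear when fp16 tensors are computed on the CPU, which can happen if device_map="auto" places some modules there. A minimal sketch for checking where the modules ended up is below. It assumes the placement is available as a dict such as model.hf_device_map (set by transformers when a device map is used); the module names in example_map are hypothetical and only illustrate the shape of that dict.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules were placed on each device."""
    return Counter(str(device) for device in device_map.values())


def offloaded_modules(device_map):
    """Return the names of modules placed on the CPU or on disk."""
    return sorted(
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    )


# Hypothetical placement for illustration only.
# With a loaded model, pass model.hf_device_map instead.
example_map = {
    "transformer.wte": 0,
    "transformer.visual": 0,
    "transformer.h.0": 1,
    "transformer.h.1": 2,
    "transformer.ln_f": 3,
    "lm_head": "cpu",
}

print(summarize_device_map(example_map))
print(offloaded_modules(example_map))
```

If offloaded_modules returns anything while the model is loaded in fp16, those modules are candidates for the CPU 'Half' errors. The output does not by itself explain the cuda:2 / cuda:3 mismatch, but it shows which modules sit on which GPU.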
GPU server used:
We have an Azure Standard_NC64as_T4_v3 server with 64 vCPUs, 440 GiB of memory, and 64 GiB of GPU memory in total. It looks like it's taking 16 GB of memory. It has 4 NVIDIA T4 GPUs with 16 GB of memory each, up to 64 non-multithreaded AMD EPYC 7V12 (Rome) processor cores, and 448 GiB of system memory.
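
With 4 GPUs of 16 GB each, one thing to try is giving device_map="auto" an explicit per-GPU budget through the max_memory argument of from_pretrained. The sketch below only builds that dict. The helper name build_max_memory and the 3 GiB of headroom per GPU are assumptions for illustration, not values from this thread, and whether this resolves the device mismatch is not confirmed here.

```python
def build_max_memory(num_gpus, gpu_total_gib, headroom_gib, cpu_gib=None):
    """Build a max_memory dict: GPU index -> budget string such as '13GiB'.

    headroom_gib is memory left free on each GPU for activations and
    the CUDA context (an assumed value; tune it for your workload).
    """
    if headroom_gib >= gpu_total_gib:
        raise ValueError("headroom must be smaller than the GPU memory")
    budget = {i: f"{gpu_total_gib - headroom_gib}GiB" for i in range(num_gpus)}
    if cpu_gib is not None:
        # Optional CPU budget; omit it to give the CPU no explicit entry.
        budget["cpu"] = f"{cpu_gib}GiB"
    return budget


# 4 x T4 with 16 GB each, keeping an assumed 3 GiB free per GPU.
max_memory = build_max_memory(num_gpus=4, gpu_total_gib=16, headroom_gib=3)
print(max_memory)

# Assumed usage with the loading call from the post:
# model = AutoModelForCausalLM.from_pretrained(
#     "bibimbap/Qwen-VL-Chat", device_map="auto", max_memory=max_memory,
#     trust_remote_code=True, fp16=True).eval()
```

Leaving the "cpu" key out keeps the budget limited to the GPUs, which matters here because fp16 weights on the CPU are what the 'Half' errors point at.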

Please provide a solution for this problem.

Nishgop · February 20, 2024, 4:50pm

Did you get this issue resolved?
