Qwen-VL multi-GPU run: device-mismatch error I can't solve

I'm not able to pass an image URL after setting device_map to "auto".

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("bibimbap/Qwen-VL-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
#model = AutoModelForCausalLM.from_pretrained("bibimbap/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("bibimbap/Qwen-VL-Chat", trust_remote_code=True)

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},  # either a local path or a URL
    {'text': 'What is this?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The picture shows a woman playing with her dog on the beach; the dog next to her is a Labrador.

# 2nd dialogue turn: a grounding query, so that draw_bbox_on_latest_picture below has a box to draw
response, history = model.chat(tokenizer, 'Frame the position of the high five in the image', history=history)
print(response)
# <ref>击掌</ref><box>(536,509),(588,602)</box>  ("击掌" = "high five")
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('1.jpg')
else:
  print("no box")

I keep getting this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!

What I have tried:
1. Running the model in fp32 works, but it is very slow, so that is not an option I want to use.
2. Loading all tensors onto a single GPU gives a CUDA out-of-memory error.
3. When running across multiple GPUs I also get RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' and "slow_conv2d_cpu" not implemented for 'Half', which suggests some fp16 modules end up on the CPU (see the sketch after this list).
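
The "not implemented for 'Half'" errors typically mean accelerate has offloaded part of the fp16 model to the CPU. A minimal sketch of one thing worth trying: cap each GPU with max_memory so the weights are forced to stay on the cards. This assumes four GPUs with indices 0-3; the 14GiB cap is my guess at sensible headroom below a T4's 16 GiB, so adjust for your machine:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bibimbap/Qwen-VL-Chat",
    device_map="auto",
    # cap each GPU so accelerate keeps all weights on the cards instead of spilling to CPU
    max_memory={0: "14GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB"},
    trust_remote_code=True,
    fp16=True,
).eval()
print(model.hf_device_map)  # shows which device each module actually landed on

If hf_device_map still lists any module on "cpu", the caps are too tight for the checkpoint and the placement needs rebalancing.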
GPU server used:
An Azure Standard_NC64as_T4_v3 instance: 4 NVIDIA T4 GPUs with 16 GB of memory each (64 GiB of GPU memory in total), 64 non-multithreaded AMD EPYC 7V12 (Rome) vCPUs, and 440 GiB of system memory. The model appears to take about 16 GB of memory when loaded.
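
Given the four 16 GB T4s, another option is to hand-build the device_map so the vision tower lives on the same GPU as the token embeddings; in sharded multimodal models, those two landing on different cards is a common cause of exactly this kind of cuda:2 / cuda:3 mismatch. The sketch below rests on my assumptions about Qwen-VL's module names ("transformer.visual", "transformer.wte", "transformer.h.N") and its 32-block depth; print(model.hf_device_map) from an "auto" load first and substitute the names you actually see:

from transformers import AutoModelForCausalLM

# keep the vision tower and the token embeddings together on GPU 0
device_map = {
    "transformer.wte": 0,
    "transformer.visual": 0,
    "transformer.ln_f": 3,
    "lm_head": 3,
}
# spread the (assumed) 32 transformer blocks evenly across the four T4s
for i in range(32):
    device_map[f"transformer.h.{i}"] = i // 8

model = AutoModelForCausalLM.from_pretrained(
    "bibimbap/Qwen-VL-Chat",
    device_map=device_map,
    trust_remote_code=True,
    fp16=True,
).eval()

If GPU 0 then runs out of memory (the vision tower adds a couple of GB there), shift some of the early blocks onto GPU 1.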

Please provide a solution to this problem.


Did you get this issue resolved?