ValueError: Image features and image tokens do not match

I am trying to use an assistant_model (assisted/speculative decoding) with LLaVA-OneVision 7B, but I can't get it to work. Transformers version: 4.51.2.
Reproducible code example:

from transformers import LlavaOnevisionForConditionalGeneration, LlavaOnevisionProcessor
from PIL import Image

import torch
import requests

# Example images from the documentation-images dataset
img_urls = ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"]
images = [Image.open(requests.get(img_urls[0], stream=True).raw),
          Image.open(requests.get(img_urls[1], stream=True).raw)]

# Target (7B) and draft (0.5B) models with their processors
target_processor = LlavaOnevisionProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
target_processor.tokenizer.padding_side = "left"
draft_processor = LlavaOnevisionProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
draft_processor.tokenizer.padding_side = "left"
target = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf").to("cuda")
draft = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf").to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in 500 words."}
        ]
    }
]

# Build the chat prompt and preprocess the first image
prompt = target_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = target_processor(text=prompt, images=[images[0]], return_tensors="pt").to("cuda")

# Assisted (speculative) decoding: the 0.5B model drafts tokens, the 7B model verifies them
with torch.no_grad():
    generated_ids = target.generate(
        **inputs,
        max_new_tokens=1000,
        assistant_model=draft,
        tokenizer=target_processor.tokenizer,
        assistant_tokenizer=draft_processor.tokenizer,
    )
generated_texts = target_processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)

I keep getting this error:

ValueError: Image features and image tokens do not match: tokens: 0, features 2709
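In case it helps narrow things down, here is a small inspection snippet (my own sketch, based on the repro above; the image_token_index attribute name is my assumption and may differ across versions) that prints how many expanded image placeholder tokens the prompt contains versus the pixel inputs the target model receives:

# Sketch: compare the number of <image> placeholder tokens against the pixel inputs.
image_token_id = target.config.image_token_index  # assumed attribute name
print("image tokens in input_ids:", (inputs["input_ids"] == image_token_id).sum().item())
print("pixel_values shape:", tuple(inputs["pixel_values"].shape))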


Similar issue?

This error seems easy to trigger, and the root cause is hard to pin down…
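One guess: the error reports tokens: 0, so by the time the assisted-generation check runs, the expanded image placeholder tokens seem to be missing from the sequence being compared. Passing both tokenizer and assistant_tokenizer makes generate use the different-tokenizer assisted-decoding path, which, as far as I understand, translates between models by decoding and re-encoding text, and that could drop the image placeholders. Since both LLaVA-OneVision checkpoints ship a Qwen2 tokenizer, one thing worth trying is omitting the tokenizer arguments so the standard assisted-decoding path is used. Untested sketch based on the repro above; it may still hit other limitations with multimodal draft models, but it isolates whether the different-tokenizer path is the problem:

# Untested workaround sketch: don't pass tokenizer/assistant_tokenizer, so the
# standard assisted-decoding path (shared tokenizer) is used instead.
with torch.no_grad():
    generated_ids = target.generate(**inputs, max_new_tokens=1000, assistant_model=draft)
print(target_processor.batch_decode(generated_ids, skip_special_tokens=True))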