ValueError: Image features and image tokens do not match

I am trying to use an assistant_model (assisted/speculative decoding) with LLaVA-OneVision 7B, but I can't get it to work. Transformers version: 4.51.2.
Reproducible code example:

from transformers import LlavaOnevisionForConditionalGeneration, LlavaOnevisionProcessor
from PIL import Image

import torch
import requests

# Example images from the documentation-images dataset
img_urls = ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"]
images = [Image.open(requests.get(img_urls[0], stream=True).raw),
          Image.open(requests.get(img_urls[1], stream=True).raw)]

# Target (7B) and draft (0.5B) models with their processors
target_processor = LlavaOnevisionProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
target_processor.tokenizer.padding_side = "left"
draft_processor = LlavaOnevisionProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
draft_processor.tokenizer.padding_side = "left"
target = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf").to("cuda")
draft = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf").to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in 500 words."}
        ]
    }
]

# Build the chat prompt and preprocess the first image
prompt = target_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = target_processor(text=prompt, images=[images[0]], return_tensors="pt").to("cuda")

# Assisted (speculative) decoding: the 0.5B model drafts tokens, the 7B model verifies them
with torch.no_grad():
    generated_ids = target.generate(
        **inputs,
        max_new_tokens=1000,
        assistant_model=draft,
        tokenizer=target_processor.tokenizer,
        assistant_tokenizer=draft_processor.tokenizer,
    )
generated_texts = target_processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)

I keep getting this error:

ValueError: Image features and image tokens do not match: tokens: 0, features 2709
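In case it helps narrow things down, here is a small inspection snippet (my own sketch, based on the repro above; the image_token_index attribute name is my assumption and may differ across versions) that prints how many expanded image placeholder tokens the prompt contains versus the pixel inputs the target model receives:

# Sketch: compare the number of <image> placeholder tokens against the pixel inputs.
image_token_id = target.config.image_token_index  # assumed attribute name
print("image tokens in input_ids:", (inputs["input_ids"] == image_token_id).sum().item())
print("pixel_values shape:", tuple(inputs["pixel_values"].shape))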


Similar issue?

This error seems easy to trigger, and the root cause is hard to pin down…
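One guess: the error reports tokens: 0, so by the time the assisted-generation check runs, the expanded image placeholder tokens seem to be missing from the sequence being compared. Passing both tokenizer and assistant_tokenizer makes generate use the different-tokenizer assisted-decoding path, which, as far as I understand, translates between models by decoding and re-encoding text, and that could drop the image placeholders. Since both LLaVA-OneVision checkpoints ship a Qwen2 tokenizer, one thing worth trying is omitting the tokenizer arguments so the standard assisted-decoding path is used. Untested sketch based on the repro above; it may still hit other limitations with multimodal draft models, but it isolates whether the different-tokenizer path is the problem:

# Untested workaround sketch: don't pass tokenizer/assistant_tokenizer, so the
# standard assisted-decoding path (shared tokenizer) is used instead.
with torch.no_grad():
    generated_ids = target.generate(**inputs, max_new_tokens=1000, assistant_model=draft)
print(target_processor.batch_decode(generated_ids, skip_special_tokens=True))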