Can't perform image inference with Gemma 3 12B IT QAT int4

Hi, I am a noob here. Could you please share a code snippet showing how to use this Gemma 3 version to perform inference on images? Specifically, I want it to filter images from an input folder into different output folders based on a set of criteria I outline in the prompt. The prompt tells the model to answer YES or NO depending on whether an image meets the criteria, and my code then uses that answer to move each image to the appropriate folder.


Here is my prompt:
<image_soft_token>
Analyze the image. Does it meet BOTH criteria: 1. At least 2 football players visible. 2. At least one player performing a clear football action (kick, tackle, dribble, save etc.)? Answer ONLY YES or NO.


Here is the output:
ERROR:root:STEP 3 FAILED: Error during processor preparation for laliga_image_100.jpeg: Prompt contained 0 image tokens but received 1 images.
Traceback (most recent call last):
  File "", line 40, in analyze_image_gemma3_transformers
    inputs = processor(text=PROMPT_TEXT_CLASSIFY, images=img, return_tensors="pt").to(device)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma3/processing_gemma3.py", line 122, in __call__
    raise ValueError(
ValueError: Prompt contained 0 image tokens but received 1 images.
Found 457 image files in 'Colab_Uploads/Football_Images_Input'.
Starting Gemma 3 Transformers processing loop for 457 images...
Processing time depends on hardware (cuda).
--- DIAGNOSTIC MODE: Processing ONLY the first file: laliga_image_100.jpeg ---
--- Starting analysis for: laliga_image_100.jpeg ---
STEP 2 SUCCESS: Loaded image laliga_image_100.jpeg
DEBUG: Prompt being passed to processor:

<image_soft_token>
Analyze the image. Does it meet BOTH criteria: 1. At least 2 football players visible. 2. At least one player performing a clear football action (kick, tackle, dribble, save etc.)? Answer ONLY YES or NO.

<<<
--- Finished analysis attempt for: laliga_image_100.jpeg ---
--- DIAGNOSTIC MODE: Finished processing laliga_image_100.jpeg ---
--- Gemma 3 Transformers Processing Session Complete ---
Images attempted in this session (Gemma3 TF): 1
  Successfully classified (YES/NO): 0
  Errors (moved to 'Football_Images_Errors_Gemma3_TF'): 1
  Images skipped (already processed): 0
  Estimated image files remaining in 'Colab_Uploads/Football_Images_Input': 456
Check the 'Football_Images_Errors_Gemma3_TF' folder for Gemma 3 TF processing errors.
Results are in 'Football_Images_Meets_Criteria_Gemma3_TF' and 'Football_Images_Does_Not_Meet_Gemma3_TF'.

Here is Gemini 2.5 Pro's suggestion:
Okay, the added print statement confirms it perfectly.

  1. The Variable is Correct: The DEBUG: Prompt being passed to processor: output clearly shows the string does start with <image_soft_token>\n… So, the variable PROMPT_TEXT_CLASSIFY is correctly updated and passed to the function.

  2. The Processor Fails: Despite receiving the correct prompt string containing the <image_soft_token>, the processor’s internal logic (/usr/local/lib/python3.11/dist-packages/transformers/models/gemma3/processing_gemma3.py, line 122) still fails to detect it and incorrectly reports finding 0 image tokens.

Conclusion:

This definitively looks like a bug within the Gemma3Processor implementation in the transformers library specifically for the model handle google/gemma-3/transformers/gemma-3-12b-it-qat-int4-unquantized (or perhaps for Gemma 3 processing in general in the current library version).

The processor is simply not correctly parsing the special token it claims to use (<image_soft_token>) from the text input when an image is also provided.


I am running this in Google Colab. Note that the <image_soft_token> tag shown above is included in my actual prompt exactly as written, angle brackets and all.

Gemini 2.5 Pro says it is a bug in the transformers implementation for this model and that I should report it on their GitHub, but I just want to be sure it isn't actually due to my own lack of knowledge. I would be very grateful for any help with this.


It seems that this error occurs when you pass an image file but the prompt does not mark the image in the way the processor expects; a raw <image_soft_token> string is not the placeholder the processor counts, and the chat template inserts the correct one for you.

It should be easier to understand if you look at the sample code for Gemma 3.

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, # Maybe you don't have such line
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]
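
For the folder-sorting part of your question, here is a minimal, untested sketch of how that pattern can be adapted to local files. The model id, folder names, file extension and generation settings below are assumptions (the folder names are taken from your log), so adjust them to the exact QAT checkpoint and paths you are using.

import shutil
from pathlib import Path

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

# Placeholders -- point these at your own checkpoint and folders.
MODEL_ID = "google/gemma-3-12b-it"  # swap in the QAT checkpoint you are loading
INPUT_DIR = Path("Colab_Uploads/Football_Images_Input")
YES_DIR = Path("Football_Images_Meets_Criteria_Gemma3_TF")
NO_DIR = Path("Football_Images_Does_Not_Meet_Gemma3_TF")

PROMPT_TEXT_CLASSIFY = (
    "Analyze the image. Does it meet BOTH criteria: 1. At least 2 football players "
    "visible. 2. At least one player performing a clear football action (kick, tackle, "
    "dribble, save etc.)? Answer ONLY YES or NO."
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Gemma3ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def classify_image(image_path: Path) -> str:
    """Return the model's YES/NO answer for a single image file."""
    messages = [
        {
            "role": "user",
            "content": [
                # A local file path works here; a URL or PIL image should too.
                {"type": "image", "image": str(image_path)},
                {"type": "text", "text": PROMPT_TEXT_CLASSIFY},
            ],
        }
    ]
    # The chat template inserts the image placeholder tokens the processor expects,
    # so there is no need to write <image_soft_token> into the prompt by hand.
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device, dtype=torch.bfloat16)
    input_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    answer = processor.decode(generation[0][input_len:], skip_special_tokens=True)
    return answer.strip().upper()

YES_DIR.mkdir(parents=True, exist_ok=True)
NO_DIR.mkdir(parents=True, exist_ok=True)
for path in sorted(INPUT_DIR.glob("*.jpeg")):  # adjust the glob for other extensions
    answer = classify_image(path)
    target_dir = YES_DIR if answer.startswith("YES") else NO_DIR
    shutil.move(str(path), str(target_dir / path.name))
    print(f"{path.name}: {answer}")

The key point is that apply_chat_template builds the prompt with the image placeholder the processor is looking for, which is exactly what the "Prompt contained 0 image tokens" check was complaining about; you can wrap your existing error-folder handling around the classify_image call.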