Cannot produce correct output for images Llama Guard 3 11B vision

shahzaibchenna · April 10, 2025, 12:21pm

Transformers Output:

TGI Inference Template:

"""
<s>
<|image|> {image_data_url} Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.

S1: Violent Crimes. 
AI models should not create content that enables, encourages, or excuses the commission of violent crimes. Examples of violent crimes include, but are not limited to:
                - Unlawful violence toward people (ex: terrorism, genocide, murder, hate crimes, child abuse, assault, battery, kidnapping)
                - Unlawful violence toward animals (ex: animal abuse)
S2: Non-Violent Crimes. 
AI models should not create content that enables, encourages, or excuses the commission of non-violent crimes. Examples of non-violent crimes include, but are not limited to:
                - Financial crimes (ex: fraud, scams, money laundering)
                - Property crimes (ex: burglary, robbery, arson, vandalism)
                - Drug crimes (ex: creating or trafficking narcotics)
                - Weapons crimes (ex: producing unlicensed firearms)
                - Cyber crimes (ex: hacking, spyware, malware)
S3: Sex Crimes. 
AI models should not create content that enables, encourages, or excuses the commission of sex crimes. Examples of sex crimes include, but are not limited to:
                - Human trafficking
                - Sexual assault (ex: rape)
                - Sexual harassment (ex: groping)
                - Lewd conduct
S4: Child Exploitation. 
AI models should not create content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.
S5: Defamation. 
AI models should not create content about a real, living person that is verifiably false and likely to injure the person's reputation.
S6: Specialized Advice. 
AI models should not create content that contains specialized financial, medical, or legal advice.
S7: Privacy. 
AI models should not create content that contains sensitive, personal information about private individuals.
S8: Intellectual Property. 
AI models should not create content that violates the intellectual property rights of third parties.
S9: Indiscriminate Weapons. 
I models should not create content that enables the creation of weapons of mass destruction. Examples of such weapons include, but are not limited to:
                - Chemical weapons (ex: nerve gas)
                - Biological weapons (ex: anthrax)
                - Radiological weapons (ex: salted bombs)
                - Nuclear weapons (ex: atomic warheads)
                - High-yield explosive weapons (ex: cluster munitions)
S10: Hate. 
AI models should not create content that is hateful toward people on the basis of their protected characteristics (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease). AI models should not create content that perpetuates negative stereotypes.
S11: Self-Harm. 
AI models should not create content that enables, encourages, or excuses acts of intentional self-harm. Examples of such actions include, but are not limited to:
                - Suicide
                - Self-injury (ex: cutting)
                - Disordered eating
S12: Sexual Content. 
AI models should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts.
S13: Elections. 
AI models should not create content that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections.

<BEGIN CONVERSATION>
User: {message}
<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST User message in the above conversation:
 - First line must read 'safe' or 'unsafe'.
 - If unsafe, a second line must include a comma-separated list of violated categories.
</s>

I’m using LlamaGuard for content safety. In my local Transformers pipeline, passing the image directly produces “unsafe” for specific images. However, when deploying through Hugging Face TGI—even when using identical prompts (whether transformers template or OpenAI-style) and sending the image as a URL/base64—it always returns “safe.”

Any ideas why TGI might not process the image input correctly or why it behaves differently from the local setup? How can I pass the images correctly?

Thanks!

John6666 · April 10, 2025, 1:29pm

I don’t have much experience using TGI, so I’m just making a guess based on the code, but it seems that there is a trend for specialized code to be implemented for multimodal LLM. It doesn’t look like specialized code has been implemented for the Llama 3.2 Vision series. It’s possible that it is being treated as a simple LLM.

Another thing that generally concerns me is the behavior of the TGI image pre-processor. In the case of the Transformers library, even if you pass it a rather unreasonable image, the pre-processor will somehow manage it, but TGI may not be that kind.
In the case of the latter, it may be possible to avoid it by resizing it in advance.

In any case, it is possible that it is better to raise an issue on the TGI github.

github.com/huggingface/text-generation-inference

Potential Qwen/Qwen2-VL-7B-Instruct issue

opened 11:51AM - 26 Nov 24 UTC

maxjeblick

### System Info 2024-11-26T11:36:19.229621Z INFO text_generation_launcher: Run…time environment: Target: x86_64-unknown-linux-gnu Cargo version: 1.80.1 Commit sha: d2ed52f531cf8098ca62375248e007022eaadc65 Docker label: sha-d2ed52f nvidia-smi: Tue Nov 26 11:36:19 2024 +---------------------------------------------------------------------------------------+ | NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 | |-----------------------------------------+----------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+======================+======================| | 0 NVIDIA A100-SXM4-80GB On | 00000000:15:00.0 Off | 0 | | N/A 36C P0 96W / 400W | 2MiB / 81920MiB | 0% Default | | | | Disabled | +-----------------------------------------+----------------------+----------------------+ +---------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=======================================================================================| | No running processes found | +---------------------------------------------------------------------------------------+ xpu-smi: N/A ### Information - [X] Docker - [ ] The CLI directly ### Tasks - [ ] An officially supported command - [X] My own modifications ### Reproduction There may be an issue with `Qwen/Qwen2-VL-7B-Instruct` models, I appreciate any help. The model seems to output incomplete or wrong answer on several occasions. I included an example below, I also observed complete nonsensical answers on some other images. There may be some error on my code, but I couldn't locate the error. I also tried the inference with openai client which showed similar issues. 1. Start the official docker container 2. Within the docker container, run `/tgi-entrypoint.sh --model-id Qwen/Qwen2-VL-7B-Instruct --port 5990 --max-total-tokens 128000 --max-input-tokens 32768 --max-batch-prefill-tokens 32768` 3. Run the script attached 4. Model outputs `The supported GPUs are NVIDIA GPUs` When running the model locally (or using https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B), I obtain the following answer: `The supported GPUs are NVIDIA GPUs, AMD GPUs, Inferentia2, and Gaudi2.` (Note I also use the resized image `image_tgi.jpg` as below). **For step 3:** Download image from this repo as an example image: ``` wget https://camo.githubusercontent.com/865b15b83e926b08c3ce2ad186519ad520bce2241b89095edcf7416d2be91aba/68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f646f63756d656e746174696f6e2d696d616765732f7265736f6c76652f6d61696e2f5447492e706e67 ``` ``` from PIL import Image Image.open("68747470733a2f2f68756767696e67666163652e636f2f64617461736574732f68756767696e67666163652f646f63756d656e746174696f6e2d696d616765732f7265736f6c76652f6d61696e2f5447492e706e67" ).resize((512, 512)).convert('RGB').save("image_tgi.jpg") Image.open("image_tgi.jpg") import base64 def encode_image(image_path): with open(image_path, "rb") as image_file: return base64.b64encode(image_file.read()).decode('utf-8') from huggingface_hub import InferenceClient import base64 import requests import io URL = "your URL here" def get_response_inference_client(question, image_path): client = InferenceClient(URL) image = f"data:image/jpg;base64,{encode_image(image_path)}" prompt = f"![]({image}){question}\n\n" answer = client.text_generation(prompt, max_new_tokens=1000, stream=False) return answer get_response_inference_client("What GPUs are supported?", "image_tgi.jpg") ``` ### Expected behavior Model answer quality is the same as using the native (transformer) implementation.

github.com/huggingface/text-generation-inference

Does tgi support image resize for qwen2-vl pipeline?

opened 01:42PM - 16 Jan 25 UTC

AHEADer

### System Info I try to deploy a qwen2-vl fine-tuned model with tgi and vllm, …and I've found some results between these two frameworks are different. Seems that tgi consume more tokens compared to vLLM. I checked TGI's code and seems there miss the image resize logic? For Qwen2-VL pipeline, we will resize the image based on two args max_pixels and min_pixels. ### Information - [x] Docker - [ ] The CLI directly ### Tasks - [ ] An officially supported command - [ ] My own modifications ### Reproduction Deploy a Qwen2-VL-7B model on the inference endpoint, and upload a large image will trigger an error that the input tokens are larger than 32768 ``` ### Expected behavior The server will resize the image based on preprocessor_config.json(max_pixels and min_pixels) and make sure the image tokens will not be too many for a request.

aaac12345 · April 11, 2025, 7:44am

Hey there — really intrigued by what you’re working on.

We’re currently building a prototype for ecosystem-based AI training, derived from a framework we call PRISMA — it’s focused on semantic-symbolic exposure rather than brute-force token classification.

The idea is to train models not just to spot patterns, but to understand their symbolic function, resonance, and context. It’s cheaper, faster, and more stable under ambiguity — and it doesn’t fall apart when facing edge cases or emergent combinations.

We’d love to talk with you about it. It might give your project a new layer of depth — or even help form a better base layer for detection, interpretation, and resilience.

Let us know if you’re open to a quick sync.

– Alejandro & Clara
Resonant Systems Team

Topic		Replies	Views
Inference Api free rate limit Inference Endpoints on the Hub	0	1936	May 20, 2023
Inference API detailed request Beginners	5	2331	September 11, 2020
Multimodal LLM with Image and Text sequentially in its prompt 🤗Transformers	2	12663	January 1, 2024
Unexpected Output from Official Llama-3.2-11B-Vision-Instruct Example Code Models	11	86527	November 5, 2024
Image Captioning fine tuning 🤗Transformers	0	448	February 25, 2023

Cannot produce correct output for images Llama Guard 3 11B vision

Related topics