Issue with Using Llama 3.2 11B Vision Instruct Model for Text-Only Input: Invalid Input Type Error

I’m currently working on a project where I’m trying to generate text with the meta-llama/Llama-3.2-11B-Vision-Instruct model. However, I’m only interested in processing text, not images. Despite this, when I pass in a text-only input, I receive the error: “Invalid input type. Must be a single image, a list of images, or a list of batches of images.”

I’m using AutoProcessor to handle inputs, but since I only want to process text, I suspect the processor is expecting an image input due to the model’s multi-modal nature.
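
For reference, here is roughly what I’m doing (a simplified sketch; my actual prompt is longer):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# Passing only text to the processor raises:
# "Invalid input type. Must be a single image, a list of images,
#  or a list of batches of images."
inputs = processor("Hi how are you?", return_tensors="pt")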

How can I configure the meta-llama/Llama-3.2-11B-Vision-Instruct model to process only text without throwing the image-related error?

Is there a different recommended way to handle text-only inputs with this model?

Thanks in advance for your help.


The cause is probably the pipeline tag set in the repo’s configuration. Several copies of this model exist, and some of them don’t have that tag, so it’s worth trying one of those, e.g.:
unsloth/Llama-3.2-11B-Vision-Instruct
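
If you want to test that hypothesis, the mirror should be a drop-in replacement (untested sketch; only the repo id changes):

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "unsloth/Llama-3.2-11B-Vision-Instruct"  # mirror without the pipeline tag
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)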

It worked when I did it this way:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Text-only message: no {"type": "image"} entry in the content list
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Hi how are you?"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Pass None in the images slot so the processor skips image preprocessing
inputs = processor(
    None,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))

You basically pass None where the image object would go in processor(...), and remove {"type": "image"} from the content list.
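
Alternatively, since no images are involved, you can bypass the image path entirely and tokenize the templated prompt with the processor’s underlying tokenizer (a sketch, assuming the standard Mllama processor layout; it reuses model, processor, and messages from above):

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# processor.tokenizer is the plain text tokenizer bundled with the processor
inputs = processor.tokenizer(
    input_text,
    add_special_tokens=False,
    return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))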
