I’m currently working on a project where I’m trying to generate text with the meta-llama/Llama-3.2-11B-Vision-Instruct model. However, I’m only interested in processing text, not images. Despite this, when I pass in a text input, I receive the error “Invalid input type. Must be a single image, a list of images, or a list of batches of images.”
I’m using AutoProcessor to handle inputs, but since I only want to process text, I suspect the processor is expecting an image input due to the model’s multi-modal nature.
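For reference, here is a minimal sketch of the kind of call that raises this error, assuming the text ends up in the processor's first positional argument (which, as far as I can tell, is reserved for images). The prompt string is just a placeholder:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# The first positional argument of the Mllama processor is `images`,
# so a bare string passed there is parsed as image input and rejected:
inputs = processor("Hi how are you?", return_tensors="pt")
# -> ValueError: Invalid input type. Must be a single image, a list of
#    images, or a list of batches of images.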
How can I configure the meta-llama/Llama-3.2-11B-Vision-Instruct model to process only text without throwing the image-related error? Is there a different recommended way to handle text-only inputs with this model?
Thanks in advance for your help.
The reason is probably the pipeline tag set in the repo’s metadata. There are several copies of this repo on the Hub, and some of them don’t have that tag, so it may be worth trying one of those:
unsloth/Llama-3.2-11B-Vision-Instruct
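One way to check is to compare the pipeline tags directly. Here is a small sketch using huggingface_hub (note that the gated meta-llama repo may require an access token):

from huggingface_hub import model_info

# Compare the pipeline tag on the original repo and on the copy above;
# the gated meta-llama repo may need a logged-in token to query.
for repo_id in ["meta-llama/Llama-3.2-11B-Vision-Instruct",
                "unsloth/Llama-3.2-11B-Vision-Instruct"]:
    print(repo_id, "->", model_info(repo_id).pipeline_tag)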
It worked when I did it this way:
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Text-only chat message: no {"type": "image"} entry in the content list.
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Hi how are you?"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Pass None for the images argument (the first positional parameter);
# the chat template has already added the special tokens, hence
# add_special_tokens=False.
inputs = processor(
    None,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
You basically pass None where the image object would normally go in processor(...), and remove {"type": "image"} from the content list of the message.
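Equivalently, as a small variation on the same idea, you can pass the text by keyword so that nothing lands in the images slot at all:

# Same effect as passing None positionally for images.
inputs = processor(
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)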