Issue with Using Llama 3.2 11B Vision Instruct Model for Text-Only Input: Invalid Input Type Error

I’m currently working on a project where I’m trying to generate text with the meta-llama/Llama-3.2-11B-Vision-Instruct model. However, I’m only interested in processing text, not images. Despite this, when I pass in a text-only input, I receive the error: “Invalid input type. Must be a single image, a list of images, or a list of batches of images.”

I’m using AutoProcessor to handle inputs, but since I only want to process text, I suspect the processor is expecting an image input due to the model’s multi-modal nature.
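
For reference, here is roughly what I’m doing (a simplified sketch; my actual prompt is longer):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# Passing only text to the processor raises:
# "Invalid input type. Must be a single image, a list of images,
#  or a list of batches of images."
inputs = processor("Hi how are you?", return_tensors="pt")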

How can I configure the meta-llama/Llama-3.2-11B-Vision-Instruct model to process only text without throwing the image-related error?

Is there a different recommended way to handle text-only inputs with this model?

Thanks in advance for your help.


The cause is probably the pipeline tag set in the repo’s configuration. Several copies of this model exist, and some of them don’t have that tag, so it’s worth trying one of those, e.g.:
unsloth/Llama-3.2-11B-Vision-Instruct
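
If you want to test that hypothesis, the mirror should be a drop-in replacement (untested sketch; only the repo id changes):

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "unsloth/Llama-3.2-11B-Vision-Instruct"  # mirror without the pipeline tag
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)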

It worked when I did it this way:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Text-only message: no {"type": "image"} entry in the content list
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Hi how are you?"}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Pass None in the images slot so the processor skips image preprocessing
inputs = processor(
    None,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))

You basically pass None where the image object would go in processor(...), and remove {"type": "image"} from the content list.
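
Alternatively, since no images are involved, you can bypass the image path entirely and tokenize the templated prompt with the processor’s underlying tokenizer (a sketch, assuming the standard Mllama processor layout; it reuses model, processor, and messages from above):

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
# processor.tokenizer is the plain text tokenizer bundled with the processor
inputs = processor.tokenizer(
    input_text,
    add_special_tokens=False,
    return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))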
