Using pre-embedded images in a VLM

My understanding is that a typical VLM architecture involves two main components: an image encoder and a text decoder (in my case, Llama 3.2 Vision). I’ve been using the model as follows:

from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model = MllamaForConditionalGeneration.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(checkpoint)

image = Image.open("…/…/…/Data/CNN_Data/WA/W_IMG_8021.jpg").convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If the image is a screenshot of an app, identify the app."}
    ]}
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0]))

However, I have many images to process and the text prompts vary. Is it possible to run the image encoder once per image and store its outputs (e.g. the last hidden state / dynamic embedding) so the image tensors can be reused? That would avoid re-running the vision encoder on the same image every time the text prompt changes. Is this possible?
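
To make the idea concrete, here is a minimal sketch of the caching step I have in mind. It assumes the current transformers Mllama implementation, where the vision tower is exposed as model.vision_model (attribute names may differ across versions), and that the processor output above contains pixel_values, aspect_ratio_ids and aspect_ratio_mask; the output filename is just a placeholder:

# Reuse the `inputs` dict produced by the processor above; only the
# image-related tensors are passed through the vision tower here.
with torch.no_grad():
    vision_outputs = model.vision_model(
        pixel_values=inputs["pixel_values"],
        aspect_ratio_ids=inputs["aspect_ratio_ids"],
        aspect_ratio_mask=inputs["aspect_ratio_mask"],
    )

# Last hidden state of the image encoder, cached so the encoder does not
# have to run again for this image when the prompt changes.
image_states = vision_outputs[0]
torch.save(image_states.cpu(), "W_IMG_8021_vision_states.pt")

The part I’m unsure about is feeding this back in: as far as I can tell, the model’s forward pass projects this tensor with multi_modal_projector and hands it to the decoder as cross-attention states, so ideally later calls could skip pixel_values entirely and supply the stored tensor instead.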
