Using pre-embedded images in a VLM

My understanding is that a typical VLM architecture involves two main components: an image encoder and a text decoder (in my case, Llama 3.2 Vision). I’ve been using the model as follows:

from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model = MllamaForConditionalGeneration.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(checkpoint)

image = Image.open("…/…/…/Data/CNN_Data/WA/W_IMG_8021.jpg").convert("RGB")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If the image is a screenshot of an app, identify the app."}
    ]}
]

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0]))

However, I have many images to process and the text prompts vary. Is it possible to run the image encoder once per image and store its outputs (e.g. the last hidden state / dynamic embedding) so the image tensors can be reused? That would avoid re-running the vision encoder on the same image every time the text prompt changes. Is this possible?
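
To make the idea concrete, here is a minimal sketch of the caching step I have in mind. It assumes the current transformers Mllama implementation, where the vision tower is exposed as model.vision_model (attribute names may differ across versions), and that the processor output above contains pixel_values, aspect_ratio_ids and aspect_ratio_mask; the output filename is just a placeholder:

# Reuse the `inputs` dict produced by the processor above; only the
# image-related tensors are passed through the vision tower here.
with torch.no_grad():
    vision_outputs = model.vision_model(
        pixel_values=inputs["pixel_values"],
        aspect_ratio_ids=inputs["aspect_ratio_ids"],
        aspect_ratio_mask=inputs["aspect_ratio_mask"],
    )

# Last hidden state of the image encoder, cached so the encoder does not
# have to run again for this image when the prompt changes.
image_states = vision_outputs[0]
torch.save(image_states.cpu(), "W_IMG_8021_vision_states.pt")

The part I’m unsure about is feeding this back in: as far as I can tell, the model’s forward pass projects this tensor with multi_modal_projector and hands it to the decoder as cross-attention states, so ideally later calls could skip pixel_values entirely and supply the stored tensor instead.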
