Image Embedding from PaliGemma Model

Hey there,

I want to extract image embeddings from an image.

import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma-3b-mix-224"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
processor = PaliGemmaProcessor.from_pretrained(model_id)

text = "Describe the scene."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG?download=true"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Preprocess inputs and run a forward pass
inputs = processor(text=text, images=image, return_tensors="pt").to(device)
outputs = model(**inputs)
embeddings = outputs.image_hidden_states
print(embeddings)

The value of image_hidden_states is None.
Can you please advise on this?

Hey! PaliGemma currently doesn’t return image embeddings in the forward call. I will add this to my todo list :slight_smile:

You can also obtain the image hidden states yourself: first run the image through the vision backbone and take the last hidden state, then pass it through the multimodal projector.

# pixel_values comes from the processor output, e.g. pixel_values = inputs["pixel_values"]
image_outputs = model.vision_tower(pixel_values)                       # SigLIP vision backbone
selected_image_feature = image_outputs.last_hidden_state               # (batch, num_patches, vision_hidden)
image_features = model.multi_modal_projector(selected_image_feature)   # project to the LM hidden size
image_features = image_features / (model.config.hidden_size**0.5)      # same scaling PaliGemma applies before merging
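
In case it helps, here is how that snippet ties together with the model and processor from the first post, wrapped in a small helper. This is just a sketch under those assumptions; the helper name and the use of processor.image_processor to get pixel_values directly are my own choices, not something required by the library.

import torch

def get_image_embeddings(image):
    # Assumes `model`, `processor`, and `device` from the first post.
    # Only the image path is needed here, so call the image processor directly.
    pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"]
    pixel_values = pixel_values.to(device, dtype=model.dtype)  # match the bfloat16 weights
    with torch.no_grad():
        image_outputs = model.vision_tower(pixel_values)
        image_features = model.multi_modal_projector(image_outputs.last_hidden_state)
        image_features = image_features / (model.config.hidden_size**0.5)
    return image_features  # (batch_size, num_patches, hidden_size), e.g. (1, 256, 2048)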

Hey Raushan,

Thanks for this information :grinning:
I am able to get the image embeddings based on your suggestion.
Can you also please advise me on how to get the multimodal and text embeddings from the PaliGemma model?
Currently I am thinking of using PaliGemma for multimodal image search.


Hey, do you have any update on how to get multimodal embeddings? Maybe it is via self.language_model.get_output_embeddings, which for Gemma is self.lm_head?

Sorry, I didn’t see the previous reply. Can you elaborate on what you mean by a multimodal embedding?

PaliGemma, in contrast to some other VLMs, was not trained for image-text search and thus does not have a dedicated multimodal embedding layer. Rather, it has separate image and text embeddings, which are concatenated and fed into the LLM. For image-text search among VLMs, I know BLIP has similar functionality, but you might also prefer a more classic approach like CLIP or FLAVA.
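
If it’s the separate text embeddings you’re after, a minimal sketch along these lines should work (assuming the model, processor, and device from the first post; the prompt string is just an example):

import torch

text_inputs = processor.tokenizer("a photo of a bee", return_tensors="pt").to(device)
embed_layer = model.get_input_embeddings()  # the language model's token embedding layer
with torch.no_grad():
    # Raw (non-contextualized) token embeddings, the same ones that get concatenated
    # with the projected image features before going through the LLM.
    text_embeds = embed_layer(text_inputs["input_ids"])  # (batch_size, seq_len, hidden_size)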

Sorry for the confusion. I meant a single sentence embedding that encodes both the image and the text at the end of the text decoder. Now that I check the architecture in detail, as you mention, there is no embedding for that case.

Based on the code you provided, I have a few questions:

  1. Why do you normalize using (model.config.hidden_size**0.5)?
  2. The embeddings are shaped (batch_size, 256, 2048). If I wanted a single representation per image, i.e. (batch_size, 2048), would you take the mean pooling or the contextualized embedding of the last image token? (Maybe you have another suggestion to reduce the dimensionality.)

  1. The normalization is done because the PaliGemma model internally applies the same scaling before merging the image and text embeddings; the code snippet is simply copied from the library codebase.
  2. Yes, you will need to pool, since the ViT backbone returns a sequence of (image_size / patch_size)² patch embeddings per image. You can try mean pooling or any other method, but note that changing the model architecture means the model would need to be retrained, and as far as I know there are no available weights for PaliGemma image-text retrieval. That is why I suggested, as an easier option, taking the pre-trained weights of another model and fine-tuning them on your own domain, which will be less resource-consuming.
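
For the mean-pooling option specifically, a minimal sketch (assuming image_features of shape (batch_size, 256, 2048) from the earlier snippet; keep in mind these weights were not trained for retrieval, so treat the result accordingly):

# Average over the 256 patch positions -> one vector per image
pooled = image_features.mean(dim=1)                  # (batch_size, 2048)
# Optional: L2-normalize before cosine-similarity search
pooled = pooled / pooled.norm(dim=-1, keepdim=True)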