Image Embedding from PaliGemma Model

Hey there,

I want to extract image embeddings from an image.

import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma-3b-mix-224"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
processor = PaliGemmaProcessor.from_pretrained(model_id)

text = "Describe the scene."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG?download=true"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Preprocess inputs and run a forward pass
inputs = processor(text=text, images=image, return_tensors="pt").to(device)
outputs = model(**inputs)
embeddings = outputs.image_hidden_states
print(embeddings)

The value of image_hidden_states is None.
Can you please advise on this?

Hey! PaliGemma currently doesn’t return image embeddings in the forward call. I will add this to my todo list :slight_smile:

You can also obtain the image hidden states yourself: first run the image through the vision backbone and take the last hidden state, then pass it through the multimodal projector.

# pixel_values comes from the processor output, e.g. pixel_values = inputs["pixel_values"]
image_outputs = model.vision_tower(pixel_values)                       # SigLIP vision backbone
selected_image_feature = image_outputs.last_hidden_state               # (batch, num_patches, vision_hidden)
image_features = model.multi_modal_projector(selected_image_feature)   # project to the LM hidden size
image_features = image_features / (model.config.hidden_size**0.5)      # same scaling PaliGemma applies before merging
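
In case it helps, here is how that snippet ties together with the model and processor from the first post, wrapped in a small helper. This is just a sketch under those assumptions; the helper name and the use of processor.image_processor to get pixel_values directly are my own choices, not something required by the library.

import torch

def get_image_embeddings(image):
    # Assumes `model`, `processor`, and `device` from the first post.
    # Only the image path is needed here, so call the image processor directly.
    pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"]
    pixel_values = pixel_values.to(device, dtype=model.dtype)  # match the bfloat16 weights
    with torch.no_grad():
        image_outputs = model.vision_tower(pixel_values)
        image_features = model.multi_modal_projector(image_outputs.last_hidden_state)
        image_features = image_features / (model.config.hidden_size**0.5)
    return image_features  # (batch_size, num_patches, hidden_size), e.g. (1, 256, 2048)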

Hey Raushan,

Thanks for this information :grinning:
I am able to get the image embeddings based on your suggestion.
Can you also please advise me on how to get the multimodal and text embeddings from the PaliGemma model?
Currently I am thinking of using PaliGemma for multimodal image search.


Hey, do you have any update on how to get multimodal embeddings? Maybe it is via self.language_model.get_output_embeddings, which for Gemma is self.lm_head?

Sorry, I didn’t see the previous reply. Can you elaborate on what you mean by a multimodal embedding?

PaliGemma, in contrast to some other VLMs, was not trained for image-text search and thus does not have a dedicated multimodal embedding layer. Rather, it has separate image and text embeddings, which are concatenated and fed into the LLM. For image-text search among VLMs, I know BLIP has similar functionality, but you might also prefer a more classic approach like CLIP or FLAVA.
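
If it’s the separate text embeddings you’re after, a minimal sketch along these lines should work (assuming the model, processor, and device from the first post; the prompt string is just an example):

import torch

text_inputs = processor.tokenizer("a photo of a bee", return_tensors="pt").to(device)
embed_layer = model.get_input_embeddings()  # the language model's token embedding layer
with torch.no_grad():
    # Raw (non-contextualized) token embeddings, the same ones that get concatenated
    # with the projected image features before going through the LLM.
    text_embeds = embed_layer(text_inputs["input_ids"])  # (batch_size, seq_len, hidden_size)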

Sorry for the confusion. I meant a single sentence embedding that encodes both the image and the text at the end of the text decoder. Now that I check the architecture in detail, as you mention, there is no embedding for that case.

Based on the code you provided, I have a few questions:

  1. Why do you normalize using (model.config.hidden_size**0.5)?
  2. The embeddings are shaped (batch_size, 256, 2048). If I wanted a single representation per image, i.e. (batch_size, 2048), would you take the mean pooling or the contextualized embedding of the last image token? (Maybe you have another suggestion to reduce the dimensionality.)

  1. The normalization is done because the PaliGemma model internally applies the same scaling before merging the image and text embeddings; the code snippet is simply copied from the library codebase.
  2. Yes, you will need to pool, since the ViT backbone returns a sequence of (image_size / patch_size)² patch embeddings per image. You can try mean pooling or any other method, but note that changing the model architecture means the model would need to be retrained, and as far as I know there are no available weights for PaliGemma image-text retrieval. That is why I suggested, as an easier option, taking the pre-trained weights of another model and fine-tuning them on your own domain, which will be less resource-consuming.
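
For the mean-pooling option specifically, a minimal sketch (assuming image_features of shape (batch_size, 256, 2048) from the earlier snippet; keep in mind these weights were not trained for retrieval, so treat the result accordingly):

# Average over the 256 patch positions -> one vector per image
pooled = image_features.mean(dim=1)                  # (batch_size, 2048)
# Optional: L2-normalize before cosine-similarity search
pooled = pooled / pooled.norm(dim=-1, keepdim=True)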