Hey there,
I want to extract image embeddings from an image.
from io import BytesIO
import requests
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
model_id = "google/paligemma-3b-mix-224"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
processor = PaliGemmaProcessor.from_pretrained(model_id)
text = "Describe the scene."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG?download=true"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
# Preprocess inputs
inputs = processor(text=text, images=image, return_tensors="pt").to(device)
# Run the forward pass and read the image hidden states from the output
outputs = model(**inputs)
embeddings = outputs.image_hidden_states
print(embeddings)
The value of image_hidden_states is None.
Can you please advise on this?
Hey! PaliGemma currently doesn’t return image embeddings from the forward call. I will add this to my to-do list.
In the meantime, you can obtain the image hidden states by running the following: first run the pixel values through the vision backbone and take the last hidden state, then pass it through the multimodal projector.
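# Take the pixel values from the processor output and match the model's dtype
pixel_values = inputs["pixel_values"].to(model.dtype)
image_outputs = model.vision_tower(pixel_values)
selected_image_feature = image_outputs.last_hidden_state
image_features = model.multi_modal_projector(selected_image_feature)
image_features = image_features / (model.config.hidden_size**0.5)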
Hey Raushan,
Thanks for this information.
I was able to get the image embeddings based on your suggestion.
Can you also please advise how to get the multimodal and text embeddings from the PaliGemma model?
I am currently thinking of using PaliGemma for multimodal image search.
Hey, do you have any update on how to get multimodal embeddings? Maybe it is via self.language_model.get_output_embeddings, which for Gemma is self.lm_head?
Sorry, I didn’t see the previous reply. Can you elaborate on what you mean by multimodal embedding?
PaliGemma, in contrast to some other VLMs, was not trained for image-text search and thus does not have a dedicated multimodal embedding layer. Rather, it has separate image and text embeddings, which are usually concatenated and fed into the LLM. For image-text search, among VLMs I know that BLIP had similar functionality, but you might also prefer a more classic approach like CLIP or FLAVA.
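To make "separate image and text embeddings that are concatenated" concrete, here is a minimal sketch with toy sizes. Everything in it is a stand-in: the random image tensor, the nn.Embedding table, and the sizes are assumptions for illustration, not PaliGemma's actual weights or its exact merging code. It also assumes the image tokens come first in the sequence, followed by the text tokens.

```python
import torch

torch.manual_seed(0)

# Toy sizes chosen only for illustration (all values are assumptions).
batch_size, num_image_tokens, num_text_tokens = 2, 4, 3
hidden_size, vocab_size = 8, 50

# Stand-in for the projected image features (vision tower + projector output).
image_embeds = torch.randn(batch_size, num_image_tokens, hidden_size)

# Stand-in for the language model's token embedding table.
token_embedding = torch.nn.Embedding(vocab_size, hidden_size)
input_ids = torch.randint(0, vocab_size, (batch_size, num_text_tokens))
text_embeds = token_embedding(input_ids)

# Concatenate along the sequence dimension: image tokens first, then text tokens.
llm_inputs = torch.cat([image_embeds, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 7, 8])
```

The result is one sequence per sample that the LLM consumes token by token, so there is no single pooled vector representing the image-text pair unless you build one yourself.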
Sorry for the confusion. I meant a single sentence embedding, taken at the end of the text decoder, that encodes both the image and the text. Now that I have checked the architecture in detail, I see that, as you mention, there is no embedding for that case.
Based on the code you provided I have a few questions:
- Why do you normalize using (model.config.hidden_size**0.5)?
- The embeddings are shaped (batch size, 256, 2048). If I wanted a single representation per image, shaped (batch size, 2048), would you use mean pooling or the contextualised embedding of the last image token? (Maybe you have another suggestion to reduce the dimensionality.)
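For reference, the two pooling options from the second question could look like the sketch below. It uses a random tensor as a stand-in for the real image features (an assumption, so the numbers mean nothing), and it only shows the shapes involved; the thread does not settle which option works better for search.

```python
import torch

torch.manual_seed(0)

# Random stand-in for the projected image features, shaped (batch size, 256, 2048).
image_features = torch.randn(2, 256, 2048)

# Option 1: mean pooling over the 256 image tokens.
mean_pooled = image_features.mean(dim=1)

# Option 2: keep only the last image token.
last_token = image_features[:, -1, :]

print(mean_pooled.shape, last_token.shape)
```

Both give one vector of size 2048 per image. Since the vision tower is a non-causal ViT, the last token is just one patch among 256 rather than a running summary, so mean pooling is probably the safer starting point, but that is a guess worth checking on your own data.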