Hey! Paligemma currently doesn’t return image embeddings in the forward call. I’ll add this to my to-do list.
You can also obtain the image hidden states by running the following: first run the image through the vision backbone and take the last hidden state, then pass it through the multi-modal projector.
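A rough sketch of what I mean, using transformers (the attribute names vision_tower and multi_modal_projector come from the current PaliGemma implementation and may change between versions; the checkpoint id and image path below are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

image = Image.open("example.jpg")  # placeholder image path
pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    # 1) run the image through the SigLIP vision backbone
    vision_outputs = model.vision_tower(pixel_values)
    last_hidden = vision_outputs.last_hidden_state            # (batch, num_patches, vision_hidden)
    # 2) project into the language-model embedding space
    image_embeds = model.multi_modal_projector(last_hidden)   # (batch, 256, 2048) for the 224px model
    # 3) scale the same way the model does before merging with text embeddings
    #    (newer transformers versions read this from model.config.text_config.hidden_size)
    image_embeds = image_embeds / (model.config.hidden_size ** 0.5)
```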
Thanks for this information. I was able to get the image embedding based on your suggestion.
Can you also please advise me on how to get the multimodal and text embeddings from the Paligemma model?
I’m currently thinking of using Paligemma for multimodal image search.
Hey, do you have any update on how to get multimodal embeddings? Maybe it is via self.language_model.get_output_embeddings, which for Gemma is self.lm_head?
Sorry, I didn’t see the previous reply. Can you elaborate on what you mean by a multimodal embedding?
Paligemma, in contrast to some other VLMs, was not trained for image-text search and thus does not have a dedicated multimodal embedding layer. Rather, it has separate image and text embeddings, which are concatenated and fed into the LLM. For image-text search among VLMs, I know that BLIP had similar functionality, but you might also prefer a more classic approach like CLIP or Flava.
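For completeness, the separate text-token embeddings can be pulled straight from the language model’s embedding table, e.g. (a minimal sketch; the checkpoint id and example text are placeholders):

```python
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

text_inputs = processor.tokenizer("a photo of a cat", return_tensors="pt")  # placeholder text

with torch.no_grad():
    # token embeddings from the Gemma input embedding table
    text_embeds = model.get_input_embeddings()(text_inputs["input_ids"])  # (batch, seq_len, 2048)
```

These are the embeddings the model concatenates with the projected image embeddings internally; on their own they are not trained for retrieval.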
Sorry for the confusion. I meant a single sentence embedding that encodes both the image and the text at the end of the text decoder. Now that I check the architecture in detail, as you mention, there is no embedding for such a case.
Based on the code you provided I have a few questions:
Why do you normalize using (model.config.hidden_size**0.5)?
The embeddings are shaped (batch_size, 256, 2048). If I wanted a single representation per image, i.e. (batch_size, 2048), would you take the mean pooling or the contextualized embedding of the last image token? (Or maybe you have another suggestion for reducing the dimensionality.)
The normalization is done because the Paligemma model internally does the same thing before merging the image and text embeddings. The code snippet is simply copied from the library codebase.
Yes, since the ViT backbone returns a sequence of (image_size / patch_size)² patch tokens per image (256 for the 224-pixel model). You can try mean pooling (see the sketch below) or any other method, but note that changing the model architecture means the model will need to be trained. As far as I know, there are no available weights for Paligemma image-text retrieval. That is why I suggested the easier option: take pre-trained weights from another model and fine-tune on your own domain, which will be easier and less resource-consuming.
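If you still want to prototype with an untrained pooled vector, mean pooling over the patch dimension is straightforward (a sketch only; the resulting representation has not been trained for retrieval):

```python
import torch
import torch.nn.functional as F

# image_embeds: (batch, 256, 2048), taken from the earlier snippet in this thread
pooled = image_embeds.mean(dim=1)      # (batch, 2048), one vector per image
pooled = F.normalize(pooled, dim=-1)   # optional L2 norm for cosine-similarity search
```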
Hi! I want to know how to get both text and image embeddings. I tried the code shown in the first and second messages, but I can’t run it. I also tried to get them using another VL model, so if you have any tips for doing these kinds of things, please tell me!