BLIP: How to combine embeddings for multimodal search?


I am currently working on a project for retrieving similar images via text or images.

I am using BLIP for the embeddings, and this works well. I embedded all my images into a DB, and when searching I embed the query (either a text or an image) into the same space and use cosine similarity. This approach works well and is easy.
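For reference, the single-modality search part looks roughly like this (a minimal numpy sketch; the BLIP embedding step itself is omitted, and the function names are just illustrative):

```python
import numpy as np

def build_index(embeddings):
    """Stack and L2-normalize the DB embeddings so cosine similarity becomes a dot product."""
    mat = np.stack(embeddings).astype(np.float32)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def search(index, query, top_k=5):
    """Return (indices, scores) of the top_k entries most similar to the query embedding."""
    q = query / np.linalg.norm(query)
    scores = index @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```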

Now I want to look into combining image and text search: for example, a query image of a white t-shirt plus the text ‘green’ should retrieve images of green t-shirts.

I tried using the `text_encoder` with the image embeds and attention mask like this:
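(Reconstructed sketch of what I did; the checkpoint name is the one I used, and projecting the fused [CLS] state through `text_proj` so it lands in the same ITC space as my DB embeddings was my own choice:)

```python
def fuse_image_text(image, text, model_name="Salesforce/blip-itm-base-coco"):
    """Feed the image embeds into BLIP's text encoder via cross-attention.

    Returns one L2-normalized vector intended to live in the shared ITC space.
    """
    # Imports kept inside the function so the sketch can be read without transformers installed.
    import torch
    from transformers import BlipProcessor, BlipForImageTextRetrieval

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForImageTextRetrieval.from_pretrained(model_name)

    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        image_embeds = model.vision_model(
            pixel_values=inputs["pixel_values"]
        ).last_hidden_state
        image_atts = torch.ones(image_embeds.shape[:-1], dtype=torch.long)
        out = model.text_encoder(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            encoder_hidden_states=image_embeds,      # text cross-attends over image patches
            encoder_attention_mask=image_atts,
        )
        # Project the [CLS] state into the ITC space, as text_proj does for text-only queries.
        fused = model.text_proj(out.last_hidden_state[:, 0, :])
    return torch.nn.functional.normalize(fused, dim=-1)
```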


I then used the output embedding as the search vector, but this approach did not give good results.

Now I am quite unsure how to combine/fuse the embeddings properly. Both embeddings are in the same space, so I do not have to ‘align’ them; only a combination is necessary.
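Concretely, the simplest kind of combination I mean looks like this (numpy sketch; the text weight `alpha` is just an untuned knob, not something with a principled value):

```python
import numpy as np

def fuse(image_emb, text_emb, alpha=0.5):
    """Weighted fusion: L2-normalize each embedding, blend, then renormalize.

    alpha=0 returns the (normalized) image embedding, alpha=1 the text one,
    so the result always stays on the unit sphere used for cosine search.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    fused = (1 - alpha) * img + alpha * txt
    return fused / np.linalg.norm(fused)
```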

I tried basic methods (addition, average, taking max/min, etc.) with minimal success.
It would be great if there were a way to fuse the embeddings properly. It would also help if I could add more embeddings (beyond image and text) later on.
I also thought of creating an FC layer that takes both embeddings as input, but the problem is that I do not really have training data for this and am not sure how to properly create it.
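The shape of what I have in mind is roughly the following (numpy sketch; the parameters here are random placeholders, which is exactly the problem, since without training data there is nothing sensible to fit them to):

```python
import numpy as np

def fc_fuse(image_emb, text_emb, weights, bias):
    """Single FC fusion layer: concatenate both embeddings, project back to embed_dim."""
    x = np.concatenate([image_emb, text_emb])
    fused = weights @ x + bias
    return fused / np.linalg.norm(fused)

# Untrained placeholder parameters for a hypothetical 256-dim embedding space.
d = 256
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(d, 2 * d))
b = np.zeros(d)
```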

Do you have any ideas on a suitable approach?

@ybelkada (I thought I'd tag you because of your comment here: Can BlipForImageTextRetrieval be used to generate captions? · Issue #25189 · huggingface/transformers · GitHub)