Similarity search with combined image and text?

How can I do similarity matching by combining both image and text?

Let's say:

Product1 = Image1, Text1
Product2 = Image2, Text2

I want to do contrastive learning by combining both the image and text.

Is there such a model? Can anyone please suggest one?


The SentenceTransformer library can encode images and text into a single vector space. You could combine both to create a new vector space for products, and then implement contrastive learning on that space.

See sentence-transformers/Image_Search.ipynb at master · UKPLab/sentence-transformers · GitHub

Like in the notebook referenced by @raphaelmerx, I also used a pre-trained CLIP model to embed images and text in the same vector space, so you can perform semantic search: Weights & Biases.

@raphaelmerx Do you have sample code for contrastive learning using SentenceTransformer?

@raphaelmerx I understand the idea of combining text and image into a single vector space and then implementing contrastive learning.

But are you aware of an open-source implementation of contrastive learning, or code that I could adapt for this purpose?

@raphaelmerx In the given example, you have shown model.encode to encode images and text. Do you have an example of how to apply that to contrastive learning?

I don’t have any code sample for contrastive learning, no.