I am exploring ways to use SentenceTransformers for contrastive learning. I need to build a similarity-matching network that combines both image and textual features.
Essentially, I want to build a contrastive learning model that considers both image and text features. For this, I have the following steps in mind:
1. Use SentenceTransformer to encode images and text into a single vector space.
2. Combine both using SentenceTransformer to create a new vector space.
3. Implement contrastive learning on this vector space.
It would be awesome if you could please provide me with your thoughts. Any help with the API or code usage examples would be greatly appreciated.
Technically, steps 1 and 3 look good to me, but I don’t think step 2 is necessary, since the images and text are already embedded in the same vector space if you compute the embeddings with any of the CLIP models (Pretrained Models — Sentence-Transformers documentation).
As far as I can see, SentenceTransformer can give me a vector for an image and a vector for a text, but not a combined one.
from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer('clip-ViT-B-16')

# Encode an image:
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
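If I understand the API correctly, I can compare across modalities with util.cos_sim, since both embeddings live in the same CLIP space:

from sentence_transformers import util

# Cosine similarity between the image embedding and each text embedding
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)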
But if I want to combine them, step 2 is still needed. It does not look like I can get a joint embedding of the image and text together.
@NimaBoscarino to clarify, I have (image, text) pairs. Given two (image, text) pairs, I have to determine semantic similarity, i.e., whether there is a high cosine similarity between the two (image, text) pairs.
Oooh I see, sorry, I think I’d misunderstood the original question. I hadn’t realized that these would all be (text, image) pairs, and for some reason thought you had individual texts and images in your dataset. In that case you’re right that you would need to join them somehow, and my intuition is that you could concatenate the text and image embeddings? I’m not aware of a better way to merge two embeddings.
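Here’s a rough sketch of what I mean (the file names and texts are made-up examples, and plain concatenation is only my guess at an aggregation):

import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-16')

def encode_pair(image_path, text):
    # Encode image and text with the same CLIP model, then concatenate
    # the two vectors into one joint representation for the pair.
    img_emb = model.encode(Image.open(image_path))
    txt_emb = model.encode(text)
    return np.concatenate([img_emb, txt_emb])

# Hypothetical example pairs
pair_a = encode_pair('two_dogs_in_snow.jpg', 'Two dogs in the snow')
pair_b = encode_pair('cat_on_table.jpg', 'A cat on a table')

# Cosine similarity between the two (image, text) pairs
similarity = util.cos_sim(pair_a, pair_b)
print(similarity)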
I’m not quite sure how you would load the training data though, as I think InputExample only takes in text. I’ll have to do some more digging. I know that there was a tutorial for fine-tuning CLIP models with SentenceTransformers that was being worked on, but I don’t think it was ever released. I’ll try to get that for you ASAP (it will have to be next week), since that might help. This is also quite a bit out of my area of expertise, so I’ll check with the team to fact-check me on this.
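For reference, this is roughly how InputExample is used for text-only training (a minimal sketch with a plain text model and made-up examples), which is why I’m not sure where the image side would fit:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')  # text-only model, just for illustration

# InputExample only has a slot for texts (plus an optional label),
# so it's not obvious where an image would go.
train_examples = [
    InputExample(texts=['Two dogs in the snow', 'Dogs playing outside'], label=0.9),
    InputExample(texts=['A cat on a table', 'A picture of London at night'], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)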
I think you’re dealing with a really interesting problem, and I promise we’ll get it to work!
The model you used here is already a trained CLIP model and therefore a contrastive approach by itself. SentenceTransformers alone can only give representations for text ("sentences").
I used SentenceTransformers in a CLIP-based model where one encoder used frozen representations of the text, computed by SentenceTransformer. I took this path because I had much longer texts than the ones used to train the original CLIP model, which were basically just image captions.
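Schematically, the text side looked roughly like this (just a sketch; the model name, dimensions, and the image encoder are placeholders):

import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Any SentenceTransformer model that handles your longer texts; this name is just an example.
text_model = SentenceTransformer('all-mpnet-base-v2')

class FrozenTextEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # The SentenceTransformer stays frozen; only this projection is trained.
        self.proj = nn.Linear(text_model.get_sentence_embedding_dimension(), embed_dim)

    def forward(self, texts):
        with torch.no_grad():
            emb = text_model.encode(texts, convert_to_tensor=True)
        return self.proj(emb.to(self.proj.weight.device))

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    # Symmetric InfoNCE over the in-batch (text, image) pairs, as in CLIP.
    # image_emb would come from a separate vision encoder projected to embed_dim.
    text_emb = nn.functional.normalize(text_emb, dim=-1)
    image_emb = nn.functional.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (nn.functional.cross_entropy(logits, labels) +
            nn.functional.cross_entropy(logits.t(), labels)) / 2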