How to use SentenceTransformers for contrastive learning?

I am exploring ways to use SentenceTransformers for contrastive learning. I need to build a similarity-matching network that combines image and text features.

Essentially, I have to build a contrastive learning model that considers both image and text features. I have the following steps in mind:

  • Use SentenceTransformer to encode images and text into a single vector space.
  • Combine the two embeddings, again using SentenceTransformer, to create a new vector space.
  • Implement contrastive learning on this new vector space.

It would be awesome if you could please provide me with your thoughts. Any help with the API or code usage examples would be greatly appreciated.

Technically the 1st and 3rd steps look good to me, but I don’t think step #2 is necessary, since the images and text would already be embedded in the same vector space if you compute the embeddings with any of the CLIP models (Pretrained Models — Sentence-Transformers documentation).

If you want to train a model from scratch though, you could follow the example in Semantic Textual Similarity (Semantic Textual Similarity — Sentence-Transformers documentation), specifically sentence-transformers/ at master · UKPLab/sentence-transformers · GitHub with some modifications.

Mainly you’d use a transformers CLIP model for the embedding model (like models.CLIPModel('openai/clip-vit-base-patch32')), and you’d also use MultipleNegativesRankingLoss (Losses — Sentence-Transformers documentation). Or you could do something similar to this (sentence-transformers/ at master · UKPLab/sentence-transformers · GitHub) to continue training a pretrained model, if you already have one in mind.
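For intuition, here is a minimal sketch of the in-batch negatives idea behind that loss, written in plain Python rather than taken from the library's implementation. It assumes cosine similarity multiplied by a scale factor (20.0 here, which I believe matches the library default, but treat it as an assumption). The function name and the toy two-dimensional "embeddings" are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over this row of logits.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy placeholder vectors standing in for image and text embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]

aligned = in_batch_negatives_loss(image_embs, text_embs)
shuffled = in_batch_negatives_loss(image_embs, text_embs[::-1])
```

The loss is small when each anchor is closest to its own positive (aligned) and large when the pairing is wrong (shuffled), which is what pulls matching pairs together during training.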

As far as I can see, SentenceTransformer can give a vector for an image and a vector for text, but not a combined one.

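from PIL import Image
from sentence_transformers import SentenceTransformer

# Load a pretrained CLIP model
model = SentenceTransformer('clip-ViT-B-16')

# Encode an image (the documented usage passes a PIL image, not a path string)
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])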

But if I want to combine them, step 2 is still needed. It does not look like I can get a joint embedding from both the image and the text.

Am I missing something?

@NimaBoscarino to clarify, I have (image, text) pairs. Given two (image, text) pairs, I have to determine semantic similarity, i.e., whether there is a high cosine similarity between the two (image, text) pairs.

Oooh I see, sorry I think I’d misunderstood the original question. I’d gotten confused and hadn’t realized that these would all be pairs of (text, image), and for some reason had thought that you had individual text and images for your dataset. In that case you’re right that you would need to join them somehow, and my intuition is that maybe you could concatenate the text + image embeddings? I’m not aware of a better way to merge two embeddings.
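To make the concatenation idea concrete, here is a minimal sketch in plain Python. The vectors are toy placeholders, not real model outputs, and the helper names (l2_normalize, combine) are hypothetical. L2-normalizing each part before concatenating is my own assumption, not something the library does for you.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def combine(image_emb, text_emb):
    """Concatenate the L2-normalized image and text embeddings."""
    return l2_normalize(image_emb) + l2_normalize(text_emb)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Three (image, text) pairs; a and b are meant to be similar, c is different.
pair_a = combine([0.9, 0.1, 0.0], [0.2, 0.8, 0.1])
pair_b = combine([0.8, 0.2, 0.1], [0.1, 0.9, 0.0])
pair_c = combine([0.0, 0.1, 0.9], [0.9, 0.0, 0.2])

sim_ab = cosine(pair_a, pair_b)  # similarity between pair a and pair b
sim_ac = cosine(pair_a, pair_c)  # similarity between pair a and pair c
```

With both halves normalized, the cosine similarity of two concatenated vectors works out to the average of the image-to-image and text-to-text cosine similarities, so neither modality dominates just because its raw embedding has a larger norm.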

SentenceTransformers lets you create networks from scratch, so maybe you could create a custom torch.nn.Module that takes in both the image and text. It could pass the image through a separately-trained CLIP model with something like Questions about CLIP · Issue #1175 · UKPLab/sentence-transformers · GitHub, and the text would go through a different (or the same) model? The results of the two would then be concatenated to form the final embedding.
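Roughly, the wiring I have in mind looks like the sketch below. It is deliberately framework-agnostic: a real version would subclass torch.nn.Module and call actual models, whereas here PairEncoder, fake_image_encoder and fake_text_encoder are all hypothetical stand-ins that only show how the two branches feed one joint vector.

```python
from typing import Callable, List, Sequence

Vector = List[float]

class PairEncoder:
    """Hypothetical wrapper that encodes an (image, text) pair into one vector."""

    def __init__(
        self,
        image_encoder: Callable[[object], Sequence[float]],
        text_encoder: Callable[[str], Sequence[float]],
    ):
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder

    def encode(self, image, text: str) -> Vector:
        # Run each modality through its own branch, then concatenate.
        image_vec = list(self.image_encoder(image))
        text_vec = list(self.text_encoder(text))
        return image_vec + text_vec

# Stand-in encoders; in practice these would wrap real models.
def fake_image_encoder(image) -> Vector:
    return [float(len(str(image))), 1.0]

def fake_text_encoder(text: str) -> Vector:
    return [float(len(text.split())), 0.0, 1.0]

encoder = PairEncoder(fake_image_encoder, fake_text_encoder)
joint = encoder.encode("two_dogs_in_snow.jpg", "Two dogs in the snow")
```

Keeping the two encoders as separate arguments means the text branch can be the same CLIP model or a different one without changing the rest of the code.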

I’m not quite sure how you would load the training data though, as I think InputExample only takes in text. I’ll have to do some more digging. I know that a tutorial for fine-tuning CLIP models with SentenceTransformers was being worked on, but I don’t think it was ever released. I’ll try to get that for you ASAP (it will have to be next week), since that might help. This is also quite a bit outside my area of expertise, so I’ll ask the team to fact-check me on this.

I think you’re dealing with a really interesting problem, and I promise we’ll get it to work!

The model you used here is already a trained CLIP model and therefore a contrastive approach by itself. SentenceTransformers alone can only give representations for text (“sentences”).

I used SentenceTransformers in a CLIP-based model where one encoder used frozen representations of the text, computed by SentenceTransformer. I took this path because I had much longer texts than the ones used to train the original CLIP model, which were basically just image captions.
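One way to read "frozen" here is that the text embeddings are computed once and never updated during training, so they can simply be cached. The sketch below illustrates that reading only. FrozenTextEmbeddings and fake_sentence_encoder are hypothetical names, and the stand-in encoder would be replaced by a real SentenceTransformer encode call in practice.

```python
from typing import Callable, Dict, List, Sequence

class FrozenTextEmbeddings:
    """Computes each text embedding once and reuses it afterwards."""

    def __init__(self, encode_fn: Callable[[str], Sequence[float]]):
        self._encode_fn = encode_fn
        self._cache: Dict[str, List[float]] = {}
        self.calls = 0  # how many times the underlying encoder actually ran

    def __call__(self, text: str) -> List[float]:
        if text not in self._cache:
            self._cache[text] = list(self._encode_fn(text))
            self.calls += 1
        return self._cache[text]

# Stand-in for a real sentence encoder (assumption for illustration only).
def fake_sentence_encoder(text: str) -> List[float]:
    return [float(len(text)), float(len(text.split()))]

frozen = FrozenTextEmbeddings(fake_sentence_encoder)
first = frozen("A long product description with many sentences.")
second = frozen("A long product description with many sentences.")
```

Only the image branch (and any projection layers on top of the cached text vectors) would then receive gradient updates.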