SentenceTransformer models can encode images and text into a single vector space. You could combine both modalities to build a shared vector space for products, and then train it further with a contrastive learning objective.
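To make the idea concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss over paired image/text embeddings, written in plain NumPy. The embeddings would in practice come from something like `model.encode(images)` and `model.encode(texts)` with a CLIP-style SentenceTransformer; the function name and temperature value here are my own choices, not part of the library (sentence-transformers also ships ready-made losses such as `MultipleNegativesRankingLoss` that implement essentially this objective for training):

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched (image, text) pairs.

    image_emb, text_emb: arrays of shape (batch, dim), row i of each
    belonging to the same product, so the positives sit on the diagonal
    of the similarity matrix.
    """
    # L2-normalise so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))  # matching pair i is at column i

    def cross_entropy(lg, lb):
        # numerically stable log-softmax cross-entropy
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # symmetric: image->text retrieval and text->image retrieval
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Hypothetical usage: replace the random arrays with real embeddings,
# e.g. img = model.encode(pil_images), txt = model.encode(descriptions)
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 512))
txt = rng.normal(size=(4, 512))
loss = info_nce_loss(img, txt)
```

For actual training you would compute this loss on the model's output tensors inside an autograd framework (e.g. PyTorch) so gradients flow back into the encoder; the NumPy version above is only meant to show the shape of the objective.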
Like in the notebook referenced by @raphaelmerx, I also used a pre-trained CLIP model to embed images and text in the same vector space, so you can perform semantic search: Weights & Biases.
@raphaelmerx in the given example, you have shown `model.encode` for encoding images and text. Do you have an example of how to apply that to contrastive learning?