How to combine Image and Text embedding for product similarity

Hi everyone.

I’m working on product similarity using images and text. Each product has an image and a text description. The image is embedded with ViT and the text is embedded with BERT.
There are 1000 product examples.
images_embedding.size() = torch.Size([1000, 768])  # 1000 examples
text_embedding.size() = torch.Size([1000, 768])

I combined the two embeddings by concatenation, using this code: image_text_embed = torch.cat((images_embedding, text_embeddings), dim=1)

Final embedding size is torch.Size([1000, 1536])

To calculate the similarity between the query (image and text) and the embeddings of all examples, I use cosine similarity, like this:
query_embedding = torch.cat((image_query_embedding, text_query_embedding))

Similarity function:

def compute_scores(emb_one, emb_two):
    """Computes cosine similarity between two vectors."""
    scores = torch.nn.functional.cosine_similarity(emb_one, emb_two)
    scores = scores.cpu()
    return scores.numpy().tolist()

sim_scores = compute_scores(all_candidate_embeddings, query_embedding)

I tested this method on 100 product images, but the results were not good. How can I combine image and text embeddings effectively to get the best possible product similarity results? What method would you suggest?
Please suggest some good solutions. Thanks a lot.


Without any fine-tuning, this won’t work as the embeddings aren’t aligned. One would need to train the models to make sure similar product images and their names are embedded closely to each other in the embedding space.

I’d recommend using CLIP, which has a vision encoder and a text encoder whose embedding spaces are aligned with each other: CLIP.
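
For illustration, here is a minimal sketch of how this could look with the transformers library. The checkpoint name "openai/clip-vit-base-patch32", the helper names (embed_with_clip, combine_embeddings, rank_products) and the image_weight parameter are my own assumptions, not something from the question. Averaging the normalized image and text vectors is just one simple way to fuse them; it only makes sense here because both come from the same aligned space.

```python
import torch
import torch.nn.functional as F


def embed_with_clip(images, texts, checkpoint="openai/clip-vit-base-patch32"):
    """Embed PIL images and their descriptions with a single CLIP checkpoint.

    The checkpoint name is an assumption; a domain-specific CLIP checkpoint
    could be passed instead.
    """
    from transformers import CLIPModel, CLIPProcessor  # imported lazily

    model = CLIPModel.from_pretrained(checkpoint).eval()
    processor = CLIPProcessor.from_pretrained(checkpoint)
    # CLIP's text encoder has a short context window, so long descriptions
    # are truncated.
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Both tensors live in the same projection space: [N, D] each.
    return outputs.image_embeds, outputs.text_embeds


def combine_embeddings(image_emb, text_emb, image_weight=0.5):
    """L2-normalize each modality, take a weighted average, renormalize."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    fused = image_weight * image_emb + (1.0 - image_weight) * text_emb
    return F.normalize(fused, dim=-1)


def rank_products(query_emb, candidate_embs, top_k=5):
    """Return (scores, indices) of the top_k candidates by cosine similarity."""
    # Inputs are unit-length, so the dot product equals the cosine similarity.
    scores = candidate_embs @ query_emb.squeeze(0)
    k = min(top_k, scores.numel())
    return torch.topk(scores, k=k)
```

Usage would be roughly: call embed_with_clip on the catalogue and on the query, fuse each with combine_embeddings, then pass the results to rank_products. The image_weight value is worth tuning on a few labelled examples, since the best balance between image and text depends on the data.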

People have already fine-tuned CLIP on various domains, like fashion: patrickjohncyh/fashion-clip · Hugging Face

A script to fine-tune CLIP can be found here: transformers/examples/pytorch/contrastive-image-text at main · huggingface/transformers · GitHub

Update: as of 2024 there’s a better CLIP-style model called SigLIP. It’s the same as CLIP but trained with a sigmoid loss instead of a softmax loss. Various checkpoints have been released, including a multilingual one: SigLIP - a google Collection.
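
To make the softmax vs. sigmoid difference concrete, here is a rough sketch of the two training losses on a batch of paired embeddings. This is a simplified illustration, not the actual training code of either model: the function names are made up, and the temperature, scale and bias are fixed constants here, whereas in the real models they are learned parameters.

```python
import torch
import torch.nn.functional as F


def clip_softmax_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss: symmetric cross-entropy over the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # [N, N]
    # Row i should pick column i (its own caption), and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


def siglip_sigmoid_loss(image_emb, text_emb, scale=10.0, bias=-10.0):
    """SigLIP-style loss: every image-text pair is a binary classification."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scale * image_emb @ text_emb.t() + bias  # [N, N]
    n = logits.size(0)
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / n
```

The softmax version normalizes each row and column over the whole batch, while the sigmoid version scores each pair independently. For inference, both kinds of model are used the same way: embed, normalize, compare with cosine similarity.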