How to combine Image and Text embedding for product similarity

miladfa7 · July 22, 2023, 8:48pm

Hi everyone.

I’m working on product similarity using images and text. Each product has an image and a text description; the image is embedded with ViT and the text is embedded with BERT.
There are 1000 product examples.
image_embeddings.size() = torch.Size([1000, 768])  # 1000 examples
text_embeddings.size() = torch.Size([1000, 768])

I combine the two embeddings by concatenating them with this code: image_text_embed = torch.cat((image_embeddings, text_embeddings), dim=1)

The final embedding size is torch.Size([1000, 1536])

To calculate the similarity between a query image/text and all of the example embeddings, I use cosine similarity. The query embedding is built the same way:
query_embedding = torch.cat((image_query_embedding, text_query_embedding))

Similarity function:

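import torch

def compute_scores(emb_one, emb_two):
    """Computes cosine similarity between each row of emb_one and the vector emb_two."""
    # emb_two of shape [1536] broadcasts against emb_one of shape [1000, 1536]
    scores = torch.nn.functional.cosine_similarity(emb_one, emb_two, dim=1)
    scores = scores.cpu()
    return scores.numpy().tolist()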

sim_scores = compute_scores(all_candidate_embeddings, query_embedding)
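
One thing that may be worth checking in a setup like this is the scale of each modality: with raw concatenation, whichever embedding has the larger norm can dominate the cosine score. The sketch below L2-normalizes each part before concatenating, so the final score becomes a weighted average of the image cosine and the text cosine. The function name `combine` and the `image_weight` parameter are hypothetical, and the random tensors only stand in for real ViT/BERT outputs. This does not by itself make the two models' spaces aligned.

```python
import torch
import torch.nn.functional as F

def combine(image_emb, text_emb, image_weight=0.5):
    """L2-normalize each modality, then concatenate with sqrt weights.

    With this scaling the combined vector has unit norm, and the cosine
    similarity of two combined vectors equals
    image_weight * cos(image) + (1 - image_weight) * cos(text).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    w_img = image_weight ** 0.5
    w_txt = (1.0 - image_weight) ** 0.5
    return torch.cat((w_img * image_emb, w_txt * text_emb), dim=-1)

# Placeholder data with the shapes from the post (not real embeddings).
torch.manual_seed(0)
candidates = combine(torch.randn(1000, 768), torch.randn(1000, 768))
query = combine(torch.randn(768), torch.randn(768))
scores = F.cosine_similarity(candidates, query.unsqueeze(0), dim=1)
top5 = scores.topk(5).indices  # indices of the 5 most similar products
```

Changing `image_weight` is a simple way to test whether one modality is hurting the ranking: 1.0 gives image-only similarity and 0.0 gives text-only similarity.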

I tested this method on 100 product images, but the results were not good. How can I combine image and text embeddings effectively to get good product-similarity results? What method would you suggest?
Please suggest some good solutions. Thanks a lot.

nielsr · July 23, 2023, 6:52pm

Hi,

Without any fine-tuning, this won’t work as the embeddings aren’t aligned. One would need to train the models to make sure similar product images and their names are embedded closely to each other in the embedding space.
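
As a rough illustration of what "training the models so matching images and texts are embedded closely" can look like, here is a minimal sketch of a symmetric softmax contrastive loss of the kind CLIP-style training uses. The function name and the fixed `temperature` value are assumptions for illustration; real training code usually learns the temperature and works on encoder outputs rather than random tensors.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to belong to
    the same product; every other pairing in the batch is a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> matching text
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> matching image
    return (loss_i2t + loss_t2i) / 2
```

Minimizing this loss pulls each product's image and text embeddings together and pushes them away from the other products in the batch.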

I’d recommend using CLIP, which has a vision encoder and a text encoder whose embedding spaces are aligned with each other.
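
A minimal sketch of how this could look with the transformers library, assuming the `openai/clip-vit-base-patch32` checkpoint (any CLIP checkpoint should work the same way). The helper names `embed_products`, `product_vectors`, and `rank` are hypothetical, and averaging the image and text embeddings into one vector per product is just one simple option, not something prescribed above.

```python
import torch
import torch.nn.functional as F

def embed_products(images, texts, checkpoint="openai/clip-vit-base-patch32"):
    """Return (image_embeds, text_embeds) from CLIP's shared space.

    images: list of PIL images, texts: list of strings, same length.
    """
    from transformers import CLIPModel, CLIPProcessor  # imported lazily
    model = CLIPModel.from_pretrained(checkpoint).eval()
    processor = CLIPProcessor.from_pretrained(checkpoint)
    # CLIP's text encoder has a short context, so long descriptions get truncated.
    inputs = processor(text=texts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.image_embeds, outputs.text_embeds

def product_vectors(image_embeds, text_embeds):
    """Average the two normalized embeddings into one unit vector per product."""
    fused = F.normalize(image_embeds, dim=-1) + F.normalize(text_embeds, dim=-1)
    return F.normalize(fused, dim=-1)

def rank(query_vec, candidate_vecs, k=5):
    """Top-k candidates by cosine similarity (all inputs are unit vectors)."""
    scores = candidate_vecs @ query_vec
    return scores.topk(min(k, scores.numel()))
```

Because both encoders share one space, a query that has only an image or only a text can be normalized and passed to `rank` directly as `query_vec`.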

People have already fine-tuned CLIP on various domains, like fashion: patrickjohncyh/fashion-clip · Hugging Face

A script to fine-tune CLIP can be found here: transformers/examples/pytorch/contrastive-image-text at main · huggingface/transformers · GitHub

Update: as of 2024 there’s a better CLIP-style model called SigLIP. It’s the same as CLIP but trained with a sigmoid loss instead of a softmax loss. Various checkpoints have been released, including a multilingual one: SigLIP - a google Collection.
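
To make the "sigmoid instead of softmax" difference concrete, here is a minimal sketch of a pairwise sigmoid loss. Instead of a softmax over each row of the similarity matrix, every image-text pair is treated as its own binary problem (match or no match). The function name is hypothetical, and `scale` and `bias` are fixed constants here purely for illustration; in actual training they would be learnable parameters.

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(image_emb, text_emb, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    Row i of image_emb and row i of text_emb are assumed to match;
    all other pairs are treated as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scale * image_emb @ text_emb.T + bias      # [batch, batch]
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1  # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / n
```

Each pair contributes independently, so the loss does not need a normalization over the whole batch the way the softmax version does.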

mdyusuf2528 · May 6, 2025, 2:50pm

Embedding vectors must be in the same vector space, so try to find and use a model that shares a single vector space for both modalities.
