I’m working on product similarity using images and text. Each product has an image and a text description; the image is embedded with ViT and the text is embedded with BERT.
There are 1000 product examples.
image_embeddings.size() = torch.Size([1000, 768]) # examples = 1000
text_embeddings.size() = torch.Size([1000, 768])
I combine the two embeddings by concatenation, using this code:
image_text_embed = torch.cat((image_embeddings, text_embeddings), dim=1)
The final embedding size is torch.Size([1000, 1536]).
To calculate the similarity between a query (image plus text) and the embeddings of all examples, I use cosine similarity, like this:
query_embedding = torch.cat((image_query_embedding, text_query_embedding))
def compute_scores(emb_one, emb_two):
    """Computes cosine similarity between two vectors."""
    scores = torch.nn.functional.cosine_similarity(emb_one, emb_two)
    scores = scores.cpu()
    return scores.numpy().tolist()

sim_scores = compute_scores(all_candidate_embeddings, query_embedding)
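For reference, the whole pipeline can be reproduced as a minimal, self-contained sketch. Random tensors stand in for the real ViT and BERT embeddings, and `query_index` and `best_match` are names made up for this illustration, so it only shows the shapes and the scoring step, not my actual results.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # reproducible stand-in data

num_products, dim = 1000, 768

# Assumption: random stand-ins for the real ViT / BERT outputs (one 768-d vector each).
image_embeddings = torch.randn(num_products, dim)
text_embeddings = torch.randn(num_products, dim)

# Concatenate along the feature axis: one 1536-d vector per product.
all_candidate_embeddings = torch.cat((image_embeddings, text_embeddings), dim=1)

# Hypothetical query: reuse product 42, so the expected best match is known.
query_index = 42
query_embedding = torch.cat(
    (image_embeddings[query_index], text_embeddings[query_index])
)


def compute_scores(candidates, query):
    """Cosine similarity between every candidate row and a single query vector."""
    # unsqueeze(0) makes the query shape [1, 1536] so it broadcasts over the rows.
    scores = F.cosine_similarity(candidates, query.unsqueeze(0), dim=1)
    return scores.cpu().tolist()


sim_scores = compute_scores(all_candidate_embeddings, query_embedding)

# Index of the highest-scoring candidate.
best_match = max(range(len(sim_scores)), key=sim_scores.__getitem__)
print(all_candidate_embeddings.shape, best_match)
```

Because the query here is a copy of one of the candidates, the top-scoring index should be that same product; with real embeddings the query would come from a separate image and description.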
I tested this method on 100 product images, but the results were not good. How can I combine image and text embeddings effectively to get better product similarity results? What methods would you suggest?
Any suggestions would be appreciated. Thanks a lot.