Using an image's text and image embeddings from CLIP with FAISS

I have a dataset of 50K images, each with an associated text description. I want to use each image’s text and image in a semantic search database such as FAISS.

I have been able to use CLIP to embed either each image or each text description. However, given that the text descriptions should aid in classification, I am wondering if there is a way to put an image’s text and image embeddings into a single embedding. Is simply combining the two embeddings a possible solution?

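import faiss
import numpy as np

# 1024 = text dims + image dims (e.g. 512 + 512, depending on the CLIP variant)
index = faiss.IndexFlatL2(1024)
image_embedding = get_image_embedding(clip_model)  # my own helper, returns a 2-D array
text_embedding = get_text_embedding(clip_model)    # my own helper, returns a 2-D array
# join the two embeddings side by side, one row per image
combined_embedding = np.concatenate((text_embedding, image_embedding), axis=1)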

Or would a better approach be to maintain two separate indexes, one for text and one for imagery, and then take the union of their search results?

Somewhat unrelated to transformers, but I found that FAISS does support combining search results from multiple indexes via a helper called ResultHeap.
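For anyone wondering what merging results from two indexes looks like, here is a minimal plain-Python sketch of the idea. It does not use the ResultHeap API; `merge_top_k` and the sample hit lists are made up for illustration, and it assumes both indexes use the same image ids and that smaller distances are better (as with an L2 index). Note that text and image distances may not be on the same scale, so a raw merge like this is only a starting point.

```python
import heapq


def merge_top_k(results_a, results_b, k):
    """Merge two lists of (distance, id) hits into a single top-k list.

    Smaller distance is better. If the same id shows up in both lists,
    only its smallest distance is kept.
    """
    best = {}
    for dist, idx in list(results_a) + list(results_b):
        if idx not in best or dist < best[idx]:
            best[idx] = dist
    # Sort by distance (ties broken by id) and keep the k best hits.
    return heapq.nsmallest(k, ((d, i) for i, d in best.items()))


# Hypothetical hits for one query, as (distance, image_id) pairs.
text_hits = [(0.10, 7), (0.25, 3), (0.40, 9)]
image_hits = [(0.05, 3), (0.30, 8), (0.35, 7)]

merged = merge_top_k(text_hits, image_hits, k=3)
print(merged)
```

Keeping the best distance per id is one choice among several; summing or averaging the two distances for ids found in both lists would favor images that match in both modalities.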

There is this paper that suggests one way: they simply concatenate the two vectors.
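A rough sketch of concatenation, in case it helps. This is not taken from the paper; the normalization and the `text_weight` parameter are my own assumptions, added so that neither modality dominates just because its vectors happen to be longer. With square-root weights, the squared L2 distance between two combined vectors works out to `text_weight * d_text^2 + (1 - text_weight) * d_image^2`, so the weight directly controls how much each modality counts.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine(text_vec, image_vec, text_weight=0.5):
    """Concatenate normalized text and image vectors with a relative weight."""
    t = l2_normalize(text_vec)
    v = l2_normalize(image_vec)
    wt = math.sqrt(text_weight)
    wi = math.sqrt(1.0 - text_weight)
    return [wt * x for x in t] + [wi * x for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 2-d "embeddings"; real CLIP vectors would be much longer.
item_a = combine([1.0, 0.0], [0.0, 2.0], text_weight=0.3)
item_b = combine([0.0, 3.0], [0.0, 1.0], text_weight=0.3)
print(len(item_a), squared_l2(item_a, item_b))
```

The combined vectors have unit length, so they can go into an L2 index as is; the index dimension is the sum of the two embedding sizes.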

Depending on the vector sizes you have, you can look at ways to reduce the size if needed. There are a few different approaches. I haven’t done this though.
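As one illustration of size reduction, here is a small sketch using a random projection, chosen only because it fits in a few lines of plain Python; PCA is another common choice. The function names and dimensions are made up for the example, and I have not checked how much retrieval quality this costs on CLIP embeddings.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a fixed Gaussian random projection matrix (out_dim rows)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(vec, projection):
    """Map a vector to the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vec)) for row in projection]


# Toy sizes; for concatenated CLIP vectors this might be e.g. 1024 -> 256.
projection = make_projection(in_dim=8, out_dim=3)
reduced = project([1.0] * 8, projection)
print(len(reduced))
```

The same projection matrix has to be applied to every stored vector and to every query, which is why it is built once from a fixed seed.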

