I have a dataset of 50K images, and each image has a text description associated with it. I want to index both the text and the image for each item in a semantic search database such as FAISS.
I have been able to use CLIP to embed either the images or the text descriptions. However, given that the text descriptions should aid in classification, I am wondering whether there is a way to put an item's text embedding and image embedding into a single embedding. Is simply combining (e.g. concatenating) the two embeddings a possible solution?
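For reference, my get_image_embedding and get_text_embedding helpers are thin wrappers around CLIP. Roughly, they look like this (a sketch assuming the Hugging Face transformers implementation and the openai/clip-vit-base-patch32 checkpoint, which produces 512-dimensional vectors; the image_path and caption arguments are placeholders for one dataset item):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)   # shape (1, 512)
    return feats.cpu().numpy().astype("float32")

def get_text_embedding(caption):
    inputs = processor(text=[caption], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)    # shape (1, 512)
    return feats.cpu().numpy().astype("float32")

With those per-item embeddings, my current attempt is: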
import numpy as np
import faiss

# each CLIP embedding is 512-dim, so the concatenated vector is 1024-dim
index = faiss.IndexFlatL2(1024)

image_embedding = get_image_embedding(image_path)  # shape (1, 512), float32
text_embedding = get_text_embedding(caption)       # shape (1, 512), float32

# put text and image features side by side -> shape (1, 1024)
combined_embedding = np.concatenate((text_embedding, image_embedding), axis=1)
index.add(combined_embedding)
Or would a better approach be to maintain two separate indexes - one for text and one for images - and then take the union of their search results?
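To clarify what I mean by that second option, here is a rough sketch (assuming all_text_embeddings and all_image_embeddings are the (N, 512) float32 arrays for the whole dataset, and that the query embedding comes from the same CLIP model; search_union and top_k are just illustrative names):

# one index per modality, both in the same 512-dim CLIP space
text_index = faiss.IndexFlatL2(512)
image_index = faiss.IndexFlatL2(512)
text_index.add(all_text_embeddings)    # shape (N, 512)
image_index.add(all_image_embeddings)  # shape (N, 512)

def search_union(query_embedding, top_k=10):
    # query both indexes with the same embedding and merge the returned IDs
    _, text_ids = text_index.search(query_embedding, top_k)
    _, image_ids = image_index.search(query_embedding, top_k)
    return set(text_ids[0]) | set(image_ids[0])

My worry with this is that the union loses a single ranking across both modalities, which is partly why I am asking whether a combined embedding is the better route.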