Vector search from text-image pairs: separate or common space?


I have 15M (text, image) product pairs and this volume doubles every 2 months. Given a query pair (text, image), I need to find the most similar pairs.

  • Text length varies from 1 to 20 words; image quality varies from 1 to 10. All combinations of good/poor text and image are possible
  • Similarity = cosine, storage = Chroma
  • Training and inference budgets are low

This problem is quite common (see this post), and I’m investigating several options:

  1. Cascade filtering with separate embeddings: find the most similar texts, then filter by the most similar images, or vice versa

  2. Stacked separate embeddings

  • v_image_i = ViT-B-32.encode(image_i)
  • v_text_i = LLM.encode(text_i)
  • similarity(i, j) = cosine(np.concatenate([v_image_i, v_text_i]), np.concatenate([v_image_j, v_text_j]))
  3. A single shared latent space for both modalities
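For option 2, here is a minimal runnable sketch of the stacking idea in pure NumPy (`stack_and_score` and the per-modality normalization are my own additions, not an established API):

```python
import numpy as np

def stack_and_score(v_image_i, v_text_i, v_image_j, v_text_j):
    """Option 2: concatenate per-modality embeddings, then compare with cosine.
    Unit-normalizing each modality first gives text and image equal weight
    regardless of their native dimensionality or scale."""
    unit = lambda v: v / np.linalg.norm(v)
    a = np.concatenate([unit(v_image_i), unit(v_text_i)])
    b = np.concatenate([unit(v_image_j), unit(v_text_j)])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

One useful property: with per-modality normalization, the combined cosine is exactly the average of the image cosine and the text cosine, so you could also weight the two halves explicitly if one modality is more reliable for your catalog.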

My intuition is that a single shared latent space will capture less fine-grained similarity than separate per-modality embeddings.
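On the other hand, a shared space is operationally simple: one vector per product, one Chroma collection, and a (text, image) query collapses into a single vector. A sketch of that query side, assuming both embeddings already live in the same joint space (the helper name `pair_to_query` is mine):

```python
import numpy as np

def pair_to_query(v_image, v_text):
    """Collapse a (text, image) pair into one query vector in a shared
    latent space: normalize each embedding, average, renormalize.
    Assumes both vectors come from the same joint embedding model."""
    unit = lambda v: v / np.linalg.norm(v)
    q = unit(v_image) + unit(v_text)
    return q / np.linalg.norm(q)
```

Ranking items by cosine against this vector is equivalent to ranking by the sum of their similarity to the text and their similarity to the image.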

Given the state of the art in July 2023, which option do you think is best? Could anyone share benchmarks or insights about expected performance?
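For completeness, here is the brute-force baseline I have in mind for option 1 before wiring it into Chroma (function name and the top-K values are illustrative):

```python
import numpy as np

def cascade_search(q_text, q_image, text_db, image_db, k_coarse=100, k_final=10):
    """Option 1: shortlist candidates by text similarity, then re-rank the
    shortlist by image similarity. text_db / image_db are (N, d) matrices
    of row-wise unit-normalized embeddings; queries are unit vectors."""
    coarse = np.argsort(text_db @ q_text)[::-1][:k_coarse]           # stage 1: top-K by text cosine
    rerank = np.argsort(image_db[coarse] @ q_image)[::-1][:k_final]  # stage 2: re-rank by image cosine
    return coarse[rerank]
```

The same two-stage pattern maps onto an ANN store by querying the text index for `k_coarse` ids, then scoring only those candidates' image embeddings.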

Thank you :smile: