Hi,
I have 15M (text, image) product pairs, and this volume doubles every 2 months. Given a query pair (text, image), I need to find the most similar pairs.
- Text length varies from 1 to 20 words; image quality varies from 1 to 10. All combinations of good/poor text and image occur
- Similarity = cosine, storage = Chroma
- Training and inference budgets are low
This problem is quite common (see this post), and I'm investigating several options:
- Cascade filtering with separate embeddings: find the most similar texts, then filter by image similarity among those (or vice versa); see the first sketch after this list
- Stacked separate embeddings (second sketch after this list)
  - v_image_i = ViT-B-32.encode(image)
  - v_text_i = LLM.encode(text)
  - similarity(i, j) = cosine(np.concatenate([v_image_i, v_text_i]), np.concatenate([v_image_j, v_text_j]))
- Single shared latent space (third sketch after this list)
  - With any VisionTextDualEncoder from HF: similarity(i, j) = cosine(v_latent_i, v_latent_j)
  - ASIF: a common space built from frozen unimodal encoders, with no training needed ([2210.01738] ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training)
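
For option 1, here is a minimal sketch of the cascade with Chroma, assuming precomputed embeddings stored in two collections with matching ids; the collection names, the shortlist size, and the `cascade_search` helper are illustrative assumptions, not a settled design:

```python
import numpy as np
import chromadb

# Assumed setup: one Chroma collection per modality, both using cosine space,
# populated with the same ids. Collection names are placeholders.
client = chromadb.Client()
text_col = client.get_or_create_collection("texts", metadata={"hnsw:space": "cosine"})
image_col = client.get_or_create_collection("images", metadata={"hnsw:space": "cosine"})

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cascade_search(query_text_emb, query_image_emb, k=10, shortlist=500):
    # Stage 1: cheap ANN shortlist on the text embedding.
    res = text_col.query(query_embeddings=[list(query_text_emb)], n_results=shortlist)
    candidate_ids = res["ids"][0]
    # Stage 2: exact re-ranking of the shortlist by image similarity.
    got = image_col.get(ids=candidate_ids, include=["embeddings"])
    emb_by_id = {i: np.asarray(e) for i, e in zip(got["ids"], got["embeddings"])}
    scored = [(cid, cosine(np.asarray(query_image_emb), emb_by_id[cid]))
              for cid in candidate_ids if cid in emb_by_id]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

The shortlist size trades recall against cost: a larger stage-1 `n_results` catches pairs whose text match is weak but whose image match is strong.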
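For option 2, a sketch of the concatenation, assuming sentence-transformers wrappers (`clip-ViT-B-32` for images and `all-MiniLM-L6-v2` for text are placeholder model choices):

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Placeholder encoders: any image tower / text encoder pair works the same way.
image_encoder = SentenceTransformer("clip-ViT-B-32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_pair(text: str, image_path: str) -> np.ndarray:
    v_text = text_encoder.encode(text, normalize_embeddings=True)
    v_image = image_encoder.encode(Image.open(image_path), normalize_embeddings=True)
    # L2-normalizing each half first means cosine on the concatenation
    # equals the average of the per-modality cosines.
    return np.concatenate([v_text, v_image])
```

If both halves are unit-normalized, the cosine of two concatenated vectors is exactly the mean of the text cosine and the image cosine, so you can weight one modality over the other simply by scaling its half before concatenating.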
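For option 3, a sketch using plain CLIP as the dual encoder; the checkpoint and the averaging of the two normalized embeddings into a single pair vector are my assumptions, and any VisionTextDualEncoder checkpoint would slot in the same way:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_pair(text: str, image_path: str) -> torch.Tensor:
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    t = out.text_embeds[0] / out.text_embeds[0].norm()
    i = out.image_embeds[0] / out.image_embeds[0].norm()
    # One vector per pair: mean of the two normalized embeddings
    # (meaningful because CLIP maps both modalities into the same space).
    v = (t + i) / 2
    return v / v.norm()
```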
My intuition is that a single shared latent space will capture less fine-grained similarity than separate per-modality embeddings.
Given the state of the art in July 2023, which option do you think is best? Could anyone share benchmarks or insights about expected performance?
Thank you
Tom