Hi,
I have 15M (text, image) product pairs, and this volume doubles every 2 months. Given a query pair (text, image), I need to find the most similar pairs.
- Text length varies from 1 to 20 words; image quality varies from 1 to 10. All combinations of good/poor text and image occur
- Similarity = cosine, storage = Chroma
- Training and inference budgets are low
This problem is quite common (see this post), and I'm investigating several options:
- Cascade filtering with separate embeddings: find the most similar texts, then filter by image similarity among those (or vice versa); see the first sketch after this list
- Stacked separate embeddings (second sketch after this list)
  - v_image_i = ViT-B-32.encode(image)
  - v_text_i = LLM.encode(text)
  - similarity(i, j) = cosine(np.concatenate([v_image_i, v_text_i]), np.concatenate([v_image_j, v_text_j]))
- Single shared latent space (third sketch after this list)
  - With any VisionTextDualEncoder from HF: similarity(i, j) = cosine(v_latent_i, v_latent_j)
  - ASIF: a common space built from frozen unimodal encoders, with no training needed ([2210.01738] ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training)
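
For option 1, here is a minimal sketch of the cascade with Chroma, assuming precomputed embeddings stored in two collections with matching ids; the collection names, the shortlist size, and the `cascade_search` helper are illustrative assumptions, not a settled design:

```python
import numpy as np
import chromadb

# Assumed setup: one Chroma collection per modality, both using cosine space,
# populated with the same ids. Collection names are placeholders.
client = chromadb.Client()
text_col = client.get_or_create_collection("texts", metadata={"hnsw:space": "cosine"})
image_col = client.get_or_create_collection("images", metadata={"hnsw:space": "cosine"})

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cascade_search(query_text_emb, query_image_emb, k=10, shortlist=500):
    # Stage 1: cheap ANN shortlist on the text embedding.
    res = text_col.query(query_embeddings=[list(query_text_emb)], n_results=shortlist)
    candidate_ids = res["ids"][0]
    # Stage 2: exact re-ranking of the shortlist by image similarity.
    got = image_col.get(ids=candidate_ids, include=["embeddings"])
    emb_by_id = {i: np.asarray(e) for i, e in zip(got["ids"], got["embeddings"])}
    scored = [(cid, cosine(np.asarray(query_image_emb), emb_by_id[cid]))
              for cid in candidate_ids if cid in emb_by_id]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

The shortlist size trades recall against cost: a larger stage-1 `n_results` catches pairs whose text match is weak but whose image match is strong.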
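For option 2, a sketch of the concatenation, assuming sentence-transformers wrappers (`clip-ViT-B-32` for images and `all-MiniLM-L6-v2` for text are placeholder model choices):

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Placeholder encoders: any image tower / text encoder pair works the same way.
image_encoder = SentenceTransformer("clip-ViT-B-32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_pair(text: str, image_path: str) -> np.ndarray:
    v_text = text_encoder.encode(text, normalize_embeddings=True)
    v_image = image_encoder.encode(Image.open(image_path), normalize_embeddings=True)
    # L2-normalizing each half first means cosine on the concatenation
    # equals the average of the per-modality cosines.
    return np.concatenate([v_text, v_image])
```

If both halves are unit-normalized, the cosine of two concatenated vectors is exactly the mean of the text cosine and the image cosine, so you can weight one modality over the other simply by scaling its half before concatenating.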
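For option 3, a sketch using plain CLIP as the dual encoder; the checkpoint and the averaging of the two normalized embeddings into a single pair vector are my assumptions, and any VisionTextDualEncoder checkpoint would slot in the same way:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_pair(text: str, image_path: str) -> torch.Tensor:
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    t = out.text_embeds[0] / out.text_embeds[0].norm()
    i = out.image_embeds[0] / out.image_embeds[0].norm()
    # One vector per pair: mean of the two normalized embeddings
    # (meaningful because CLIP maps both modalities into the same space).
    v = (t + i) / 2
    return v / v.norm()
```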
My intuition is that a single shared latent space will capture less fine-grained similarity than separate per-modality embeddings.
Given the state of the art in July 2023, which option do you think is best? Could anyone share benchmarks or insights about expected performance?
Thank you
Tom