How to optimize performance of CLIP when searching 10_000 images

I am using CLIP to run predictions over 1000 images based on one sentence.

text="some sentence"
images=[] # a list of images
CHECKPOINT = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(CHECKPOINT)
model = CLIPModel.from_pretrained(CHECKPOINT).to("cuda")
inputs = processor(text=text, images=images, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs.to("cuda"))
  • What if I want to scale to 10_000 images? I can scale vertically, but can I do it more cost-effectively?
  • Currently I load all the images into the processor for each sentence; can I do this more efficiently?

Do you mean you want to build a retrieval system, where you want to find the image that best matches the text?

It’s probably beneficial to embed all the images in your database beforehand using CLIP’s image encoder.
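
For example, here is a minimal sketch of pre-computing the image embeddings in batches with get_image_features (the batch size of 64 and the assumption that images is a list of PIL images are mine, not from your code):

import torch
from transformers import CLIPModel, CLIPProcessor

CHECKPOINT = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(CHECKPOINT)
model = CLIPModel.from_pretrained(CHECKPOINT).to("cuda").eval()

image_embeddings = []
with torch.no_grad():
    for i in range(0, len(images), 64):  # process the database in batches of 64
        batch = images[i:i + 64]
        pixel_values = processor(images=batch, return_tensors="pt").pixel_values.to("cuda")
        emb = model.get_image_features(pixel_values=pixel_values)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize so inner product = cosine similarity
        image_embeddings.append(emb.cpu())
image_embeddings = torch.cat(image_embeddings)  # shape: (num_images, embedding_dim)

These embeddings only need to be computed once and can be stored on disk, so a new sentence no longer re-processes the whole image set.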

Then, you can use a library like Faiss to efficiently retrieve the image embedding that is closest to the text embedding.
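
A minimal sketch of that retrieval step, assuming the normalized image_embeddings tensor from the snippet above and a simple flat inner-product index (faiss-cpu or faiss-gpu has to be installed separately):

import faiss
import torch

# Inner product on L2-normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(image_embeddings.shape[1])
index.add(image_embeddings.numpy().astype("float32"))

with torch.no_grad():
    text_inputs = processor(text=["some sentence"], return_tensors="pt", padding=True).to("cuda")
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores, indices = index.search(text_emb.cpu().numpy().astype("float32"), k=5)  # top-5 image indices

For 10_000 vectors a flat index is already fast; approximate indexes (e.g. IVF or HNSW) only become interesting at much larger scales.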


Thanks for the fast response!
Yes, I am currently using the CLIPProcessor, an easy-to-use abstraction layer.
I’ll have a look at what it’s doing under the hood …

Thanks for the pointer to Faiss. I should have a look at that!
Already found some interesting stuff: