How to optimize performance of CLIP when searching 10_000 images

I am using CLIP to run predictions over 1000 images based on one sentence.

text="some sentence"
images=[] # a list of images
CHECKPOINT = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(CHECKPOINT)
model = CLIPModel.from_pretrained(CHECKPOINT).to("cuda")
inputs = processor(text=text, images=images, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs.to("cuda"))
  • What if I want to scale to 10_000 images? I can scale vertically, but can I do it more cost-effectively?
  • Currently I load all the images into the processor for each sentence; can I do this more efficiently?

Do you mean you want to build a retrieval system, where you want to find the image that best matches the text?

It’s probably beneficial to embed all the images in your database beforehand using CLIP’s image encoder.
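
For example, here is a minimal sketch of pre-computing the image embeddings in batches with get_image_features (the batch size of 64 and the assumption that images is a list of PIL images are mine, not from your code):

import torch
from transformers import CLIPModel, CLIPProcessor

CHECKPOINT = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(CHECKPOINT)
model = CLIPModel.from_pretrained(CHECKPOINT).to("cuda").eval()

image_embeddings = []
with torch.no_grad():
    for i in range(0, len(images), 64):  # process the database in batches of 64
        batch = images[i:i + 64]
        pixel_values = processor(images=batch, return_tensors="pt").pixel_values.to("cuda")
        emb = model.get_image_features(pixel_values=pixel_values)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize so inner product = cosine similarity
        image_embeddings.append(emb.cpu())
image_embeddings = torch.cat(image_embeddings)  # shape: (num_images, embedding_dim)

These embeddings only need to be computed once and can be stored on disk, so a new sentence no longer re-processes the whole image set.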

Then, you can use a library like Faiss to efficiently retrieve the image embedding that is closest to the text embedding.
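
A minimal sketch of that retrieval step, assuming the normalized image_embeddings tensor from the snippet above and a simple flat inner-product index (faiss-cpu or faiss-gpu has to be installed separately):

import faiss
import torch

# Inner product on L2-normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(image_embeddings.shape[1])
index.add(image_embeddings.numpy().astype("float32"))

with torch.no_grad():
    text_inputs = processor(text=["some sentence"], return_tensors="pt", padding=True).to("cuda")
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores, indices = index.search(text_emb.cpu().numpy().astype("float32"), k=5)  # top-5 image indices

For 10_000 vectors a flat index is already fast; approximate indexes (e.g. IVF or HNSW) only become interesting at much larger scales.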


Thanks for the fast response!
Yes, I am currently using the CLIPProcessor, an easy-to-use abstraction layer.
I’ll have a look at what it’s doing under the hood …

Thanks for the pointer to Faiss. I should have a look at that!
Already found some interesting stuff: