I am using CLIP to find the best match among 1,000 images for a single sentence.
```python
import torch
from transformers import CLIPModel, CLIPProcessor

text = "some sentence"
images = []  # a list of images

CHECKPOINT = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(CHECKPOINT)
model = CLIPModel.from_pretrained(CHECKPOINT).to("cuda")

inputs = processor(text=text, images=images, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs.to("cuda"))
```
- What if I want to scale to 10_000 images? I can scale vertically, but can I do it more cost-effectively?
- Currently I load all the images into the processor for every sentence; can I do this more efficiently?
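One common pattern (a sketch under assumptions, not taken from the snippet above) is to compute the image embeddings once, in batches, via `model.get_image_features(...)`, cache them, and then score each new sentence against the cache with a cosine similarity instead of re-running the processor and model over all images per query. The sketch below uses random tensors as stand-ins for the cached CLIP embeddings so it runs without downloading the model; the embedding dimension of 512 matches `clip-vit-base-patch32`, and `top_k` is an illustrative parameter:

```python
import torch
import torch.nn.functional as F

def rank_images(text_emb: torch.Tensor, image_embs: torch.Tensor, top_k: int = 5):
    """Rank precomputed image embeddings against one text embedding.

    text_emb:   (dim,) tensor, e.g. from model.get_text_features(...)
    image_embs: (num_images, dim) tensor, cached once from
                model.get_image_features(...) computed in batches.
    Returns the top_k cosine-similarity scores and their image indices.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = image_embs @ text_emb  # cosine similarity, shape (num_images,)
    scores, indices = sims.topk(min(top_k, image_embs.shape[0]))
    return scores, indices

# Stand-in tensors: in practice image_embs is built once (per batch of images)
# and reused for every query sentence, so each query costs one text forward
# pass plus a matrix-vector product instead of 10_000 image forward passes.
torch.manual_seed(0)
image_embs = torch.randn(10_000, 512)  # cached image embeddings
text_emb = torch.randn(512)            # embedding of the query sentence

scores, indices = rank_images(text_emb, image_embs, top_k=5)
```

With the cache on disk (or in a vector index), scaling from 1,000 to 10,000 images mostly adds a one-time embedding cost; per-query cost stays near constant.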