I understand that when using CLIP to perform image classification, we’re treating the classification labels themselves as textual input and getting per-image logits for each image-label pair.
Using the HF API:
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained(model_name_or_path)
model = CLIPModel.from_pretrained(model_name_or_path)

text_labels = ["bag", "coat", "jacket"]
images = [image1, image2, image3, image4]  # assuming the images are already defined

inputs = processor(
    images=images,
    text=text_labels,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
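For reference, `outputs.logits_per_image` has shape `(num_images, num_labels)`, and a softmax over the label dimension turns each row into a probability distribution over the labels. A minimal sketch of that post-processing, with a made-up logits matrix standing in for the real model output:

```python
import torch

# Hypothetical logits for 4 images x 3 labels, standing in for
# outputs.logits_per_image (shape: num_images x num_labels).
logits_per_image = torch.tensor([
    [24.1, 18.9, 19.3],
    [17.2, 25.6, 20.1],
    [18.0, 19.5, 26.2],
    [25.0, 17.1, 18.4],
])

probs = logits_per_image.softmax(dim=-1)  # each row now sums to 1 over the labels

text_labels = ["bag", "coat", "jacket"]
predicted = [text_labels[i] for i in probs.argmax(dim=-1).tolist()]
```

Each image independently gets a distribution over the same shared label set, which is exactly the assumption that breaks once every image needs its own text strings.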
The question I have is: what if we wanted to attach additional textual information to each label? Building on that example, suppose each image is a product image with an associated product name (e.g., "Black Coat for Autumn"). If I wanted to combine these product names with the labels, the text input would no longer be a simple list of labels; it would look like this:
text_input = [
    [
        "bag - Black Coat for Autumn",
        "coat - Black Coat for Autumn",
        "jacket - Black Coat for Autumn",
    ],
    [
        "bag - White Socks",
        "coat - White Socks",
        "jacket - White Socks",
    ],
]
If I structure the text like this, I can't simply pass it through the processor in a single call, since each image now has its own set of candidate texts. Without any major changes, I believe I would have to process one image at a time. Is there any way to still perform batch inference in this setup?
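For concreteness, the one-image-at-a-time workaround I have in mind would look roughly like the sketch below. Here `score` is a hypothetical stand-in for the real `processor`/`model` call (so the sketch runs without downloading weights), and the image entries are placeholders:

```python
import torch

def score(image, texts):
    # Hypothetical stand-in for the real forward pass:
    #   model(**processor(images=image, text=texts,
    #                     return_tensors="pt", padding=True)).logits_per_image
    return torch.randn(1, len(texts))

text_input = [
    ["bag - Black Coat for Autumn", "coat - Black Coat for Autumn",
     "jacket - Black Coat for Autumn"],
    ["bag - White Socks", "coat - White Socks", "jacket - White Socks"],
]
images = ["image1", "image2"]  # placeholders for the actual PIL images

# One forward pass per image: straightforward, but no cross-image batching.
per_image_probs = [
    score(img, texts).softmax(dim=-1) for img, texts in zip(images, text_input)
]
```

This works, but it throws away the batching the processor/model already support for the shared-label case, which is what prompts the question.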