How is additional text information used for image classification using CLIP?

I understand that when using CLIP to perform image classification, we’re treating the classification labels themselves as textual input and getting per-image logits for each image-label pair.

Using the HF API:

from transformers import CLIPModel, CLIPProcessor

model_name_or_path = "openai/clip-vit-base-patch32"  # any CLIP checkpoint

processor = CLIPProcessor.from_pretrained(model_name_or_path)
model = CLIPModel.from_pretrained(model_name_or_path)

text_labels = ["bag", "coat", "jacket"]
images = [image1, image2, image3, image4]  # Assuming images are already defined.

inputs = processor(
    images=images,
    text=text_labels,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
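
For context, I'm reading the classification scores out of outputs.logits_per_image, which (if I understand the output correctly) has shape (num_images, num_labels); a softmax over the label dimension turns them into per-image probabilities:

probs = outputs.logits_per_image.softmax(dim=-1)  # (num_images, num_labels)
predicted_labels = [text_labels[i] for i in probs.argmax(dim=-1).tolist()]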

The question I have is: what if we wanted to use additional textual information with each label? Building on that example, let's say each image is a product image with an associated product name (e.g., "Black Coat for Autumn"). If I wanted to combine these product names with the labels, the text input would no longer be a flat list of labels; it would instead look like this:

text_input = [
    [
        "bags - Black Coat for Autumn",
        "coat - Black Coat for Autumn",
        "jacket - Black Coat for Autumn",
    ],
    [
        "bags - White Socks",
        "coat - White Socks",
        "jacket - White Socks",
    ],
]

If I do this, I can't simply pass it through the processor. Without any major changes, I believe I would have to process one image at a time, as in the sketch below. Is there still a way to perform batch inference in this setup?
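
To make that concrete, here is the one-image-at-a-time workaround I have in mind (just a sketch; product_names is something I'm assuming exists alongside images, one name per image):

product_names = ["Black Coat for Autumn", "White Socks"]  # hypothetical, one per image

all_probs = []
for image, product_name in zip(images, product_names):
    # Each image gets its own candidate texts: "label - product name"
    candidate_texts = [f"{label} - {product_name}" for label in text_labels]
    inputs = processor(
        images=image,
        text=candidate_texts,
        return_tensors="pt",
        padding=True,
    )
    outputs = model(**inputs)
    # logits_per_image has shape (1, num_labels) for a single image
    all_probs.append(outputs.logits_per_image.softmax(dim=-1))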
