How is additional text information used for image classification using CLIP?

I understand that when using CLIP to perform image classification, we’re treating the classification labels themselves as textual input and getting per-image logits for each image-label pair.

Using the HF API:

from transformers import CLIPModel, CLIPProcessor

model_name_or_path = "openai/clip-vit-base-patch32"  # any CLIP checkpoint

processor = CLIPProcessor.from_pretrained(model_name_or_path)
model = CLIPModel.from_pretrained(model_name_or_path)

text_labels = ["bag", "coat", "jacket"]
images = [image1, image2, image3, image4]  # Assuming images are already defined.

inputs = processor(
    images=images,
    text=text_labels,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
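
For context, I'm reading the classification scores out of outputs.logits_per_image, which (if I understand the output correctly) has shape (num_images, num_labels); a softmax over the label dimension turns them into per-image probabilities:

probs = outputs.logits_per_image.softmax(dim=-1)  # (num_images, num_labels)
predicted_labels = [text_labels[i] for i in probs.argmax(dim=-1).tolist()]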

The question I have is: what if we wanted to use additional textual information with each label? Building on that example, let's say each image is a product image with an associated product name (e.g., "Black Coat for Autumn"). If I wanted to combine these product names with the labels, the text input would no longer be a flat list of labels; it would instead look like this:

text_input = [
    [
        "bags - Black Coat for Autumn",
        "coat - Black Coat for Autumn",
        "jacket - Black Coat for Autumn",
    ],
    [
        "bags - White Socks",
        "coat - White Socks",
        "jacket - White Socks",
    ],
]

If I do this, I can't simply pass it through the processor. Without any major changes, I believe I would have to process one image at a time, as in the sketch below. Is there still a way to perform batch inference in this setup?
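
To make that concrete, here is the one-image-at-a-time workaround I have in mind (just a sketch; product_names is something I'm assuming exists alongside images, one name per image):

product_names = ["Black Coat for Autumn", "White Socks"]  # hypothetical, one per image

all_probs = []
for image, product_name in zip(images, product_names):
    # Each image gets its own candidate texts: "label - product name"
    candidate_texts = [f"{label} - {product_name}" for label in text_labels]
    inputs = processor(
        images=image,
        text=candidate_texts,
        return_tensors="pt",
        padding=True,
    )
    outputs = model(**inputs)
    # logits_per_image has shape (1, num_labels) for a single image
    all_probs.append(outputs.logits_per_image.softmax(dim=-1))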
