CLIPModel finetuning

prithivida · January 5, 2022, 7:38am

I am doing the following to fine-tune CLIPModel further on my own dataset.

But with no luck.

Please advise.

Model load

from transformers import CLIPProcessor, CLIPModel, CLIPConfig
config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel(config)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Data loader

import torch
from torch.utils.data import Dataset
from PIL import Image

class CLIPDataset(Dataset):
    def __init__(self, root_dir, df,processor, max_target_length=32):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        file_name = self.df['image'][idx]
        image = Image.open(self.root_dir + "/" + file_name).convert("RGB")
        text = self.df['title'][idx]

        pixel_values = self.processor.feature_extractor(image, return_tensors="pt").pixel_values
        labels = self.processor.tokenizer(text, 
                                padding="max_length", 
                                max_length=self.max_target_length,
                                truncation=True).input_ids
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        return {"input_ids":torch.tensor(labels), "pixel_values":pixel_values.squeeze()}

Using default data_collator with Trainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    tokenizer=processor.feature_extractor
)

prithivida · January 10, 2022, 6:01am

@nielsr / @valhalla / @patrickvonplaten, can I seek your advice?

nielsr · January 11, 2022, 9:22am

The Seq2SeqTrainer shouldn't be used for fine-tuning CLIP; it is only meant for seq2seq models such as T5, BART, EncoderDecoderModel, VisionEncoderDecoderModel, etc.

Can you try using the Trainer?

prithivida · January 12, 2022, 11:28am

Yes, I tried Trainer instead.

The CLIPModel forward throws an exception where we are trying to get CLIPVisionEmbeddings from pixel_values.

8 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/clip/modeling_clip.py in forward(self, pixel_values)
    138         class_embeds = self.class_embedding.expand(batch_size, 1, -1)
    139         embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
--> 140         embeddings = embeddings + self.position_embedding(self.position_ids)
    141         return embeddings
    142 
RuntimeError: The size of tensor a (50) must match the size of tensor b (145) at non-singleton dimension 1

nielsr · January 13, 2022, 9:11am

It seems like the issue occurs when adding position embeddings to the patch embeddings.

CLIP uses an image size of 224 and a patch size of 32. Hence, the number of patches equals (224 // 32)**2 = 49, and one also adds a CLS token, so the number of embeddings for the image tokens is 49 + 1 = 50.

However, it seems like 145 position embeddings are added. The position IDs are defined on this line. It seems like self.num_positions is set to 145? Can you check if this is the case?
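The arithmetic above can be checked with a short sketch. The helper name num_position_embeddings is hypothetical (it is not part of transformers); it only assumes a square image, a square patch, and one extra CLS token, as described above.

```python
def num_position_embeddings(image_size: int, patch_size: int) -> int:
    """Number of vision position embeddings: one per patch plus one CLS token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


# Default openai/clip-vit-base-patch32 geometry: 7 * 7 patches + CLS
print(num_position_embeddings(224, 32))  # 50

# A larger image size with the same patch size: 12 * 12 patches + CLS
print(num_position_embeddings(384, 32))  # 145
```

A mismatch such as 50 versus 145 therefore suggests that the images were preprocessed for one image size while the model config was built for another.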

prithivida · January 13, 2022, 2:12pm

Good catch. I had config.vision_config.image_size = 384 set (my custom images are hi-res). Hence, as per the formula, (384 // 32)**2 + 1 = 144 + 1 = 145.

Setting back to 224 fixes that.

The forward pass now seems to work, but it looks like there are two ways the loss can be returned:

As the first element of the outputs, if it is a tuple
Or under a separate key 'loss', if outputs is a dict (I believe this is the case for CLIP)

The Trainer is complaining that outputs is a dict and does not have the key 'loss' (I tried both the stable and the nightly version, with the same result).

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1931         else:
   1932             # We don't use .loss here since the model may return tuples instead of ModelOutput.
-> 1933             loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
   1934 
   1935         return (loss, outputs) if return_outputs else loss

/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   2325         if isinstance(k, str):
   2326             inner_dict = {k: v for (k, v) in self.items()}
-> 2327             return inner_dict[k]
   2328         else:
   2329             return self.to_tuple()[k]

KeyError: 'loss'

Any advice?

nielsr · January 13, 2022, 3:59pm

As shown in the docs:

=> your model must either return a tuple or a subclass of ModelOutput, such as the ones defined here.

prithivida · January 13, 2022, 4:44pm

[Update]: Trainer works, but I see one challenge in the implementation.

Apparently the return_loss parameter of forward() needs to be True for the model to return a loss. To test this, I cloned the source and locally removed the if condition (shown below) that checks the flag. Trainer works!
The challenge is that I am not sure how to pass return_loss from my data loader (Dataset) so that it is True during training.

        loss = None
        if return_loss:
            loss = clip_loss(logits_per_text)
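One way to set return_loss without patching the library is to add it in the data collator, which is, as far as I can tell, what the contrastive-image-text example script does. The sketch below assumes that Trainer passes every key of the collated batch to model.forward() as a keyword argument, and that each dataset item is a dict with "pixel_values" and "input_ids" tensors of fixed shape. The name clip_collate_fn and the dummy shapes are illustrative only.

```python
import torch


def clip_collate_fn(examples):
    """Stack per-example tensors and ask CLIPModel.forward to compute its loss."""
    pixel_values = torch.stack([ex["pixel_values"] for ex in examples])
    input_ids = torch.stack([ex["input_ids"] for ex in examples])
    return {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        # Extra key forwarded to forward(); assumed to reach CLIPModel unchanged.
        "return_loss": True,
    }


# Dummy examples standing in for CLIPDataset items (shapes are assumptions).
examples = [
    {
        "pixel_values": torch.zeros(3, 224, 224),
        "input_ids": torch.zeros(32, dtype=torch.long),
    }
    for _ in range(4)
]
batch = clip_collate_fn(examples)
print(batch["pixel_values"].shape, batch["input_ids"].shape, batch["return_loss"])
```

The function would then be passed as data_collator=clip_collate_fn when constructing the Trainer, in place of default_data_collator.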

[Original response]:

Thanks, this screenshot was helpful. I will check what the model returns.

But I am simply reusing the HF CLIPModel implementation (not a custom one), and as per the code it returns a CLIPOutput (below), which is a concrete subclass of ModelOutput, as you rightly pointed out. Here the loss is clip_loss, which internally uses a similarity-based contrastive loss.

        return CLIPOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )

gheissariniloofar · July 16, 2022, 1:08am

Any solution for this? Adding labels produces this error:

return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'labels'

nielsr · July 20, 2022, 9:27am

CLIPModel doesn't accept a labels argument, as the model is trained contrastively on (image, text) pairs.

To train or fine-tune CLIP, refer to the example script in examples/pytorch/contrastive-image-text of the huggingface/transformers repository on GitHub.
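To illustrate why no labels argument is needed: in contrastive training the targets come from the batch itself, since the i-th text is paired with the i-th image. The following is a minimal sketch of a symmetric cross-entropy loss over a similarity matrix, in the spirit of the clip_loss call quoted earlier in this thread. The function names, the random embeddings, and the scale factor of 100 are illustrative assumptions, not the library's exact code.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy where row i should score highest at column i."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


def clip_style_loss(logits_per_text: torch.Tensor) -> torch.Tensor:
    """Average of the text-to-image and image-to-text losses."""
    caption_loss = contrastive_loss(logits_per_text)
    image_loss = contrastive_loss(logits_per_text.t())
    return (caption_loss + image_loss) / 2.0


# Toy batch of 4 (text, image) pairs with 8-dimensional embeddings.
torch.manual_seed(0)
text_embeds = F.normalize(torch.randn(4, 8), dim=-1)
image_embeds = F.normalize(torch.randn(4, 8), dim=-1)

# Cosine similarities scaled by an assumed temperature factor.
logits_per_text = text_embeds @ image_embeds.t() * 100.0
loss = clip_style_loss(logits_per_text)
print(loss)
```

Because the targets are simply the row indices of the batch, the only supervision required is that each image and its caption sit at the same position in the batch.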
