CLIPModel finetuning

I am doing the following to finetune CLIPModel further on my own dataset.

But with no luck.

Please advise.

Model load

from transformers import CLIPProcessor, CLIPModel, CLIPConfig
config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel(config)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Data loader

import torch
from torch.utils.data import Dataset
from PIL import Image

class CLIPDataset(Dataset):
    def __init__(self, root_dir, df,processor, max_target_length=32):
        self.root_dir = root_dir
        self.df = df
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        file_name = self.df['image'][idx]
        image = Image.open(self.root_dir + "/" + file_name).convert("RGB")
        text = self.df['title'][idx]

        pixel_values = self.processor.feature_extractor(image, return_tensors="pt").pixel_values
        labels = self.processor.tokenizer(text, 
                                padding="max_length", 
                                max_length=self.max_target_length,
                                truncation=True).input_ids
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        return {"input_ids":torch.tensor(labels), "pixel_values":pixel_values.squeeze()}
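
One thing that may be worth checking in the dataset above: -100 is the ignore index used for label tensors in token-level losses, whereas CLIPModel embeds input_ids directly and computes a contrastive loss, so the token ids probably should stay unmasked and be accompanied by an attention_mask. A minimal sketch of how one item could be built, assuming a hypothetical helper named encode_example (not part of transformers) and that the processor is called with the text and the image together:

```python
def encode_example(processor, image, text, max_length=32):
    """Hypothetical helper: build one CLIP training item from an image/text pair."""
    enc = processor(
        text=text,
        images=image,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    # With return_tensors="pt" each entry has a leading batch dimension of 1;
    # drop it so that a collator can stack the items into a batch later.
    return {
        "input_ids": enc["input_ids"].squeeze(0),
        "attention_mask": enc["attention_mask"].squeeze(0),
        "pixel_values": enc["pixel_values"].squeeze(0),
    }
```

Inside __getitem__ this could be called as encode_example(self.processor, image, text, self.max_target_length), with the padding positions described by attention_mask rather than by -100.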

Using default data_collator with Trainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    tokenizer=processor.feature_extractor
)

@nielsr / @valhalla / @patrickvonplaten can I seek your advice?

The Seq2SeqTrainer shouldn’t be used for fine-tuning CLIP; it is only meant for seq2seq models like T5, BART, EncoderDecoderModel, VisionEncoderDecoderModel, etc.

Can you try using the Trainer?

Yes, I tried Trainer instead.

CLIPModel's forward pass throws an exception while computing the CLIPVisionEmbeddings from pixel_values.

8 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/clip/modeling_clip.py in forward(self, pixel_values)
    138         class_embeds = self.class_embedding.expand(batch_size, 1, -1)
    139         embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
--> 140         embeddings = embeddings + self.position_embedding(self.position_ids)
    141         return embeddings
    142 
RuntimeError: The size of tensor a (50) must match the size of tensor b (145) at non-singleton dimension 1

It seems like the issue occurs when adding position embeddings to the patch embeddings.

CLIP uses an image size of 224 and a patch size of 32. Hence, the number of patches equals (224 // 32)**2 = 49, and one also adds a CLS token, so the number of embeddings for the image tokens is 49 + 1 = 50.

However, it seems like 145 position embeddings are added. The position IDs are defined on this line. It seems like self.num_positions is set to 145? Can you check if this is the case?

Good catch. I had set config.vision_config.image_size = 384 (my custom images are hi-res). Hence, as per the formula, (384 // 32)**2 + 1 = 144 + 1 = 145.

Setting back to 224 fixes that.
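
The arithmetic from this exchange can be checked with a short sketch. The function name num_position_embeddings is made up for illustration; it only restates the formula quoted above (patches per side, squared, plus one class token):

```python
def num_position_embeddings(image_size: int, patch_size: int) -> int:
    """Number of vision positions: one per image patch plus one class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


# Images preprocessed to 224x224 produce 50 embeddings ...
print(num_position_embeddings(224, 32))  # 50
# ... while a config with image_size=384 allocates 145 position embeddings.
print(num_position_embeddings(384, 32))  # 145
```

The two numbers match tensor a (50) and tensor b (145) in the RuntimeError, which is why the addition fails unless the configured image size and the preprocessed image size agree.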

The forward pass now seems to work, but it looks like there are two ways for the model to return the loss:

  • Either as the first element of the outputs, if outputs is a tuple
  • Or under a separate key ‘loss’, if outputs is a dict (I believe this is the case for CLIP)

Apparently, the Trainer is complaining that outputs is a dict and doesn't have the key ‘loss’ (I tried both the stable and the nightly version, with the same result).

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1931         else:
   1932             # We don't use .loss here since the model may return tuples instead of ModelOutput.
-> 1933             loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
   1934 
   1935         return (loss, outputs) if return_outputs else loss

/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   2325         if isinstance(k, str):
   2326             inner_dict = {k: v for (k, v) in self.items()}
-> 2327             return inner_dict[k]
   2328         else:
   2329             return self.to_tuple()[k]

KeyError: 'loss'

Any advice?

As shown in the docs:

=> your model must return either a tuple or a subclass of ModelOutput, such as the ones defined here.

[Update]: Trainer works, but I see one challenge in the implementation:

  • Apparently the return_loss parameter of forward() needs to be set to True for the model to return the loss. To test this, I cloned the source and locally removed the if condition (shown below) that checks whether the flag is True. Trainer works!

  • The challenge is that I am not sure how to set return_loss from my data loader (Dataset) so that it is passed as True during training.

        loss = None
        if return_loss:
            loss = clip_loss(logits_per_text)
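
One way to pass the flag without patching the library might be a custom data collator, since the Trainer forwards the collated batch to the model as keyword arguments. The sketch below is an assumption about how that could look, not a confirmed fix for this thread; clip_collate_fn is a hypothetical name, and it assumes each dataset item is a dict of equally shaped tensors:

```python
import torch


def clip_collate_fn(examples):
    """Stack per-example tensors and ask the model for the contrastive loss."""
    batch = {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "input_ids": torch.stack([ex["input_ids"] for ex in examples]),
        # A plain Python bool, not a tensor: it reaches forward() as
        # return_loss=True when the batch is unpacked into the model call.
        "return_loss": True,
    }
    # attention_mask is optional here; include it only if the dataset provides it.
    if "attention_mask" in examples[0]:
        batch["attention_mask"] = torch.stack(
            [ex["attention_mask"] for ex in examples]
        )
    return batch
```

It would be passed as data_collator=clip_collate_fn when constructing the Trainer, in place of default_data_collator. Returning return_loss from the Dataset items themselves is likely to be less convenient, because a default collator may turn the per-item flags into a batched tensor.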

[Original response]:

Thanks, this screenshot was helpful. I will check what the model returns.

But I am simply reusing the HF CLIPModel implementation (not a custom one), and as per the code it returns a CLIPOutput (below), which is a concrete subclass of ModelOutput, as you rightly pointed out. Here, loss is clip_loss, which internally uses a similarity-based contrastive loss.

        return CLIPOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )