The Seq2SeqTrainer shouldn't be used for fine-tuning CLIP; it's only meant for seq2seq models like T5, BART, EncoderDecoderModel, VisionEncoderDecoderModel, etc.
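For reference, a minimal sketch of what fine-tuning with the plain Trainer could look like (train_dataset is assumed to yield dicts of input_ids, attention_mask and pixel_values; the return_loss caveat discussed later in this thread still applies):

from transformers import CLIPModel, Trainer, TrainingArguments

# Plain Trainer instead of Seq2SeqTrainer; train_dataset is assumed to exist
# and to yield dicts with input_ids, attention_mask and pixel_values.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
training_args = TrainingArguments(output_dir="clip-finetuned", per_device_train_batch_size=8)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()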
The CLIPModel forward pass throws an exception while computing CLIPVisionEmbeddings from pixel_values:
/usr/local/lib/python3.7/dist-packages/transformers/models/clip/modeling_clip.py in forward(self, pixel_values)
138 class_embeds = self.class_embedding.expand(batch_size, 1, -1)
139 embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
--> 140 embeddings = embeddings + self.position_embedding(self.position_ids)
141 return embeddings
142
RuntimeError: The size of tensor a (50) must match the size of tensor b (145) at non-singleton dimension 1
It seems like the issue occurs when adding position embeddings to the patch embeddings.
CLIP uses an image size of 224 and a patch size of 32. Hence, the number of patches equals (224 // 32)**2 = 49, and one also adds a CLS token, so the number of embeddings for the image tokens is 49 + 1 = 50.
However, it seems like 145 position embeddings are added. The position IDs are defined on this line. It seems like self.num_positions is set to 145? Can you check if this is the case?
Good catch, I had config.vision_config.image_size = 384 set (my custom images are hi-res). Hence, as per the formula, (384 // 32)**2 = 144, and 144 + 1 = 145.
Setting it back to 224 fixes that.
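For reference, the mismatch follows directly from how the number of positions is derived from image_size and patch_size in CLIPVisionEmbeddings (paraphrased, not the exact source):

import torch
from torch import nn

# Paraphrased from CLIPVisionEmbeddings.__init__ in modeling_clip.py
image_size, patch_size, embed_dim = 224, 32, 768    # image_size=384 is what produced 145
num_patches = (image_size // patch_size) ** 2       # 224 -> 49, 384 -> 144
num_positions = num_patches + 1                     # +1 for the CLS token: 50 vs. 145
position_embedding = nn.Embedding(num_positions, embed_dim)
position_ids = torch.arange(num_positions).expand((1, -1))

With image_size=384 the module is built with 145 position embeddings, while 224x224 inputs only produce 49 patches + 1 CLS token = 50 patch embeddings, hence the 50 vs. 145 mismatch above.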
The forward pass seems to work fine, but it looks like there are two ways the loss can be returned:
Either as the first element of outputs if it is a tuple/list,
Or under a separate key 'loss' if outputs is a dict (I believe for CLIP this is the case).
Apparently, the Trainer is complaining that outputs is a dict and it doesn't have the key 'loss' (tried both the stable and the nightly version, same result):
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
1931 else:
1932 # We don't use .loss here since the model may return tuples instead of ModelOutput.
-> 1933 loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
1934
1935 return (loss, outputs) if return_outputs else loss
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
2325 if isinstance(k, str):
2326 inner_dict = {k: v for (k, v) in self.items()}
-> 2327 return inner_dict[k]
2328 else:
2329 return self.to_tuple()[k]
KeyError: 'loss'
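One way to get past this KeyError without editing the library source could be to subclass Trainer and override compute_loss so that return_loss=True is always passed to the model (a sketch, assuming the batch does not already contain a return_loss key):

from transformers import Trainer

class CLIPTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Force CLIPModel to compute its contrastive loss.
        outputs = model(**inputs, return_loss=True)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss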
[Update]: The Trainer works, but I see one challenge in the implementation.
Apparently the return_loss parameter in forward() needs to be set to True for the model to return the loss. So I cloned the source and locally removed the if condition (shown below) that checks for the flag, just to test. The Trainer works!!
The challenge is that I'm not sure how to set return_loss as part of my data loader (Dataset) so that it comes through as True during training (one possible approach is sketched after the snippet below).
loss = None
if return_loss:
    loss = clip_loss(logits_per_text)
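One possible way to handle this without touching the model source is to add return_loss to each batch in a custom data collator, since the Trainer forwards every key of the batch dict to model(**inputs). A minimal sketch, assuming the dataset already yields tensors that default_data_collator can stack (model, training_args and train_dataset as defined elsewhere):

from transformers import Trainer, default_data_collator

def clip_collator(features):
    # Stack the usual tensors (input_ids, attention_mask, pixel_values) ...
    batch = default_data_collator(features)
    # ... and ask CLIPModel.forward to compute the contrastive loss for every batch.
    batch["return_loss"] = True
    return batch

trainer = Trainer(
    model=model,                  # CLIPModel instance from earlier
    args=training_args,           # your TrainingArguments
    train_dataset=train_dataset,  # dataset yielding tokenized text + pixel_values
    data_collator=clip_collator,
)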
[Original response]:
Thanks, this screenshot was helpful. I will check what the model returns.
But I am simply reusing the HF CLIPModel implementation (not a custom impl), and as per the code it returns a CLIPOutput (below), which is a concrete implementation of ModelOutput as you rightly pointed out. Here the loss is clip_loss, which internally uses a similarity-based contrastive loss.
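For reference, a paraphrased sketch of that clip_loss (a symmetric cross-entropy over the text-image similarity matrix, as in modeling_clip.py; not the exact source):

import torch
from torch import nn

def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
    # Each text/image should match the item at the same batch index.
    return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device))

def clip_loss(similarity: torch.Tensor) -> torch.Tensor:
    # Average the text->image and image->text contrastive losses.
    caption_loss = contrastive_loss(similarity)
    image_loss = contrastive_loss(similarity.t())
    return (caption_loss + image_loss) / 2.0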