Discrepancy between OpenAI CLIP and Huggingface CLIP models

I’m fine-tuning the CLIP openai/clip-vit-base-patch32 model and trying to convert my project to use the Hugging Face library. I swapped out the CLIP model with the Hugging Face version. During training I’m consistently seeing lower loss and AUC metric values even though I’m using the same base model, hyperparameters, and data. Micro-averaged AUC drops from about 0.87 to 0.79, and loss is similarly affected. With the HF model the loss does still decrease and it’s clear the model is learning the data; performance is just not as good. So far I haven’t been able to find any cause for the discrepancy. Is this to be expected? Is there something different about the HF version that would require me to modify my inputs? My data is preprocessed, so I’m not using the Hugging Face tokenizer, though I did try it and the results were the same. Any and all help appreciated!

I solved my own problem. I was scaling the visual encoder position embeddings, and when porting the code that interpolates the position embeddings, I didn’t register position_ids as a buffer. Fixing this brought performance to an identical level with the original CLIP model.
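For anyone hitting the same issue, here is a rough sketch of the kind of fix, assuming a recent transformers version of CLIPVisionModel (where position_ids lives on vision_model.embeddings as a buffer). The function name interpolate_position_embeddings and the new_image_size argument are just illustrative, not part of my actual code:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel


def interpolate_position_embeddings(vision_model: CLIPVisionModel, new_image_size: int):
    """Resize the visual position embeddings for a larger input resolution
    and re-register position_ids as a buffer with the new sequence length."""
    embeddings = vision_model.vision_model.embeddings
    patch_size = embeddings.patch_embedding.kernel_size[0]
    old_weight = embeddings.position_embedding.weight.data  # (num_positions, dim)

    cls_pos = old_weight[:1]    # class-token position embedding
    patch_pos = old_weight[1:]  # per-patch position embeddings

    old_grid = int(patch_pos.shape[0] ** 0.5)
    new_grid = new_image_size // patch_size
    dim = patch_pos.shape[1]

    # Reshape to a 2D grid and bicubically interpolate to the new grid size.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)

    new_weight = torch.cat([cls_pos, patch_pos], dim=0)
    num_positions = new_weight.shape[0]

    # Replace the embedding table with the interpolated one.
    embeddings.position_embedding = torch.nn.Embedding(num_positions, dim)
    embeddings.position_embedding.weight.data.copy_(new_weight)
    embeddings.num_positions = num_positions
    embeddings.num_patches = new_grid * new_grid

    # The step I originally missed: register position_ids as a buffer so it
    # matches the new sequence length and moves with the module's device.
    embeddings.register_buffer(
        "position_ids",
        torch.arange(num_positions).expand((1, -1)),
        persistent=False,
    )

    # Keep the config in sync so the model accepts the larger images.
    embeddings.image_size = new_image_size
    vision_model.config.image_size = new_image_size
```

If position_ids is left as a plain attribute instead of a buffer, it won’t follow the module in .to(device) calls or be handled consistently when loading a state dict, which is enough to silently degrade metrics the way I saw.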
