CLIP-like contrastive vision-language models for Italian with pre-trained text and vision models

For this project, a pre-trained image model such as ViT and a pre-trained text model such as BERT can be used as the image encoder and text encoder, respectively.


Pre-trained ViT and BERT models can be found on the model hub. We could also use multilingual BERT/RoBERTa models for Italian.


The WIT dataset can be used for this task.

Available training scripts

A training script for this will be provided soon. (see PR)

(Optional) Desired project outcome

The desired outcome is to train a CLIP-like model for the Italian language. This can be showcased with a Streamlit or Gradio app.

(Optional) Challenges

This model will require some modifications to the existing models. Specifically, we will need to add projection layers to both the text and image encoders.
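A minimal, framework-agnostic sketch of the projection idea (dimensions are illustrative, not taken from any specific checkpoint): each encoder emits pooled features of a different size, and a learned linear projection maps both into a shared embedding space, followed by L2 normalization as in CLIP.

```python
import numpy as np

def project_and_normalize(features, proj):
    """Map raw encoder features into the shared space and L2-normalize."""
    z = features @ proj
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 768))    # e.g. BERT pooled outputs
image_feats = rng.normal(size=(4, 1024))  # illustrative ViT feature size
proj_dim = 512                            # shared embedding size (CLIP uses 512)

# Random stand-ins; in the real model these projections are trained.
W_text = 0.02 * rng.normal(size=(768, proj_dim))
W_image = 0.02 * rng.normal(size=(1024, proj_dim))

text_emb = project_and_normalize(text_feats, W_text)
image_emb = project_and_normalize(image_feats, W_image)
print(text_emb.shape, image_emb.shape)  # (4, 512) (4, 512)
```

Once both modalities live in the same space, cosine similarities between text and image embeddings become directly comparable, which is what the contrastive training objective needs.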


That is super interesting! :scream_cat:
We have ViT and Italian checkpoints: can’t wait to see it in action!


Seems like a wonderful idea @vinid!

Would be great to benchmark it against the multilingual version produced by the SentenceBERT team (demo notebook here, docs here), since they obtained impressive performance using their multilingual distillation process!


Yes! That would be the perfect testbed! :smiley:


Great idea @vinid!
I am not an expert in computer vision, but I can definitely help with the project :blush:


It’s a great idea! Also, it would be nice to know how they performed the distillation to get the multilingual CLIP.

Do they first distill the text encoder only and then align/fine-tune the multilingual embeddings with image embeddings? Maybe we can borrow some ideas for our CLIP training as well.


@g8a9 They seem to train a bilingual student encoder on parallel data, using MSE against the outputs of the original English text encoder as the loss. More info in the paper.
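That distillation objective can be sketched as follows (NumPy with dummy tensors; in the real setup the teacher is the frozen English CLIP text encoder and the student is a multilingual encoder, which these random arrays merely stand in for):

```python
import numpy as np

def mse_distillation_loss(student_emb, teacher_emb):
    """Mean squared error pulling student outputs toward the frozen teacher's."""
    return float(np.mean((student_emb - teacher_emb) ** 2))

# Dummy batch: the teacher embeds the English sentence, the student embeds the
# Italian translation of the same sentence; the loss pulls the two together.
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(8, 512))  # frozen English text encoder outputs
student_emb = rng.normal(size=(8, 512))  # multilingual student outputs
loss = mse_distillation_loss(student_emb, teacher_emb)
print(loss > 0.0)  # True for these mismatched random embeddings
```

The appeal of this setup is that it only needs parallel text, no paired images: the student inherits the image alignment indirectly by matching the teacher's embedding space.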

If training is done directly in Italian, MSCOCO-it is definitely an option!


Hello @vinid & Team,
I am interested in being part of such an amazing project and team. I will try my best to contribute to the Italian CLIP-like model. It would be nice if we could discuss some learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team.


@gsarti I see, but I’m still not sure how that is adapted to CLIP.

Perhaps, they use the pre-trained CLIP text encoder as the teacher and a bilingual student encoder. Next, they adopt the trained student as the CLIP text encoder.
But then, IMO the multilingual distillation leaves text and image embeddings misaligned in CLIP’s contrastive space (is that right?).
I was wondering if they run an additional contrastive-loss pretraining with the new encoder.
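For context, CLIP aligns its two encoders with a symmetric contrastive (InfoNCE) loss over the batch similarity matrix, so any swapped-in text encoder has to stay compatible with that objective. A NumPy sketch with dummy, L2-normalized embeddings (the temperature value is illustrative):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over the image-text cosine-similarity matrix."""
    # Embeddings are assumed L2-normalized, so dot products are cosine similarities.
    logits = (image_emb @ text_emb.T) / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Negative log-softmax of the diagonal (the matching image-text pairs).
        log_probs = l - np.log(np.sum(np.exp(l), axis=-1, keepdims=True))
        return -np.mean(log_probs[np.arange(n), np.arange(n)])

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 512))
emb /= np.linalg.norm(emb, axis=-1, keepdims=True)
# Perfectly aligned pairs give a much lower loss than shuffled pairs.
print(clip_contrastive_loss(emb, emb) < clip_contrastive_loss(emb, emb[::-1]))  # True
```

This is exactly why an extra contrastive fine-tuning stage after distillation could help: it would re-tighten the diagonal of this similarity matrix for the new text encoder.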

I hope we’ll have time to discuss that and more on a dedicated Discord/Slack server if the project gets funded :slight_smile: (or even if it doesn’t)

Totally agree on MSCOCO-it. Together with WIT (it), I think they’re a good starting point.


I’d be happy to join this project! I have a background in Computer Vision and a little experience with JAX so I hope I can contribute to it!

Awesome! Let’s officially define this project :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.


@vinid created a channel on the discord server to discuss the project. See you there :slight_smile:


Hi everyone! This looks like a promising project; am I too late to join? Have you already started? If not, I’ll be glad to leave a comment on the sheet asking to be officially added to the team! :slight_smile:


Happy to have you on board! We just had a quick meeting today but we can share the notes with you on discord :slight_smile:


Ahahaha, I’ve literally just landed on the Discord channel and found that I was a little too late!
I’m writing to you there