CLIP-like contrastive vision-language models for Italian with pre-trained text and vision models
For this project, a pre-trained image model like ViT and a pre-trained text model like BERT can be used as the image encoder and text encoder, respectively.
Pre-trained ViT and BERT models can be found on the model hub. We could also use multilingual BERT/RoBERTa models for the Italian language.
The WIT dataset can be used for this task.
Available training scripts
A training script for this will be provided soon. (see PR)
(Optional) Desired project outcome
The desired outcome is to train a CLIP-like model for the Italian language. This can be showcased with a Streamlit or Gradio app.
This model will require some modifications to the existing models. Specifically, we will need to add projection layers to both the text and image encoders.
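To make the modification concrete, here is a minimal NumPy sketch of what those projection layers would do: map each encoder's pooled output into a shared embedding space and L2-normalize, so the two modalities can be compared with cosine similarity. The encoder outputs, embedding sizes, and projection dimension below are placeholders, not values from the actual training script.

```python
import numpy as np

def project(features, W):
    """Map encoder outputs into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for BERT and ViT pooled outputs (768-d) for a batch of 4 pairs.
text_features = rng.normal(size=(4, 768))
image_features = rng.normal(size=(4, 768))

# Learned projection layers (random here) into a shared 512-d space.
W_text = rng.normal(size=(768, 512))
W_image = rng.normal(size=(768, 512))

text_emb = project(text_features, W_text)
image_emb = project(image_features, W_image)

# Cosine-similarity logits between all text/image pairs in the batch.
logits = text_emb @ image_emb.T  # shape (4, 4)
```

In the real model these projections would be trainable layers on top of the frozen or fine-tuned encoders, optimized with CLIP's contrastive objective.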
That is super interesting!
We have ViT and Italian checkpoints: can’t wait to see it in action!
Seems like a wonderful idea @vinid!
Would be great to benchmark it against the multilingual version produced by the SentenceBERT team (demo notebook here, docs here), since they obtained impressive performance using their multilingual distillation process!
Yes! That would be the perfect testbed!
Great idea @vinid!
I am not an expert in computer vision, but I can definitely help with the project.
It’s a great idea! Also, it would be nice to know how they distilled to get the multilingual CLIP.
Do they first distill the text encoder only and then align/fine-tune the multilingual embeddings with image embeddings? Maybe we can borrow some ideas for our CLIP training as well.
@g8a9 They seem to train a bilingual student encoder on parallel data, using MSE against the original English text encoder's outputs as the loss. More info in the paper.
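A rough sketch of that distillation objective (the names, shapes, and values below are illustrative, not taken from their code): the student encodes the non-English side of a parallel pair and is pushed toward the teacher's embedding of the English side.

```python
import numpy as np

def mse_distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher sentence embeddings."""
    return np.mean((student_out - teacher_out) ** 2)

# Teacher: the original English text encoder on the English captions.
teacher_emb = np.array([[0.1, 0.2], [0.3, 0.4]])
# Student: a multilingual encoder on the parallel (e.g. Italian) captions.
student_emb = np.array([[0.1, 0.25], [0.35, 0.4]])

loss = mse_distillation_loss(student_emb, teacher_emb)
```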
If training is done directly in Italian, MSCOCO-it is definitely an option!
Hello @vinid & Team,
I am interested in being part of such an amazing project and team. I will try my best to contribute to the Italian version of a CLIP-like model. It would be nice if we could discuss some learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team.
@gsarti I see, but I'm still not sure how that is adopted in CLIP.
Perhaps they use the pre-trained CLIP text encoder as the teacher and a bilingual student encoder, then adopt the trained student as the CLIP text encoder.
But then, IMO, multilingual distillation would leave the text and image embeddings misaligned in CLIP's contrastive space (is that right?).
I was wondering if they run an additional contrastive-loss pretraining with the new encoder.
I hope we’ll have time to discuss that and more on a dedicated Discord/Slack server if the project gets funded (or even if it doesn’t)
Totally agree on MSCOCO-it. Together with WIT (it), I think they’re a good starting point.
I’d be happy to join this project! I have a background in Computer Vision and a little experience with JAX so I hope I can contribute to it!
Awesome! Let's officially define this project.
Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.
@vinid created a channel on the Discord server to discuss the project. See you there!
Hi everyone! This looks like a promising project; am I too late to join? Have you started yet? If not, I'll be glad to leave a comment here or on the sheet asking to be officially added to the team!
Happy to have you on board! We just had a quick meeting today, but we can share the notes with you on Discord.
Ahahaha, I've literally just landed on the Discord channel and found that I was a little too late!
I’m writing you there