CLIP-like contrastive vision-language models for Spanish with pre-trained text and vision models

For this project, a pre-trained image model such as ViT and a pre-trained text model such as BERT can be used as the image encoder and the text encoder, respectively.

Model

Pre-trained ViT and BERT models can be found on the model hub. Multilingual BERT/RoBERTa models could also be used for the Spanish language.
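
As a minimal sketch of loading such a pair from the Hub (the checkpoint names here are examples; any comparable ViT / Spanish or multilingual text model should work):

```python
from transformers import AutoModel, AutoTokenizer, ViTFeatureExtractor, ViTModel

# Example vision checkpoint from the Hub -- swap in any comparable ViT.
vision_model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# BETO, a Spanish BERT; a multilingual model such as "xlm-roberta-base" also works.
text_model = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
```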

Datasets

The WIT (Wikipedia-based Image Text) dataset can be used for this task; it includes image-caption pairs in Spanish among its many languages.
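
As a rough sketch of extracting the Spanish subset, assuming the `wikimedia/wit_base` mirror on the Hub (the dataset id and the `wit_features`/`language` field names are assumptions; verify them against the dataset card, since the original WIT release from Google Research is distributed as TSV shards):

```python
from datasets import load_dataset

# Streaming avoids downloading the full dataset up front.
wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

def has_spanish_caption(example):
    # Each image carries captions in several languages under `wit_features`;
    # keep only the examples that include a Spanish ("es") entry.
    return "es" in example["wit_features"]["language"]

spanish_wit = wit.filter(has_spanish_caption)
```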

Available training scripts

A training script for this will be provided soon (see PR).

(Optional) Desired project outcome

The desired outcome is to train a CLIP-like model for the Spanish language. This can be showcased with a Streamlit or Gradio app (see the sketch below).
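
For illustration, here is a minimal Gradio app that ranks candidate Spanish captions against an uploaded image. `your-org/clip-spanish` is a placeholder for the checkpoint this project would produce, and loading it through transformers' `VisionTextDualEncoderModel` is an assumption about how the final model would be packaged:

```python
import torch
import gradio as gr
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

# Placeholder checkpoint id -- replace with the trained Spanish CLIP model.
model = VisionTextDualEncoderModel.from_pretrained("your-org/clip-spanish")
processor = VisionTextDualEncoderProcessor.from_pretrained("your-org/clip-spanish")

def rank_captions(image, captions):
    texts = [c.strip() for c in captions.split("\n") if c.strip()]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the image's similarity to each candidate caption.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return {t: float(p) for t, p in zip(texts, probs)}

demo = gr.Interface(
    fn=rank_captions,
    inputs=[gr.Image(type="pil"), gr.Textbox(lines=4, label="Captions, one per line")],
    outputs=gr.Label(),
)
demo.launch()
```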

(Optional) Challenges

This model will require some modifications to the existing models. Specifically, we will need to add projection layers on top of both the text and image encoders to map their outputs into a shared embedding space, as sketched below.
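
A minimal PyTorch sketch of those modifications: one linear projection per encoder, plus CLIP's symmetric contrastive loss over in-batch pairs. The projection dimension, pooling strategy, and temperature initialization are assumptions; the actual training script may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderWithProjections(nn.Module):
    """Sketch: pre-trained encoders plus the projection heads a CLIP-like model needs."""

    def __init__(self, vision_model, text_model, proj_dim=512):
        super().__init__()
        self.vision_model = vision_model
        self.text_model = text_model
        # The modification in question: one linear projection per encoder,
        # mapping each encoder's hidden size into a shared embedding space.
        self.visual_projection = nn.Linear(vision_model.config.hidden_size, proj_dim, bias=False)
        self.text_projection = nn.Linear(text_model.config.hidden_size, proj_dim, bias=False)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, pixel_values, input_ids, attention_mask):
        # Using the pooled [CLS] outputs is an assumption; mean pooling also works.
        image_emb = self.visual_projection(
            self.vision_model(pixel_values=pixel_values).pooler_output)
        text_emb = self.text_projection(
            self.text_model(input_ids=input_ids, attention_mask=attention_mask).pooler_output)
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Symmetric InfoNCE loss over the in-batch image-text pairs.
        logits = self.logit_scale.exp() * image_emb @ text_emb.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```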


Count me in!


We will do it!


I would love to participate too! 🙂


Awesome! Let's define the project 🙂
