For this project, a pre-trained image model such as ViT can serve as the image encoder, and a pre-trained text model such as BERT as the text encoder.
The WIT dataset can be used for this task.
A training script will be provided soon (see PR).
The desired outcome is a CLIP-like model trained for the Spanish language, which can be showcased with a Streamlit or Gradio app.
This model requires some modifications to the existing encoders. Specifically, a projection layer must be added on top of both the text and image encoders so that their outputs are mapped into a shared embedding space, where matching image-text pairs can be compared via a contrastive objective.
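The projection-layer idea can be sketched as follows. This is a minimal illustration in PyTorch, not the project's actual training code: the encoder outputs are simulated with random tensors (real ones would come from the pre-trained ViT and BERT models), and the 768-dimensional input size, 512-dimensional projection size, and temperature initialization are assumptions borrowed from common CLIP-style setups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps an encoder's pooled output into the shared embedding space."""
    def __init__(self, in_dim: int, proj_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so similarities are cosine similarities.
        return F.normalize(self.proj(x), dim=-1)

# Stand-ins for the pooled outputs of a ViT image encoder and a
# BERT text encoder (both commonly 768-dimensional); in the real
# project these come from the pre-trained models.
image_features = torch.randn(4, 768)
text_features = torch.randn(4, 768)

image_head = ProjectionHead(768)
text_head = ProjectionHead(768)

img_emb = image_head(image_features)  # shape (4, 512), unit-norm rows
txt_emb = text_head(text_features)

# CLIP-style similarity logits with a learnable temperature
# (initialized to log(1/0.07), as in the original CLIP paper).
logit_scale = nn.Parameter(torch.ones(()) * 2.6593)
logits = logit_scale.exp() * img_emb @ txt_emb.t()

# Symmetric contrastive loss: matching image-text pairs sit on the diagonal.
labels = torch.arange(4)
loss = (F.cross_entropy(logits, labels)
        + F.cross_entropy(logits.t(), labels)) / 2
```

During training, both projection heads (and optionally the encoders themselves) are updated so that embeddings of matching Spanish captions and images are pulled together while mismatched pairs are pushed apart.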