CLIP-like contrastive vision-language models for Indonesian with pre-trained text and vision models
For this project, a pre-trained image model like ViT and a pre-trained text model like BERT can be used as the image encoder and the text encoder, respectively.
Model
Pre-trained ViT and BERT models can be found on the Hugging Face model hub. Multilingual BERT/RoBERTa models could also be used for the Indonesian language.
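As a sketch of the starting point, both encoders can be loaded straight from the hub; the checkpoint names below are example choices, not fixed requirements of the project.

```python
from transformers import AutoTokenizer, BertModel, ViTFeatureExtractor, ViTModel

# Example checkpoints only -- any ViT checkpoint and any Indonesian or
# multilingual BERT/RoBERTa checkpoint from the hub should work here.
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
text_encoder = BertModel.from_pretrained("indobenchmark/indobert-base-p1")
# Multilingual alternative: BertModel.from_pretrained("bert-base-multilingual-cased")

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# The hidden sizes matter later, when projection layers are added (see Challenges).
print(image_encoder.config.hidden_size, text_encoder.config.hidden_size)
```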
Datasets
The WIT (Wikipedia-based Image Text) dataset can be used for this task.
The GEM dataset can also be used for the task.
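As a minimal sketch, WIT could be loaded with the datasets library and filtered down to Indonesian examples. The hub identifier and the language column name below are assumptions and should be checked against the dataset card of whichever WIT copy is used.

```python
from datasets import load_dataset

# "google/wit" and the "language" column are assumptions -- verify against the
# dataset card; "wikimedia/wit_base" is another hub copy worth considering.
# Streaming avoids downloading the full multilingual corpus up front.
wit = load_dataset("google/wit", split="train", streaming=True)
wit_id = wit.filter(lambda example: example["language"] == "id")

for example in wit_id.take(3):
    print(example)
```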
Available training scripts
A training script for this will be provided soon (see PR).
(Optional) Desired project outcome
The desired outcome is to train a CLIP-like model for the Indonesian language. This can be showcased with a Streamlit or Gradio app.
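As a rough sketch of such a demo, a single Gradio interface scoring an image against an Indonesian caption would do. Here `model`, `tokenizer`, and `feature_extractor` are assumed to come out of training, and `embed` refers to the hypothetical helper in the IndoCLIP sketch under Challenges below.

```python
import gradio as gr
import torch

# `model`, `tokenizer`, and `feature_extractor` come from the training step;
# `embed` refers to the IndoCLIP sketch below (all names are assumptions).

def score(image, caption):
    """Cosine similarity between one image and one Indonesian caption."""
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    tokens = tokenizer(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        img_emb, txt_emb = model.embed(pixel_values, tokens.input_ids, tokens.attention_mask)
    # Both embeddings are unit-normalized, so the dot product is cosine similarity.
    return float((img_emb * txt_emb).sum())

gr.Interface(
    fn=score,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Indonesian caption")],
    outputs=gr.Number(label="Image-text similarity"),
).launch()
```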
(Optional) Challenges
This project will require some modifications to the existing models. Specifically, projection layers need to be added on top of both the text and image encoders so that their outputs can be mapped into a shared embedding space where the contrastive loss is computed.
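A minimal PyTorch sketch of that modification is shown below, assuming the example checkpoints from the Model section. The IndoCLIP class name, the `embed` helper, and the projection size are illustrative choices, and the upcoming training script may structure this differently (e.g. in Flax).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, ViTModel

class IndoCLIP(nn.Module):
    """Dual encoder with linear projection heads, in the spirit of CLIP.

    Checkpoint names are examples only; any ViT / Indonesian BERT pair works.
    """

    def __init__(self, proj_dim: int = 512):
        super().__init__()
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_encoder = BertModel.from_pretrained("indobenchmark/indobert-base-p1")
        # The projection layers mentioned above: map each encoder's pooled
        # output into a shared embedding space of size proj_dim.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, proj_dim, bias=False)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, proj_dim, bias=False)
        # Learnable temperature, initialized to ln(1/0.07) as in the CLIP paper.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def embed(self, pixel_values, input_ids, attention_mask=None):
        img = self.image_encoder(pixel_values=pixel_values).pooler_output
        txt = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return F.normalize(self.image_proj(img), dim=-1), F.normalize(self.text_proj(txt), dim=-1)

    def forward(self, pixel_values, input_ids, attention_mask=None):
        img_emb, txt_emb = self.embed(pixel_values, input_ids, attention_mask)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.T
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: match each image to its text and each text to its image.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```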