For this project, a pre-trained image model such as ViT and a pre-trained text model such as BERT can be used as the image encoder and text encoder respectively.
Pre-trained ViT and BERT checkpoints are available on the model hub. Multilingual BERT/RoBERTa models could also be used for Indonesian.
A training script for this will be provided soon (see PR).
The desired outcome is to train a CLIP-like model for the Indonesian language, which can then be showcased with a Streamlit or Gradio app.
This will require some modifications to the existing models. Specifically, we will need to add projection layers on top of both the text and image encoders so that their outputs are mapped into a shared embedding space of the same dimension.
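As a rough illustration of this architecture, the sketch below shows a dual-encoder model with projection heads and the symmetric contrastive loss CLIP trains with. It is a minimal PyTorch sketch under stated assumptions: the encoders here are plain linear stand-ins for the pre-trained ViT and BERT towers, the 768-d encoder outputs, the 512-d projection size, and all class/function names are illustrative, not part of any existing codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps an encoder's output into the shared embedding space."""
    def __init__(self, in_dim: int, proj_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim)

    def forward(self, x):
        # L2-normalize so the dot product below is a cosine similarity
        return F.normalize(self.proj(x), dim=-1)

class CLIPLikeModel(nn.Module):
    """Dual encoder with projection heads.

    The encoders passed in are stand-ins; in practice they would be
    pre-trained ViT (image) and Indonesian BERT (text) models whose
    pooled outputs feed the projection layers.
    """
    def __init__(self, image_encoder, text_encoder,
                 image_dim: int, text_dim: int, proj_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.image_proj = ProjectionHead(image_dim, proj_dim)
        self.text_proj = ProjectionHead(text_dim, proj_dim)
        # learned temperature, initialized to log(1/0.07) as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, images, texts):
        img_emb = self.image_proj(self.image_encoder(images))
        txt_emb = self.text_proj(self.text_encoder(texts))
        # (batch, batch) matrix of scaled pairwise similarities
        return self.logit_scale.exp() * img_emb @ txt_emb.t()

def clip_loss(logits):
    """Symmetric cross-entropy: matched image-text pairs lie on the diagonal."""
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Dummy encoders standing in for ViT / BERT with 768-d pooled outputs
model = CLIPLikeModel(nn.Linear(32, 768), nn.Linear(16, 768),
                      image_dim=768, text_dim=768)
logits = model(torch.randn(4, 32), torch.randn(4, 16))  # shape (4, 4)
loss = clip_loss(logits)
```

In the real project the stand-in encoders would be replaced by the pre-trained models from the hub, with only the projection heads (and optionally the encoders themselves) being fine-tuned on Indonesian image-text pairs.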