Image captioning for Indonesian with pre-trained vision and text models

For this project, a pre-trained image model like ViT can be used as an encoder, and a pre-trained text model like BERT and/or GPT2 can be used as a decoder.


Pre-trained ViT and BERT models can be found on the model hub. We could use multilingual BERT/RoBERTa models for the Indonesian language.
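To make the encoder/decoder idea a bit more concrete, here is a rough sketch of wiring a ViT encoder to a BERT decoder. It assumes a recent version of transformers that ships `VisionEncoderDecoderModel`, plus PyTorch. The tiny config values are made up so that the example runs without downloading anything; they are not the settings proposed for the project.

```python
import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny, randomly initialised configs (illustrative values only).
encoder_cfg = ViTConfig(
    hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, image_size=32, patch_size=16,
)
decoder_cfg = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)

# This helper marks the text model as a decoder and enables cross-attention,
# so the decoder can attend to the image encoder's hidden states.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_cfg, decoder_cfg)
model = VisionEncoderDecoderModel(config=config)
model.eval()

pixel_values = torch.randn(1, 3, 32, 32)         # one fake RGB image
decoder_input_ids = torch.tensor([[1, 5, 6]])    # three fake caption token ids

with torch.no_grad():
    out = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# One row of vocabulary logits per caption position.
print(out.logits.shape)
```

For real training one would presumably start from pre-trained weights instead, e.g. with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained(...)` and a ViT checkpoint plus a multilingual BERT checkpoint from the hub. The cross-attention weights are newly initialised in that case, so they still need to be trained.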


The WIT dataset can be used for this task. It has roughly 200K image-text pairs for Indonesian.
The GEM dataset can also be used for the task.
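As a starting point for the data side, below is a minimal sketch of pulling Indonesian image-caption pairs out of a WIT-style TSV file. The column names (`language`, `image_url`, `caption_reference_description`) and the `id` language code are assumptions about the file layout and should be checked against the actual download. The helper name `iter_indonesian_pairs` and the sample rows are invented for illustration.

```python
import csv
import io


def iter_indonesian_pairs(tsv_file, lang="id"):
    """Yield (image_url, caption) for rows in the requested language.

    Column names are assumed, not verified against the real WIT files.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if row.get("language") != lang:
            continue
        caption = (row.get("caption_reference_description") or "").strip()
        url = (row.get("image_url") or "").strip()
        # Skip rows that have no usable caption or image URL.
        if caption and url:
            yield url, caption


# Made-up in-memory sample standing in for a real TSV shard.
sample = io.StringIO(
    "language\timage_url\tcaption_reference_description\n"
    "id\thttp://example.com/a.jpg\tSeekor kucing di atas meja\n"
    "en\thttp://example.com/b.jpg\tA cat on a table\n"
    "id\thttp://example.com/c.jpg\t\n"
)

pairs = list(iter_indonesian_pairs(sample))
print(pairs)
```

On a real shard you would pass an open file handle instead of the `StringIO` object. Rows with empty captions are dropped here, which may or may not be the right call depending on whether other caption columns are used as a fallback.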

Available training scripts

As this will be a Seq2Seq model, the script can be used for training this model with some modifications.

(Optional) Desired project outcome

The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Indonesian language. This can be showcased with a Streamlit or Gradio app.

(Optional) Challenges

- This model will require some modifications to the existing text models.
- Data processing

Count me in!


I am also in

I am also in

Great! Let’s define the project :slight_smile: cc @valhalla
