Image captioning for Indonesia with pre-trained vision

munggok · June 24, 2021, 6:40pm

Image captioning for Indonesia with pre-trained vision and text model

For this project, a pre-trained image model like ViT can be used as an encoder, and a pre-trained text model like BERT and/or GPT2 can be used as a decoder.
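To make the encoder side concrete, here is a minimal sketch of how a ViT-style encoder turns one image into a sequence that a text decoder can attend to. The 224-pixel image size and 16-pixel patch size are assumptions taken from commonly used ViT-Base checkpoints, and `vit_sequence_length` is a hypothetical helper, not a library function.

```python
def vit_sequence_length(image_size: int, patch_size: int, use_cls_token: bool = True) -> int:
    """Number of encoder hidden states a ViT-style model produces for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Many ViT checkpoints prepend one extra [CLS] token to the patch sequence.
    return num_patches + (1 if use_cls_token else 0)


# Assumed settings: 224x224 input, 16x16 patches.
print(vit_sequence_length(224, 16))  # 197
```

The decoder would then generate the caption token by token while attending to these encoder states, in the same way a summarization decoder attends to the encoded source text.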

Model

Pre-trained ViT and BERT models can be found on the model hub. We could use multilingual BERT/RoBERTa models for the Indonesian language.

Datasets

The WIT dataset can be used for this task. It has around 200K image-text pairs for Indonesian.
The GEM dataset can also be used for the task.
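As a rough sketch of the data selection step, the snippet below keeps only Indonesian rows with a non-empty caption from a WIT-style TSV file. The column names (`language`, `image_url`, `caption_reference_description`) and the language code `id` are assumptions about the file layout and should be checked against the actual dataset. `iter_indonesian_pairs` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io


def iter_indonesian_pairs(tsv_file, language="id",
                          caption_field="caption_reference_description"):
    """Yield (image_url, caption) for rows in the given language with a non-empty caption."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        caption = (row.get(caption_field) or "").strip()
        if row.get("language") == language and caption:
            yield row["image_url"], caption


# Tiny in-memory sample standing in for a real file (illustrative data only).
sample = io.StringIO(
    "language\timage_url\tcaption_reference_description\n"
    "id\thttps://example.org/a.jpg\tCandi Borobudur saat matahari terbit\n"
    "en\thttps://example.org/b.jpg\tBorobudur temple at sunrise\n"
    "id\thttps://example.org/c.jpg\t\n"
)
pairs = list(iter_indonesian_pairs(sample))
print(pairs)
```

For a real gzip-compressed file, the same function should work with a handle opened via `gzip.open(path, "rt", encoding="utf-8", newline="")`.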

Available training scripts

As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training this model with some modifications.
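One of the modifications would likely be in batch preparation: the encoder input becomes image pixels instead of token ids, while the decoder side stays the same as in summarization. The sketch below shows that idea in plain Python. The summarization script has its own array-based `shift_tokens_right`; this list-based version, the `build_caption_batch` helper, and the token ids used are illustrative assumptions.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs by prepending the start token and dropping the last label."""
    shifted = []
    for seq in labels:
        row = [decoder_start_token_id] + list(seq[:-1])
        # Positions masked with -100 in the labels become padding in the inputs.
        shifted.append([pad_token_id if tok == -100 else tok for tok in row])
    return shifted


def build_caption_batch(pixel_values, caption_ids, pad_token_id, decoder_start_token_id):
    """Assemble one batch: images feed the encoder, captions feed the decoder."""
    return {
        "pixel_values": pixel_values,  # replaces the text encoder's input_ids
        "labels": caption_ids,
        "decoder_input_ids": shift_tokens_right(
            caption_ids, pad_token_id, decoder_start_token_id
        ),
    }


# Hypothetical token ids: 0 = decoder start, 1 = pad, 2 = end of sequence.
batch = build_caption_batch(
    pixel_values=[[0.0, 0.5, 1.0]],  # placeholder for a real image tensor
    caption_ids=[[5, 6, 7, 2]],
    pad_token_id=1,
    decoder_start_token_id=0,
)
print(batch["decoder_input_ids"])  # [[0, 5, 6, 7]]
```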

(Optional) Desired project outcome

The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Indonesian language. This can be showcased with a Streamlit or Gradio app.

(Optional) Challenges

- This model will require some modifications to the existing text models.
- Data processing

Galuh · June 24, 2021, 11:16pm

Count me in!

cahya · June 25, 2021, 7:05pm

I am also in

ayameRushia · June 29, 2021, 9:28am

I am also in

patrickvonplaten · June 29, 2021, 3:16pm

Great! Let’s define the project cc @valhalla
