For this project, a pre-trained vision model like ViT can be used as the encoder, and a pre-trained text model like BERT or GPT-2 as the decoder.
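As a rough sketch, the `FlaxVisionEncoderDecoderModel` class from the transformers library can wire the two pre-trained models together. The checkpoint names below are illustrative assumptions, not final choices for the project:

```python
# Minimal sketch: connect a pre-trained ViT encoder to a GPT-2 decoder.
# Assumes the transformers FlaxVisionEncoderDecoderModel API; checkpoint
# names are examples only.
from transformers import AutoTokenizer, FlaxVisionEncoderDecoderModel, ViTImageProcessor

model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # image encoder
    "gpt2",                               # text decoder
)

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; reuse EOS so batching works.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```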
Since this will be a Seq2Seq model, the existing run_summarization_flax.py script can be adapted for training with some modifications.
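The main modification is in preprocessing: instead of tokenizing a source document, each example's image is converted to `pixel_values` for the encoder, while the caption is tokenized as the label sequence. The sketch below reuses `image_processor` and `tokenizer` from the snippet above; the column names `image_path` and `caption` are assumptions about the dataset layout:

```python
# Hedged sketch of the preprocessing change for run_summarization_flax.py.
from PIL import Image

def preprocess_function(examples):
    # Encoder side: load images and convert them to pixel values.
    images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
    pixel_values = image_processor(images=images, return_tensors="np").pixel_values

    # Decoder side: tokenize the captions as training targets.
    labels = tokenizer(
        examples["caption"],
        max_length=64,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )

    return {
        "pixel_values": pixel_values,   # encoder input
        "labels": labels["input_ids"],  # decoder targets
    }
```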
The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and also to train captioning models for the Indonesian language. The result can be showcased with a Streamlit or Gradio app (a Gradio sketch follows at the end of this section).
This setup will require some modifications to the existing text models, e.g. adding randomly initialized cross-attention layers so the decoder can attend to the ViT image features.
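For the demo, a minimal Gradio app could look like the following. The checkpoint name is hypothetical and stands in for the trained Indonesian captioning model once it is published:

```python
# Minimal Gradio demo sketch for the trained captioning model.
import gradio as gr
from transformers import pipeline

captioner = pipeline(
    "image-to-text",
    model="flax-community/vit-gpt2-indonesian",  # hypothetical checkpoint
)

def caption(image):
    # The pipeline returns a list of dicts like [{"generated_text": "..."}].
    return captioner(image)[0]["generated_text"]

demo = gr.Interface(
    fn=caption,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="Indonesian Image Captioning (ViT + GPT-2)",
)

if __name__ == "__main__":
    demo.launch()
```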