Image captioning for Japanese with pre-trained vision and text model

For this project, a pre-trained image model like ViT can be used as an encoder, and a pre-trained text model like BERT and/or GPT2 can be used as a decoder.

Model

Pre-trained ViT and BERT models can be found on the model hub. We could also use multilingual BERT/RoBERTa models for the Japanese language.

Datasets

The WIT dataset can be used for this task. It has roughly 800K image-text pairs for Japanese.
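As a rough illustration of pulling the Japanese pairs out of WIT, here is a minimal sketch that filters a tab-separated file by language code. The column names (`language`, `image_url`, `caption_reference_description`) are assumptions based on the published WIT TSV schema and should be checked against the files you download. The sample rows and the `japanese_pairs` helper are hypothetical.

```python
import csv
import io

# Tiny in-memory stand-in for one WIT TSV shard (hypothetical rows).
SAMPLE_TSV = (
    "language\timage_url\tcaption_reference_description\n"
    "ja\thttps://example.org/a.jpg\t東京タワーの夜景\n"
    "en\thttps://example.org/b.jpg\tTokyo Tower at night\n"
    "ja\thttps://example.org/c.jpg\t\n"
)


def japanese_pairs(tsv_file, caption_column="caption_reference_description"):
    """Yield (image_url, caption) for Japanese rows that have a caption."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        caption = (row.get(caption_column) or "").strip()
        # Keep only rows tagged as Japanese with a non-empty caption.
        if row.get("language") == "ja" and caption:
            yield row["image_url"], caption


pairs = list(japanese_pairs(io.StringIO(SAMPLE_TSV)))
print(pairs)
```

For the real dataset, the same function could be given an opened (and decompressed) shard instead of the `io.StringIO` sample.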

Available training scripts

As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training this model with some modifications.
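The post does not list the required modifications. One piece that would likely carry over unchanged is the teacher-forcing step, where decoder inputs are built by shifting the target caption ids one position to the right. The following is a minimal NumPy sketch of that idea, not the script's own code; the token ids used in the example are made up.

```python
import numpy as np


def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs by shifting target ids one step to the right."""
    shifted = np.zeros_like(labels)
    shifted[:, 1:] = labels[:, :-1]          # drop the last target token
    shifted[:, 0] = decoder_start_token_id   # prepend the start token
    # Positions masked out of the loss (-100) become padding in the inputs.
    return np.where(shifted == -100, pad_token_id, shifted)


# Hypothetical caption ids: 2 stands in for end-of-sequence, -100 for ignored positions.
labels = np.array([[5, 6, 7, 2], [8, 9, 2, -100]])
decoder_input_ids = shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=1)
print(decoder_input_ids)
```

On the encoder side, the text inputs would presumably be replaced by preprocessed image tensors, while the label side stays as tokenized captions.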

(Optional) Desired project outcome

The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Japanese language. This can be showcased with a streamlit or gradio app.

(Optional) Challenges

This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
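To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy: queries come from the decoder's caption token states, while keys and values come from the image encoder's patch embeddings. The projection matrices are randomly initialized, mirroring the fact that this layer has no pre-trained weights. All shapes and names are illustrative assumptions (197 positions corresponds to a ViT with 196 patches plus a class token); this is not the Transformers implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend over image features."""
    q = decoder_states @ w_q                  # (num_tokens, head_dim)
    k = encoder_states @ w_k                  # (num_patches, head_dim)
    v = encoder_states @ w_v                  # (num_patches, head_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_tokens, num_patches)
    weights = softmax(scores, axis=-1)        # each token's distribution over patches
    return weights @ v, weights


rng = np.random.default_rng(0)
num_patches, num_tokens, hidden_dim, head_dim = 197, 5, 768, 64

image_features = rng.normal(size=(num_patches, hidden_dim))  # stand-in for ViT outputs
caption_states = rng.normal(size=(num_tokens, hidden_dim))   # stand-in for decoder states

# Randomly initialized projections: the new parameters that must be trained.
w_q = rng.normal(scale=0.02, size=(hidden_dim, head_dim))
w_k = rng.normal(scale=0.02, size=(hidden_dim, head_dim))
w_v = rng.normal(scale=0.02, size=(hidden_dim, head_dim))

context, attn = cross_attention(caption_states, image_features, w_q, w_k, w_v)
print(context.shape, attn.shape)
```

Because only these projections start from scratch, the rest of the encoder and decoder can still be loaded from their pre-trained checkpoints.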

(Optional) Links to read upon

This Keras example shows clearly how an image encoder and a transformer can be used for image captioning.