Image captioning for Japanese with pre-trained vision and text model

valhalla · June 23, 2021, 10:45am

For this project, a pre-trained image model like ViT can be used as an encoder, and a pre-trained text model like BERT and/or GPT2 can be used as a decoder.

Model

Pre-trained ViT and BERT models can be found on the model hub. We could also use multilingual BERT/RoBERTa models for the Japanese language.

Datasets

The WIT dataset can be used for this task. It has around 800K image-text pairs for Japanese.
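
As a rough sketch of how the Japanese subset could be pulled out, the snippet below filters a tab-separated WIT shard by language and keeps rows that have a caption. The column names (`language`, `image_url`, `caption_reference_description`) are assumed from the published WIT schema and should be checked against the files actually downloaded; the helper name `japanese_pairs` and the sample rows are made up for illustration.

```python
import csv
import io


def japanese_pairs(tsv_file, lang="ja"):
    """Yield (image_url, caption) pairs for rows in the requested language.

    Assumes a WIT-style TSV with a header row; column names are assumptions.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        caption = (row.get("caption_reference_description") or "").strip()
        # Keep only rows in the target language that actually have a caption.
        if row.get("language") == lang and caption:
            yield row["image_url"], caption


# Tiny in-memory stand-in for a WIT shard (made-up rows, illustration only).
sample = io.StringIO(
    "language\timage_url\tcaption_reference_description\n"
    "ja\thttps://example.org/a.jpg\t富士山の写真\n"
    "en\thttps://example.org/b.jpg\tA photo of Mount Fuji\n"
    "ja\thttps://example.org/c.jpg\t\n"
)
pairs = list(japanese_pairs(sample))
print(pairs)
```

For the real dataset, the same function could be pointed at an opened (and, if needed, decompressed) shard file instead of the in-memory sample.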

Available training scripts

As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training this model with some modifications.
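
One likely modification is the batch format: the encoder would receive pixel values instead of token ids, while the caption side stays the same as in any seq2seq setup. Below is a minimal NumPy sketch of that idea, under the assumption that captions are already tokenized and padded. The function names, the batch keys, and the token ids (pad = 0, start = 1, end = 2) are hypothetical and not taken from the script itself.

```python
import numpy as np


def shift_tokens_right(labels, decoder_start_token_id):
    """Build decoder inputs by shifting the labels one position to the right."""
    shifted = np.zeros_like(labels)
    shifted[:, 1:] = labels[:, :-1]
    shifted[:, 0] = decoder_start_token_id
    return shifted


def make_batch(pixel_values, labels, pad_token_id, decoder_start_token_id):
    """Assemble one training batch: images for the encoder, captions for the decoder."""
    return {
        "pixel_values": pixel_values,
        "decoder_input_ids": shift_tokens_right(labels, decoder_start_token_id),
        # Marks real caption tokens so that padding can be ignored in the loss.
        "decoder_attention_mask": (labels != pad_token_id).astype(np.int32),
        "labels": labels,
    }


# Two dummy images (channels-first, 224x224) and two padded captions.
pixel_values = np.zeros((2, 3, 224, 224), dtype=np.float32)
labels = np.array([[11, 12, 13, 2, 0, 0],
                   [21, 22, 2, 0, 0, 0]])
batch = make_batch(pixel_values, labels, pad_token_id=0, decoder_start_token_id=1)
print(batch["decoder_input_ids"])
print(batch["decoder_attention_mask"])
```

The decoder sees the caption shifted by one token and is trained to predict the unshifted caption, which is the usual teacher-forcing setup for seq2seq models.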

(Optional) Desired project outcome

The desired outcome is to see if pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Japanese language. This can be showcased with a streamlit or gradio app.

(Optional) Challenges

This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
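
To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy: the decoder's text hidden states act as queries, and the image encoder's patch embeddings act as keys and values. This is only an illustration of the mechanism, not the actual BERT or GPT2 code; the function names, the 0.02 init scale, and the sizes (5 text tokens, 197 image tokens, hidden size 768, head size 64) are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_cross_attention(rng, d_text, d_image, d_head, scale=0.02):
    """Randomly initialize the new weights (nothing here is pre-trained)."""
    return {
        "w_q": rng.normal(0.0, scale, (d_text, d_head)),   # queries from text
        "w_k": rng.normal(0.0, scale, (d_image, d_head)),  # keys from image
        "w_v": rng.normal(0.0, scale, (d_image, d_head)),  # values from image
        "w_o": rng.normal(0.0, scale, (d_head, d_text)),   # back to text width
    }


def cross_attention(params, text_states, image_states):
    """Let every text position attend over all image positions."""
    q = text_states @ params["w_q"]
    k = image_states @ params["w_k"]
    v = image_states @ params["w_v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # shape: (text_len, image_len)
    context = weights @ v
    # Residual connection keeps the pre-trained text representation intact.
    return text_states + context @ params["w_o"], weights


rng = np.random.default_rng(0)
text_states = rng.normal(size=(5, 768))     # stand-in for decoder hidden states
image_states = rng.normal(size=(197, 768))  # stand-in for encoder outputs
params = init_cross_attention(rng, d_text=768, d_image=768, d_head=64)
out, weights = cross_attention(params, text_states, image_states)
print(out.shape, weights.shape)
```

Because these weights start from random values while the rest of the decoder is pre-trained, they would have to be learned entirely during fine-tuning on the captioning data.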

(Optional) Links to read upon

This Keras example shows well how an image encoder and a transformer can be used for image captioning.
