Image captioning for French with pre-trained vision and text model

Image captioning with pre-trained vision and text model

For this project, a pre-trained image model like ViT can be used as an encoder, and a pre-trained text model like BERT and/or GPT2 can be used as a decoder.


Pre-trained ViT, BERT, and GPT2 models can be found on the model hub


The WIT dataset can be used for this task. It has almost over 1M image-text examples for the French language.

Available training scripts

As this will be a Seq2Seq model, the script can be used for training this model with some modifications.

(Optional) Desired project outcome

The desired outcome is to see if pre-trained vision and text models can be leveraged for image captioning and also train captioning models in French languages. This can be showcased directly with a streamlit or gradio app.

(Optional) Challenges

This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.

(Optional) Links to read upon

This keras example presents an excellent example of how image encoder and transformer can be used for image captioning.

1 Like

Hi, I am a bit late for deciding the project to work on. This one seems a good fit for me. Is it possible?
And one more question, there is another project named “Multilingual Image Captioning”. It has already 8 participants, so I don’t expect to be a new member for that project. However, would it be possible to join their discussions?

I leave a comment on the google sheet as requested by @patrickvonplaten

1 Like

Would be great if we could find another participant here :slight_smile: Will see if we can make this a single-person project tomorrow morning :slight_smile:

Let’s define it!

1 Like

@valhalla @patrickvonplaten It seems CamemBERT doesn’t have a Flax version. GPT2 Fr models don’t have neither.

Should I write a Flax version for CamemBERT and try to reuse PyTorch weights for it?

excuse me does the decoder of the language model deal with words or sentences to do the captioning . i mean did you use BERT for word embedding or sentence embedding … Thanks in advance