Image captioning for Spanish with pre-trained vision and text model

valhalla · June 23, 2021, 10:33am


For this project, a pre-trained image model like ViT can be used as an encoder, and a pre-trained text model like BERT and/or GPT2 can be used as a decoder.

Model

Pre-trained ViT and BERT models can be found on the model hub. We could use multilingual BERT/RoBERTa models for the Spanish language.

Datasets

The WIT dataset can be used for this task. It has roughly 1M image-text pairs for Spanish.
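
As a rough sketch of how the Spanish subset could be pulled out, assuming the WIT files are tab-separated and expose `language`, `image_url` and `caption_reference_description` columns (these column names are an assumption to check against the dataset documentation; the sample rows below are made up for illustration):

```python
import csv
import io


def spanish_pairs(tsv_file, language="es"):
    """Yield (image_url, caption) pairs for one language from a WIT-style TSV.

    The column names used here are assumptions about the WIT schema.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        caption = (row.get("caption_reference_description") or "").strip()
        # Keep only rows in the requested language that actually have a caption.
        if row.get("language") == language and caption:
            yield row["image_url"], caption


# Made-up sample rows, only to show the expected shape of the input.
sample = io.StringIO(
    "language\timage_url\tcaption_reference_description\n"
    "es\thttps://example.com/a.jpg\tUn perro negro\n"
    "en\thttps://example.com/b.jpg\tA black dog\n"
    "es\thttps://example.com/c.jpg\t\n"
)
pairs = list(spanish_pairs(sample))
print(pairs)
```

On the real files, the same generator could be fed an opened TSV file instead of the in-memory sample.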

Available training scripts

As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training this model with some modifications.

(Optional) Desired project outcome

The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Spanish language. This can be showcased with a streamlit or gradio app.

(Optional) Challenges

This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
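
To make the idea concrete, here is a minimal single-head cross-attention sketch in plain Python. It is not the BERT/GPT2 implementation from the library; the function names, shapes and the 0.02 init scale are illustrative assumptions. The text positions produce the queries, and the image patch states from the encoder provide the keys and values:

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.02):
    """Randomly initialised weights, standing in for the new layer's parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(text_states, image_states, w_q, w_k, w_v):
    """Single-head cross-attention: text positions attend over image patches."""
    queries = matmul(text_states, w_q)    # (text_len, d)
    keys = matmul(image_states, w_k)      # (num_patches, d)
    values = matmul(image_states, w_v)    # (num_patches, d)
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]   # (d, num_patches)
    scores = matmul(queries, keys_t)             # (text_len, num_patches)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


rng = random.Random(0)
hidden = 8
image_states = random_matrix(5, hidden, rng, scale=1.0)  # 5 image patches
text_states = random_matrix(3, hidden, rng, scale=1.0)   # 3 text tokens
w_q, w_k, w_v = (random_matrix(hidden, hidden, rng) for _ in range(3))
context, weights = cross_attention(text_states, image_states, w_q, w_k, w_v)
print(len(context), len(context[0]))  # one context vector per text token
```

Because these projection weights have no pre-trained counterpart in BERT or GPT2, they start from random values and have to be learned during fine-tuning.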

(Optional) Links to read upon

This Keras example shows well how an image encoder and a transformer can be used for image captioning.

dmatos2012 · June 23, 2021, 3:48pm

Hi,
I would like to join this project. I am not experienced enough with NLP to create a project on my own, but I do have experience with computer vision overall, and would like to try and see how well Spanish text is generated for image captioning.

I am David and I am in the Amsterdam timezone (CET). I have some experience with other frameworks, and with training/evaluating object detection networks, but not with NLP. I can contribute by putting my previous knowledge and discipline towards achieving this goal.

mrm8488 · June 23, 2021, 3:54pm

So you want to create a CLIP-like (dual encoder) model? @valhalla

valhalla · June 23, 2021, 5:13pm

Hey @dmatos2012 , don’t worry about experience. We always try to make things easier for everyone and we have a super cool speaker lineup for getting familiar with JAX/Flax/Transformers. And we will try to answer all questions:)

@mrm8488 For image captioning it’ll be more like an encoder-decoder model. The encoder will be an image model, and the decoder can be any transformer model with cross-attention, which will take the hidden_states from the image model and generate text auto-regressively.
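
A toy sketch of what "auto-regressively" means here, using greedy decoding. The scoring function below is a hypothetical stand-in for a real decoder forward pass (it ignores the image and simply replays a fixed caption), and the vocabulary is made up:

```python
VOCAB = ["<s>", "</s>", "un", "perro", "negro"]


def toy_next_token_scores(image_states, tokens):
    """Hypothetical stand-in for the decoder.

    A real decoder would cross-attend over image_states; this one just
    scores the next token of a fixed caption so the loop can be run.
    """
    target = [2, 3, 4, 1]  # "un perro negro </s>"
    step = len(tokens) - 1
    wanted = target[min(step, len(target) - 1)]
    return [1.0 if i == wanted else 0.0 for i in range(len(VOCAB))]


def greedy_decode(next_token_scores, image_states, bos_id, eos_id, max_len=16):
    """Generate one token at a time, feeding the tokens so far back in."""
    tokens = [bos_id]
    for _ in range(max_len):
        scores = next_token_scores(image_states, tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break  # stop once the end-of-sequence token is produced
    return tokens


token_ids = greedy_decode(toy_next_token_scores, None, bos_id=0, eos_id=1)
caption = " ".join(VOCAB[i] for i in token_ids[1:-1])
print(caption)
```

The encoder runs once per image, while the decoder is called once per generated token with the growing prefix.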

dmatos2012 · June 23, 2021, 9:29pm

Awesome @valhalla! thanks

Dimitre · June 27, 2021, 12:44am

Hey there, this is very interesting. I have some experience with NLP and computer vision, and always wanted to get more experience with multi-modal models (text + vision). Also, since I saw the WIT dataset for the first time, I have wanted to use it for some project, and this seems like a good opportunity.

If you want to know a little more about my background, check out my GitHub.

patrickvonplaten · June 29, 2021, 3:24pm

Great, let’s define the project! @mrm8488 let me know if you want to be added here as well

srisweet · June 30, 2021, 3:26am

Hello @valhalla @patrickvonplaten ,
Hope you are doing well…
I am interested in this project as well… So much to learn from doing this project… It is similar to CLIP. But in CLIP, we use encoders for both image & text… In this project, we are gonna use an encoder for the image & a decoder for the text.
Also, I have a suggestion that we could create a separate thread indexing all JAX/Flax learning resources together in one place… It will be useful for all of us to learn…
Cheers,
Sri Lakshmi

valhalla · June 30, 2021, 8:00am

Hey @srisweet !

Added you to the team.

Great idea, feel free to create a thread to post JAX/Flax resources!

dmatos2012 · June 30, 2021, 8:41am

Hi @valhalla . could you add me as well please? I replied at the top of the post :). Thank you!

valhalla · June 30, 2021, 8:58am

hi @dmatos2012, you are already part of the team. Here’s the sheet.

dmatos2012 · June 30, 2021, 9:00am

oh @valhalla apologies, did not see that. Sorry!

srisweet · June 30, 2021, 9:54am

Hi @valhalla ,
Just created this thread with JAX/Flax resources…

bhavitvyamalik · July 19, 2021, 8:13am

Hi Suraj,
Me and @gchhablani are currently working on Spanish image captioning with CLIP + Marian.
