Image captioning for Spanish with pre-trained vision and text model


For this project, a pre-trained image model like ViT can be used as the encoder, and a pre-trained text model like BERT or GPT2 can be used as the decoder.


Pre-trained ViT and BERT models can be found on the model hub. We could use multilingual BERT/RoBERTa models for the Spanish language.
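As a sketch of how the two pre-trained models could be paired, here is a minimal `VisionEncoderDecoderModel` built from tiny random configs (the small sizes are hypothetical, chosen only so the snippet runs without downloading weights; in the actual project you would use `from_encoder_decoder_pretrained` with the real checkpoints named below):

```python
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny configs so this runs without downloading pre-trained weights.
# For the real project, you would instead do something like:
#   VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
#       "google/vit-base-patch16-224-in21k", "bert-base-multilingual-cased")
encoder_config = ViTConfig(hidden_size=32, num_hidden_layers=2,
                           num_attention_heads=2, intermediate_size=64,
                           image_size=32, patch_size=8)
decoder_config = BertConfig(hidden_size=32, num_hidden_layers=2,
                            num_attention_heads=2, intermediate_size=64)

# This helper marks the decoder as a decoder and enables cross-attention.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config)
model = VisionEncoderDecoderModel(config=config)

print(model.config.decoder.add_cross_attention)  # True
```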


The WIT dataset can be used for this task. It has over 1M image-text pairs for Spanish.
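WIT is released as tab-separated files with one row per image-caption pair, so extracting the Spanish subset is a simple filter. A stdlib-only sketch over an inline sample is below; the `language` and `caption_reference_description` column names are taken from the released WIT schema, but verify them against the actual files you download:

```python
import csv
import io

# Inline sample standing in for one of the WIT TSV shards.
sample_tsv = (
    "language\timage_url\tcaption_reference_description\n"
    "es\thttp://example.com/a.jpg\tUn perro en la playa\n"
    "en\thttp://example.com/b.jpg\tA dog on the beach\n"
)

def spanish_pairs(tsv_text):
    """Yield (image_url, caption) pairs for Spanish rows with a caption."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["language"] == "es" and row["caption_reference_description"]:
            yield row["image_url"], row["caption_reference_description"]

pairs = list(spanish_pairs(sample_tsv))
print(pairs)  # [('http://example.com/a.jpg', 'Un perro en la playa')]
```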

Available training scripts

As this will be a Seq2Seq model, the existing Seq2Seq training script can be used for training this model with some modifications.

(Optional) Desired project outcome

The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Spanish language. This can be showcased with a Streamlit or Gradio app.

(Optional) Challenges

This model will require some modifications to the existing text models. Specifically, since this will be a seq2seq model, we'll need to add a randomly initialized cross-attention layer to BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
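In `transformers` this modification is exposed as config flags rather than manual surgery: setting `is_decoder=True` and `add_cross_attention=True` gives each BERT layer a randomly initialized cross-attention block. A minimal sketch with hypothetical tiny sizes (so it runs without pre-trained weights):

```python
from transformers import BertConfig, BertLMHeadModel

# is_decoder enables causal masking; add_cross_attention inserts a
# randomly initialized cross-attention layer into every BERT block.
config = BertConfig(hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64,
                    is_decoder=True, add_cross_attention=True)
decoder = BertLMHeadModel(config)

# Each layer now carries a cross-attention module alongside self-attention.
print(hasattr(decoder.bert.encoder.layer[0], "crossattention"))  # True
```

When loading an actual pre-trained checkpoint with this config, only the cross-attention weights are new; the rest of BERT keeps its pre-trained values.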

(Optional) Links to read upon

This Keras example shows how an image encoder and a transformer decoder can be combined for image captioning.


I would like to join this project. I am not experienced enough with NLP to create a project on my own, but I do have experience with computer vision overall, and would like to try and see how well Spanish text is generated for the image captioning.

I am David and I am in the Amsterdam timezone (CET). I have some experience with other frameworks and with training/evaluating object detection networks, but not with NLP. I can contribute by putting my previous knowledge and discipline towards achieving this goal :slight_smile:

So you want to create a CLIP-like (dual encoder) model? @valhalla

1 Like

Hey @dmatos2012, don’t worry about experience. We always try to make things easier for everyone, and we have a super cool speaker lineup for getting familiar with JAX/Flax/Transformers. And we will try to answer all questions :)

@mrm8488 For image captioning it’ll be more like an encoder-decoder model. The encoder will be an image model, and the decoder can be any transformer model with cross-attention, which will take hidden_states from the image model and generate text auto-regressively.
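That flow (image → encoder hidden_states → cross-attention → auto-regressive decoding) can be sketched end to end with an untrained toy model; the tiny sizes and the token ids passed to `generate` are hypothetical stand-ins, so the point is the data flow, not the caption quality:

```python
import torch
from transformers import (BertConfig, ViTConfig,
                          VisionEncoderDecoderConfig, VisionEncoderDecoderModel)

# Untrained toy model: tiny hypothetical sizes, random weights.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    ViTConfig(hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
              intermediate_size=64, image_size=32, patch_size=8),
    BertConfig(hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
               intermediate_size=64))
model = VisionEncoderDecoderModel(config=config)

# A fake "image"; the ViT encoder turns it into hidden_states, which the
# decoder attends to via cross-attention while generating token ids.
pixel_values = torch.randn(1, 3, 32, 32)
generated = model.generate(pixel_values, max_length=5,
                           decoder_start_token_id=101,  # [CLS] in BERT vocab
                           pad_token_id=0)
print(generated.shape)
```

With real pre-trained weights and a trained cross-attention layer, the generated ids would decode to an actual caption via the tokenizer.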

Awesome @valhalla! thanks

Hey there, this is very interesting. I have some experience with NLP and computer vision, and have always wanted to get more experience with multi-modal models (text + vision). Also, ever since I saw the WIT dataset for the first time, I have wanted to use it for some project, so this seems like a good opportunity.

If you want to know a little more about my background, check out my GitHub.

1 Like

Great, let’s define the project! @mrm8488 let me know if you want to be added here as well :slight_smile:

1 Like

Hello @valhalla @patrickvonplaten ,
Hope you are doing well…
I am interested in this project as well… so much to learn from doing it. It is similar to CLIP, but in CLIP we use encoders for both image & text, whereas in this project we are going to use an encoder for the image & a decoder for the text.
Also, I have a suggestion: we could create a separate thread indexing all JAX/Flax learning resources together in one place. It would be useful for all of us.
Sri Lakshmi

1 Like

Hey @srisweet !

Added you to the team :slight_smile:

Great idea, feel free to create a thread to post JAX/Flax resources!

1 Like

Hi @valhalla, could you add me as well please? I replied at the top of the post :). Thank you!

Hi @dmatos2012, you are already part of the team :slight_smile: Here’s the sheet

Oh @valhalla, apologies, I did not see that. Sorry!

Hi @valhalla,
Just created this thread with JAX/Flax resources.

1 Like

Hi Suraj,
@gchhablani and I are currently working on Spanish image captioning with CLIP + Marian.