Image captioning for Spanish with pre-trained vision and text models
For this project, a pre-trained image model like ViT can be used as the encoder, and a pre-trained text model like BERT or GPT2 can be used as the decoder.
Model
Pre-trained ViT and BERT models can be found on the model hub. We could use multilingual BERT/RoBERTa models for the Spanish language.
Datasets
The WIT dataset can be used for this task. It has over 1M image-text pairs for Spanish.
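Since WIT is distributed as large TSV files covering many languages, a first preprocessing step would be filtering down to the Spanish pairs. A minimal sketch, assuming the TSV columns `language`, `image_url`, and `caption_reference_description` (the tiny in-memory sample below is illustrative, not real WIT data):

```python
import csv
import io

# A tiny in-memory sample mimicking WIT's TSV layout; the real dump is a set
# of large TSV files with (among others) `language`, `image_url`, and
# `caption_reference_description` columns.
SAMPLE_TSV = """language\timage_url\tcaption_reference_description
es\thttp://example.com/a.jpg\tUn gato sobre la mesa
en\thttp://example.com/b.jpg\tA cat on the table
es\thttp://example.com/c.jpg\tUna playa al atardecer
"""

def spanish_pairs(tsv_text):
    """Keep only the Spanish image-text pairs from a WIT-style TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["image_url"], row["caption_reference_description"])
        for row in reader
        if row["language"] == "es"
    ]
```

In practice the same filter would be applied while streaming over the full TSV shards rather than loading them into memory.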
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used for training this model with some modifications.
(Optional) Desired project outcome
The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train captioning models for the Spanish language. This can be showcased with a Streamlit or Gradio app.
(Optional) Challenges
This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
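To make the cross-attention idea concrete, here is a minimal single-head sketch in numpy: the decoder's hidden states act as queries over the image encoder's hidden states (e.g. ViT patch embeddings), with freshly random projection weights standing in for the newly added layer. This is a conceptual sketch, not the actual BERT/GPT2 implementation:

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, d_k=None):
    """Single-head cross-attention: decoder queries attend over encoder keys/values.

    decoder_states: (tgt_len, d_model) hidden states from the text decoder
    encoder_states: (src_len, d_model) hidden states from the image encoder
    """
    d_model = decoder_states.shape[-1]
    d_k = d_k or d_model
    rng = np.random.default_rng(0)
    # Randomly initialized projections, mirroring the fresh cross-attention
    # weights that would be added to BERT/GPT2.
    W_q = rng.normal(0.0, 0.02, (d_model, d_k))
    W_k = rng.normal(0.0, 0.02, (d_model, d_k))
    W_v = rng.normal(0.0, 0.02, (d_model, d_k))

    Q = decoder_states @ W_q
    K = encoder_states @ W_k
    V = encoder_states @ W_v

    scores = Q @ K.T / np.sqrt(d_k)                  # (tgt_len, src_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over image patches
    return weights @ V                               # (tgt_len, d_k)
```

In the real model, these projections would be multi-headed, trained jointly with the rest of the network, and inserted between the self-attention and feed-forward blocks of each decoder layer.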
(Optional) Links to read upon
This Keras tutorial is an excellent example of how an image encoder and a Transformer decoder can be used for image captioning.
Hi,
I would like to join this project. I am not experienced enough with NLP to create a project on my own, but I do have experience with computer vision overall, and would like to see how well Spanish text is generated for the image captioning.
I am David and I am in the Amsterdam timezone (CET). I have some experience with other frameworks and with training/evaluating object detection networks, but not with NLP. I can contribute by putting my previous knowledge and discipline towards achieving this goal.
Hey @dmatos2012, don’t worry about experience. We always try to make things easier for everyone, and we have a super cool speaker lineup for getting familiar with JAX/Flax/Transformers. And we will try to answer all questions :)
@mrm8488 For image captioning it’ll be more like an encoder-decoder model: the encoder will be an image model, and the decoder can be any transformer model with cross-attention, which will take the `hidden_states` from the image model and generate text auto-regressively.
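That auto-regressive generation step can be sketched as a greedy decoding loop conditioned on the encoder's hidden states. The `toy_step_fn` below is a hypothetical stand-in for a real decoder-with-cross-attention, just so the loop is runnable:

```python
import numpy as np

def greedy_caption(encoder_states, step_fn, bos_id=0, eos_id=1, max_len=10):
    """Greedy auto-regressive decoding conditioned on image hidden states.

    step_fn(tokens, encoder_states) must return logits over the vocabulary
    for the next token; in the real model this is the decoder forward pass.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(tokens, encoder_states)
        next_id = int(np.argmax(logits))   # greedy: pick the top token
        tokens.append(next_id)
        if next_id == eos_id:              # stop at end-of-sequence
            break
    return tokens

def toy_step_fn(tokens, encoder_states, vocab_size=5):
    """Toy stand-in for a decoder with cross-attention: scores each vocab
    entry by similarity between pooled image features and fixed embeddings."""
    rng = np.random.default_rng(len(tokens))  # deterministic per step
    pooled = encoder_states.mean(axis=0)
    embeddings = rng.normal(0.0, 1.0, (vocab_size, pooled.shape[0]))
    return embeddings @ pooled
```

In practice one would use beam search or sampling instead of pure greedy decoding, but the conditioning structure (image hidden states fed into every decoding step via cross-attention) is the same.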
Hey there, this is very interesting. I have some experience with NLP and computer vision, and have always wanted to get more experience with multi-modal (text + vision) models. Ever since I first saw the WIT dataset I have wanted to use it for a project, so this seems like a good opportunity.
If you want to know a little more about my background, check out my GitHub.
Hello @valhalla @patrickvonplaten,
Hope you are doing well…
I am interested in this project as well; there is so much to learn from doing it. It is similar to CLIP, but in CLIP we use encoders for both image & text, while in this project we will use an encoder for the image and a decoder for the text.
Also, I have a suggestion: we could create a separate thread indexing all the JAX/Flax learning resources together in one place. It would be useful for all of us.
Cheers,
Sri Lakshmi