We’re planning to use a ViT encoder with an mBART decoder and train them end-to-end for image captioning in multiple languages.
Model
Pre-trained ViT and mBART (will be merged soon) can be leveraged for our task.
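Once that support is merged, wiring the two together could look roughly like the sketch below. The class and checkpoint names are assumptions on our side, and the cross-attention weights connecting the two models would be randomly initialised and learned during fine-tuning.

```python
# Sketch only: assumes FlaxVisionEncoderDecoderModel is available in transformers
# and that these are the checkpoints we end up using.
from transformers import (
    FlaxVisionEncoderDecoderModel,
    MBart50TokenizerFast,
    ViTFeatureExtractor,
)

# Tie a pre-trained ViT encoder to a pre-trained mBART decoder; the decoder's
# cross-attention weights start randomly initialised and are learned during training.
model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "facebook/mbart-large-50",
)

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
```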
Datasets
Currently we’re thinking of using WIT. The COCO dataset could also be used, but its captions are only in English. The dataset can be changed later if we come across a more suitable one for this task.
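As a rough idea of how the data side could look with 🤗 datasets (the Hub id and the field names below are placeholders; the actual WIT schema needs to be checked on the dataset card):

```python
# Sketch only: the dataset id and the "language" field are assumptions.
from datasets import load_dataset

# WIT is very large, so streaming avoids downloading everything up front.
wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

# Hypothetical filter: keep only the languages in scope for the project.
LANGUAGES = {"en", "de", "fr", "hi"}

def in_scope(example):
    return example.get("language") in LANGUAGES

wit = wit.filter(in_scope)
```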
Available training scripts
Since this is a seq2seq model, the run_summarization_flax.py script can be adapted for training it with some modifications.
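The main change would be on the preprocessing side: the encoder input becomes pixel values from the image feature extractor instead of tokenised source text, while the captions are tokenised into labels exactly as the summarization script already does. A rough sketch, reusing the `feature_extractor` and `tokenizer` from the model sketch above and with made-up column names:

```python
from PIL import Image

def preprocess_fn(examples):
    # Encoder side: images -> pixel_values (replaces the tokenised source text
    # that run_summarization_flax.py normally feeds the encoder).
    images = [Image.open(p).convert("RGB") for p in examples["image_path"]]
    pixel_values = feature_extractor(images=images, return_tensors="np").pixel_values

    # Decoder side: captions -> labels, same as in the summarization script.
    labels = tokenizer(
        examples["caption"], max_length=64, padding="max_length", truncation=True
    ).input_ids

    return {"pixel_values": pixel_values, "labels": labels}
```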
(Optional) Desired project outcome
Our end use case would be to run this model on a video clip or movie, turning it into an accessibility tool for visually impaired people. This can be showcased with a Streamlit or Gradio app.
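For the demo, a minimal Gradio sketch could caption a single frame roughly like this (it assumes the `model`, `tokenizer` and `feature_extractor` from above, and Gradio’s exact API may differ between versions; for video we would sample frames first):

```python
# Sketch only: single-image captioning demo.
import gradio as gr

def caption(image, language="hi_IN"):
    pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values
    # mBART-50 expects the target-language code as the forced first generated token.
    output_ids = model.generate(
        pixel_values,
        forced_bos_token_id=tokenizer.lang_code_to_id[language],
        max_length=64,
        num_beams=4,
    ).sequences
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

gr.Interface(fn=caption, inputs=gr.Image(type="pil"), outputs="text").launch()
```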
We’re also planning to benchmark its zero-shot performance on VizWiz.
(Optional) Challenges
- Training an encoder-decoder model end-to-end in Flax/JAX will require some effort
- Data processing in case new datasets are available only in English (one possible approach is sketched below)
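For the second point, one option (not decided yet) would be to machine-translate the English captions with an off-the-shelf translation model; the checkpoint and column names below are just examples:

```python
# Sketch only: translating English captions into German with a MarianMT checkpoint;
# "caption" / "caption_de" are placeholder column names.
from transformers import pipeline

translate_en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def add_german_captions(batch):
    outputs = translate_en_de(batch["caption"], max_length=64)
    batch["caption_de"] = [o["translation_text"] for o in outputs]
    return batch

# e.g. coco = coco.map(add_german_captions, batched=True, batch_size=32)
```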
I am interested in this task.
Great idea!
One important thing is to define the scope of the project, specifically which and how many languages to use. We should decide this based on the time and compute constraints so that the project can be finished in time.
It’s also possible to use other multilingual models like mBERT or XLM-RoBERTa, but as those are encoder-only models we would need to add cross-attention layers to them.
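For reference, transformers can add cross-attention to an encoder-only checkpoint by flagging its config as a decoder (the new layers start untrained). A rough sketch with the PyTorch classes; the Flax side would need equivalent support:

```python
# Sketch only: turning XLM-RoBERTa into a decoder with cross-attention over the
# image encoder's hidden states; the added cross-attention layers are untrained.
from transformers import AutoConfig, AutoModelForCausalLM

decoder_config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
    is_decoder=True,           # causal (left-to-right) attention mask
    add_cross_attention=True,  # attend over the vision encoder outputs
)
decoder = AutoModelForCausalLM.from_pretrained("xlm-roberta-base", config=decoder_config)
```

As far as I know, the `from_encoder_decoder_pretrained` helpers set these flags on the decoder automatically when building an encoder-decoder model.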
I’m officially defining this project!
Added you to the team @bhavitvyamalik, @gchhablani, @nilshoehing, @knilakshan20, @Monda, @rays2pix, @abheesht, @Vaibhavbrkn.
Let me know if you have any comments.
Hey all,
I am a beginner with Transformers and Flax and want to get into Transformers via a project. Basically, I would like to work on getting a minimal training pipeline built and a model trained using Hugging Face’s infra.
I have a sufficient background in CV and deep learning (object detection, semantic segmentation, etc. via SSD, U-Net) in the autonomous driving space.
I do like this idea and would like to sign up. Do you think I can help out in some way?
I would like to work on this project too.