Multilingual Image Captioning

We’re planning to use a ViT encoder and an mBART decoder, and to train them end-to-end for image captioning in different languages.

Model

Pre-trained ViT and mBART models (support will be merged soon) can be leveraged for our task.
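For reference, a minimal sketch of how the two checkpoints could be wired together with the Flax vision-encoder-decoder wrapper in transformers; class availability depends on the transformers version, and the checkpoint names and language code below are assumptions rather than final choices:

```python
from transformers import (
    FlaxVisionEncoderDecoderModel,
    MBart50TokenizerFast,
    ViTImageProcessor,
)

# Compose a ViT encoder with an mBART decoder; the cross-attention weights that
# connect the two are freshly initialised and have to be learned during training.
model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed encoder checkpoint
    "facebook/mbart-large-50",            # assumed decoder checkpoint
)

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

# mBART generates in the language indicated by the decoder start token,
# so the target-language code has to be set per caption language.
model.config.decoder_start_token_id = tokenizer.convert_tokens_to_ids("en_XX")
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```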

Datasets

Currently we’re thinking of using WIT. The COCO dataset could also be used, but it has captions only in English. The dataset can be changed later if we come across a more suitable one for this task.
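As a rough idea of getting started with WIT, a processed version can be streamed from the Hub; the dataset id and column names below are assumptions and should be checked against whichever WIT release we settle on:

```python
from datasets import load_dataset

# Stream the dataset so the full multi-hundred-GB download isn't needed up front.
wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)  # assumed dataset id

example = next(iter(wit))
print(example.keys())  # inspect which caption/language fields this release provides
```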

Available training scripts

Since this is a Seq2Seq model, the run_summarization_flax.py script can be used for training it, with some modifications.
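The main modification is on the input side: the encoder consumes pixel_values instead of input_ids. A hedged sketch of what the adapted train step could look like (batch keys and function names are illustrative, not taken from the actual script):

```python
import jax
import jax.numpy as jnp
import optax

def caption_loss(logits, labels, pad_token_id):
    """Token-level cross-entropy over the caption, ignoring padding positions."""
    mask = (labels != pad_token_id).astype(jnp.float32)
    ce = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    return (ce * mask).sum() / mask.sum()

def train_step(state, batch, dropout_rng, pad_token_id):
    """One update step, analogous to run_summarization_flax.py but with image inputs."""
    dropout_rng, new_dropout_rng = jax.random.split(dropout_rng)

    def compute_loss(params):
        outputs = state.apply_fn(
            pixel_values=batch["pixel_values"],            # images replace input_ids
            decoder_input_ids=batch["decoder_input_ids"],  # right-shifted caption tokens
            params=params,
            dropout_rng=dropout_rng,
            train=True,
        )
        return caption_loss(outputs.logits, batch["labels"], pad_token_id)

    loss, grads = jax.value_and_grad(compute_loss)(state.params)
    return state.apply_gradients(grads=grads), loss, new_dropout_rng
```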

(Optional) Desired project outcome

Our end use case would be to run this model on a video clip or movie, making it an accessibility tool for visually challenged people. This can be showcased with a Streamlit or Gradio app.
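For the demo, a Gradio interface around the trained model could look roughly like this; it assumes the model, image_processor and tokenizer from the model sketch above are already loaded, and for video one would sample frames and caption each of them:

```python
import gradio as gr

def caption(image):
    # Preprocess the uploaded image and generate a caption with beam search.
    pixel_values = image_processor(images=image, return_tensors="np").pixel_values
    output_ids = model.generate(pixel_values, max_length=48, num_beams=4).sequences
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

gr.Interface(fn=caption, inputs=gr.Image(type="pil"), outputs="text").launch()
```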

We’re also planning to benchmark its zero-shot performance on VizWiz.
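Scoring the zero-shot captions could reuse standard captioning/translation metrics; a small sketch with sacreBLEU via the evaluate library (loading VizWiz images and reference captions is left out, and the example strings are made up):

```python
import evaluate

bleu = evaluate.load("sacrebleu")

predictions = ["a dog runs across a grassy field"]  # model outputs (made-up example)
references = [[
    "a dog running through the grass",
    "a brown dog playing in a field",
]]  # multiple human references per image

print(bleu.compute(predictions=predictions, references=references)["score"])
```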

(Optional) Challenges

- Training the encoder-decoder model end-to-end in Flax/JAX will require some effort
- Data processing in case new datasets are available only in English (see the sketch below)
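For the second point, one possible approach is to machine-translate English captions with the mBART-50 MT checkpoint; whether translated captions are good enough as training targets is an open question:

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

mt_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
mt_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

def translate(caption_en, tgt_lang="hi_IN"):
    """Translate an English caption into the target language (mBART-50 language codes)."""
    mt_tokenizer.src_lang = "en_XX"
    inputs = mt_tokenizer(caption_en, return_tensors="pt")
    generated = mt_model.generate(
        **inputs, forced_bos_token_id=mt_tokenizer.convert_tokens_to_ids(tgt_lang)
    )
    return mt_tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("A dog playing in the park."))
```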


I am in :stuck_out_tongue:

@abheesht @Vaibhavbrkn @nilshoehing have also shown interest via this sheet: JAX/Flax Vision-and-Language Projects - Google Sheets

Others are welcome to join! :slight_smile:


Yes, I am also in!

1 Like

I am also in :wink:


Consider me in on this :slight_smile:

I am interested in this task.

We can also use this dataset: GitHub - bharathichezhiyan/multimodalmachinetranslation-Tamil: Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription

Yep, I’m in, @bhavitvyamalik, @gchhablani :slight_smile:


Great Idea!

One important thing is to define the scope of the project, specifically which and how many languages to use. We should decide this based on the time and compute constraints so that the project can be finished in time :slight_smile:

It’s also possible to use other multilingual models like mBERT or XLM-RoBERTa, but as those are encoder-only models, we will need to add cross-attention layers to them (see the sketch below).
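A hedged sketch of what that would involve: an encoder-only checkpoint can be re-used as a decoder by switching on causal masking and cross-attention in its config before loading it in an encoder-decoder wrapper, with the extra cross-attention weights starting out randomly initialised. The PyTorch classes are used here for illustration; the Flax equivalent would be analogous, provided a Flax causal-LM head exists for the chosen checkpoint in your transformers version.

```python
from transformers import AutoConfig, VisionEncoderDecoderModel

# Re-use an encoder-only multilingual checkpoint as the caption decoder by enabling
# causal masking and (randomly initialised) cross-attention layers in its config.
decoder_config = AutoConfig.from_pretrained("xlm-roberta-base")
decoder_config.is_decoder = True
decoder_config.add_cross_attention = True

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "xlm-roberta-base",
    decoder_config=decoder_config,
)
```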

I’m officially defining this project!
Added you to the team @bhavitvyamalik, @gchhablani, @nilshoehing, @knilakshan20, @Monda, @rays2pix, @abheesht, @Vaibhavbrkn.

Let me know if you have any comments :slight_smile:


Hey all,
I am a beginner with Transformers and FLAX :sweat_smile: and want to get into Transformers via a project. Basically, I would like to work on getting a minimal training pipeline built and a model trained using Hugging Face’s infra.

I have a sufficient background in CV and deep learning (object detection, semantic segmentation, etc. via SSD, U-Net) in the autonomous driving space.
I do like this idea and would like to sign up. Do you think I can help out in some way?

I would like to work on this Project too.