Multilingual Image Captioning

bhavitvyamalik · June 29, 2021, 6:02pm

We’re planning to use ViT encoder, mBART decoder and train them end-to-end for image captioning in different languages.

Model

Pre-trained ViT, mBART (will be merged soon) can be leveraged for our task.

Datasets

Currently we’re thinking of using WIT. COCO dataset can also be used but it has caption only in English language. Dataset can be changed later if we come across more suitable dataset for this task.

Available training scripts

Since this is a Seq2Seq model, the run_summarization_flax.py script can be used for training this model with some modifications.

(Optional) Desired project outcome

Our end use case would be to run this model on a video clip or movie and make it like an accessibility tool for visually challenged people. This can be showcased with a streamlit or gradio app.

We’re also planning to benchmark its zero-shot performance on VizWiz

(Optional) Challenges

-Training encoder-decoder model end-to-end in FLAX/JAX will require some effort
-Data Processing incase new datasets are available only in English

gchhablani · June 29, 2021, 6:07pm

I am in

@abheesht @Vaibhavbrkn @nilshoehing have also shown interest in this sheet: JAX/Flax Vision-and-Language Projects - Google Sheets

Others are welcome to join!

nilshoehing · June 29, 2021, 6:19pm

Yes, I am also in!

knilakshan20 · June 30, 2021, 3:45am

I am also in

Monda · June 30, 2021, 4:10am

consider me on this

rays2pix · June 30, 2021, 4:48am

I am interested in this task.

rays2pix · June 30, 2021, 4:50am

We can also use this dataset GitHub - bharathichezhiyan/multimodalmachinetranslation-Tamil: Multilingual Multimodal Machine Translation for Dravidian Languages utilizing Phonetic Transcription

abheesht · June 30, 2021, 8:45am

Yep, I’m in, @bhavitvyamalik, @gchhablani

valhalla · June 30, 2021, 10:13am

Great Idea!

One important thing is to define the scope of the project, specifically which and how many languages to use. We should decide this based on the time and compute constraint so that the project can be finished in time

It’s also possible to use other multilingual models like mBERT of XLM-ROBERTa, but as those are encoder-only models we will need to add a cross-attention layer in it.

I’m officially defining this project!
Added you the team @bhavitvyamalik , @gchhablani , @nilshoehing , @knilakshan20 , @Monda , @rays2pix , @abheesht , @Vaibhavbrkn

Let me know if you have any comments

asharma85 · July 2, 2021, 11:13am

Hey all
I am a beginner with Transformers and FLAX and want to get into Transformers via a project. Basically I would like to be able to work on getting a minimal training pipeline built and a model trained using HuggingFace’s infra

I have sufficient background in CV and Deep Learning (Object detection.Semantic Seg etc via SSD, U-Net) in the autonomous driving space
I do like this idea and would like to sign up. Do you think I can help out in some way?

tejas2811 · July 6, 2021, 12:01pm

I would like to work on this Project too.

Topic		Replies	Views
Image captioning for Indonesia with pre-trained vision Flax/JAX Projects	4	485	June 29, 2021
Image captioning for French with pre-trained vision and text model Flax/JAX Projects	6	2159	January 4, 2022
Multilingual Visual Question Answering Flax/JAX Projects	8	905	July 2, 2021
Image captioning for Japanese with pre-trained vision and text model Flax/JAX Projects	0	1165	June 23, 2021
Image captioning for Spanish with pre-trained vision and text model Flax/JAX Projects	13	2477	July 19, 2021