Multilingual Visual Question Answering

We are currently planning to use ViT + mBERT to handle the two modalities, images and text, for the Multilingual Visual Question Answering task.


The pre-trained models ViT and mBERT can be used for this task.

4. Datasets

So far we have come across one dataset, WIT, which is a large multimodal multilingual dataset. We can update the list if we encounter more suitable datasets for this task.

5. Training scripts

Since this is a Seq2Seq model, the script can be used for training this model with some modifications.

6. (Optional) Challenges

  • How to combine the two modalities, i.e. image and text. One way to do this is the VisualBERT approach: here we can use ViT for images and feed its embeddings from an earlier stage to mBERT along with the word embeddings. Another way is to let the two modalities interact through co-attentional transformer layers, as in ViLBERT. We need to decide which approach is better suited to our task.
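To make the two options a bit more concrete, here is a minimal JAX sketch of the shapes involved. It is only an illustration under several assumptions: the random arrays stand in for ViT patch embeddings and mBERT word embeddings, the function names (`early_fusion`, `co_attention`), the projection `w_proj`, and the dimensions are hypothetical, and the attention is a single head without the learned query/key/value projections, residuals, or layer norms that VisualBERT and ViLBERT actually use.

```python
import jax
import jax.numpy as jnp


def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v


def early_fusion(patch_emb, word_emb, w_proj):
    """VisualBERT-style idea: project image patches into the text embedding
    space and concatenate them with the word embeddings into one sequence,
    which a single text encoder would then process."""
    visual_tokens = patch_emb @ w_proj  # (n_patches, d_text)
    return jnp.concatenate([word_emb, visual_tokens], axis=0)


def co_attention(patch_emb, word_emb, w_proj):
    """ViLBERT-style idea: keep two streams, where each stream's queries
    attend to the other stream's keys and values."""
    visual_tokens = patch_emb @ w_proj
    text_out = attention(word_emb, visual_tokens, visual_tokens)  # text attends to image
    image_out = attention(visual_tokens, word_emb, word_emb)      # image attends to text
    return text_out, image_out


# Random stand-ins for encoder outputs; all sizes are illustrative assumptions.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
n_patches, n_words, d_image, d_text = 196, 12, 768, 768
patch_emb = jax.random.normal(k1, (n_patches, d_image))
word_emb = jax.random.normal(k2, (n_words, d_text))
w_proj = jax.random.normal(k3, (d_image, d_text)) / jnp.sqrt(d_image)

fused = early_fusion(patch_emb, word_emb, w_proj)
text_out, image_out = co_attention(patch_emb, word_emb, w_proj)
print(fused.shape, text_out.shape, image_out.shape)
```

With early fusion there is one joint sequence of length `n_words + n_patches`, so the text encoder can be reused mostly as-is. With co-attention each modality keeps its own sequence length and extra cross-modal layers have to be added and trained.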

7. (Optional) Desired project outcome

The final goal is to have an end-to-end model that can perform the Visual QA task in multiple languages.
We are also planning to benchmark few-shot/zero-shot performance on VQA/VizWiz/GQA.

Count me in!

Awesome, let’s finalize this project :slight_smile:

Great project. Looking forward to working with you.

I want to join this project. I hope I’m not late.

Added you guys!

Hey all
I am a beginner with Transformers and Flax :sweat_smile: and want to get into Transformers via a project. Basically, what I would like to take away is an understanding of what it means to fine-tune a pre-trained model for another task using HuggingFace’s infrastructure/pipeline, and to learn some Flax basics.

I have sufficient background in CV and Deep Learning (object detection, semantic segmentation, etc. via SSD, U-Net) in the autonomous driving space using PyTorch.
I do like this idea and would like to sign up. Do you think I can help out in some way?

Hi guys, I would also like to be a part of this project.

Hi @patrickvonplaten, can you add me to this Multilingual Visual QA project?