Multilingual Visual Question Answering
We are currently planning to use ViT + mBERT to handle the two modalities, images and text, for the multilingual Visual Question Answering task.
3. Model
Pre-trained ViT and mBERT models can be used for this task.
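Both encoders have Flax implementations in Hugging Face Transformers. Below is a minimal sketch of loading them and checking the shapes of their embeddings; the exact checkpoint names are our assumption of commonly used public ones, not a final choice.

```python
import jax.numpy as jnp
from transformers import FlaxViTModel, FlaxBertModel, BertTokenizerFast

# Vision encoder: ViT pre-trained on ImageNet-21k (assumed checkpoint).
vit = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Text encoder: multilingual BERT (assumed checkpoint).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
mbert = FlaxBertModel.from_pretrained("bert-base-multilingual-cased")

# Dummy forward passes to inspect the embedding shapes of the two modalities.
pixel_values = jnp.zeros((1, 3, 224, 224))             # (batch, channels, height, width)
image_features = vit(pixel_values).last_hidden_state   # (1, 197, 768): [CLS] + 196 patches

text_inputs = tokenizer("¿Qué hay en la imagen?", return_tensors="jax")
text_features = mbert(**text_inputs).last_hidden_state  # (1, seq_len, 768)
```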
4. Datasets
So far we have come across one suitable dataset, WIT (Wikipedia-based Image Text), a large multimodal multilingual dataset. We can update this list if we encounter more suitable datasets for this task.
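As a sketch, the data could be streamed with the `datasets` library. The Hub identifier below is an assumption and may need to be swapped for whichever mirror of WIT we settle on.

```python
from datasets import load_dataset

# Stream WIT rather than downloading it fully, since the dataset is very large.
# "wikimedia/wit_base" is an assumed Hub identifier for a WIT mirror.
wit = load_dataset("wikimedia/wit_base", split="train", streaming=True)

# Peek at one example to see which image/caption/language fields are available.
example = next(iter(wit))
print(example.keys())
```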
5. Training scripts
Since this is a seq2seq model, the run_summarization_flax.py script can be used for training with some modifications (see the sketch below).
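As a rough illustration of the kind of modification needed, the script's preprocessing step would have to attach image features to each batch in addition to the tokenised question/answer pair. The helper below is hypothetical (not part of run_summarization_flax.py), and the column names `question`, `answer`, and `image` are assumptions about how the data will be organised.

```python
from transformers import BertTokenizerFast, ViTImageProcessor

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

def preprocess_function(examples):
    # Questions become encoder text inputs, answers become decoder labels,
    # as in the summarization script.
    model_inputs = tokenizer(
        examples["question"], max_length=64, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        examples["answer"], max_length=16, truncation=True, padding="max_length"
    )
    model_inputs["labels"] = labels["input_ids"]

    # Extra step compared to the summarization script: image preprocessing
    # so each batch also carries `pixel_values` for the vision encoder.
    model_inputs["pixel_values"] = image_processor(
        examples["image"], return_tensors="np"
    )["pixel_values"]
    return model_inputs
```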
6. (Optional) Challenges
- How to combine the two modalities, i.e. image and text? One option, as in VisualBERT, is early fusion: use ViT to obtain image embeddings and feed them into mBERT alongside the word embeddings. Another option is to let the two modalities interact through co-attentional transformer layers, as in ViLBERT. We need to decide which approach is better suited to our task; a toy sketch of the first option follows below.
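To make the first option concrete, here is a toy flax.linen sketch of early fusion, assuming both encoders produce 768-dimensional features. The single attention layer only stands in for the shared (mBERT-initialised) transformer stack, and all names and sizes are illustrative.

```python
import flax.linen as nn
import jax
import jax.numpy as jnp


class EarlyFusion(nn.Module):
    hidden_size: int = 768
    num_heads: int = 12

    @nn.compact
    def __call__(self, image_embeds, text_embeds):
        # Project ViT patch features into the text embedding space.
        image_embeds = nn.Dense(self.hidden_size)(image_embeds)
        # Concatenate along the sequence axis: [image tokens ; text tokens].
        fused = jnp.concatenate([image_embeds, text_embeds], axis=1)
        # Stand-in for the shared transformer layers that attend over both modalities.
        fused = nn.SelfAttention(num_heads=self.num_heads)(fused)
        return fused


# Shape check with dummy inputs: 197 image tokens + 32 text tokens.
model = EarlyFusion()
image_embeds = jnp.zeros((1, 197, 768))
text_embeds = jnp.zeros((1, 32, 768))
params = model.init(jax.random.PRNGKey(0), image_embeds, text_embeds)
out = model.apply(params, image_embeds, text_embeds)  # (1, 229, 768)
```

The co-attention alternative would instead keep the two streams separate and interleave cross-attention layers between them, which is heavier but keeps each modality's representation intact.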
7. (Optional) Desired project outcome
The final goal is to have an end-to-end model that can perform the visual QA task in multiple languages.
We are also planning to benchmark the few-shot/zero-shot performance on VQA, VizWiz, and GQA.