Vision-Language Project Ideas

After some brainstorming, we came up with the following project ideas. We would love some feedback/opinions on them.

  • ViT + mBART - Multilingual Image Captioning (WIT pre-train)
  • ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA)
  • Use CLIP/VQGAN for Image Synthesis. A project for GIF generation has already been proposed; we could train on a different dataset/domain.
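
To make the encoder-decoder ideas above a bit more concrete, here is a toy numpy sketch of the mechanism the captioning/VQA ideas rely on: cross-attention from a text decoder (mBART/mBERT-style) over ViT patch features. All shapes, matrices, and names here are illustrative stand-ins, not the real pretrained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 196 ViT patch features of dim 32, single attention head.
num_patches, d = 196, 32

patch_feats = rng.normal(size=(num_patches, d))   # stand-in for ViT encoder output
dec_state = rng.normal(size=d)                    # one decoder position's hidden state

# Toy projection matrices (learned per layer/head in the real models).
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
W_v = rng.normal(size=(d, d)) / np.sqrt(d)

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Cross-attention: the text decoder queries the image patches.
q = W_q @ dec_state          # query from the decoder state
k = patch_feats @ W_k.T      # keys from the patch features
v = patch_feats @ W_v.T      # values from the patch features
weights = softmax(k @ q / np.sqrt(d))   # attention over the 196 patches
context = weights @ v        # image-conditioned context vector for this token
```

In the real models these projections are learned per layer and per head, `patch_feats` comes from a pretrained ViT, and the decoder mixes `context` into its next-token prediction; this sketch only shows the data flow.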

Possible modifications of CLIP/VQGAN:

  • Multilingual CLIP model for image/sentence matching + Image Generation using VQGAN for this dataset.
  • FashionCLIP - train CLIP for fashion image-text matching, plus a VQGAN trained to generate dresses/shirts/glasses from a description. A suitable dataset may be hard to find, I guess.
  • Scene generation - CLIP + VQGAN, trained on text-scene dataset for scene (movie/landscape) generation.
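
For anyone new to the CLIP + VQGAN generation loop behind these bullets, here is a toy numpy sketch of the core idea: keep CLIP and the VQGAN decoder fixed, and optimize the latent so that the decoded image's CLIP embedding matches the text embedding. The linear "models" here are random stand-ins purely for illustration, not the real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (NOT the real models):
W_dec = rng.normal(size=(64, 16)) * 0.1   # "VQGAN decoder": latent -> image
W_img = rng.normal(size=(8, 64)) * 0.1    # "CLIP image tower": image -> embedding
text_emb = rng.normal(size=8)             # "CLIP text embedding" of the prompt

def decode(z):
    return W_dec @ z

def embed_image(x):
    return W_img @ x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

z = rng.normal(size=16)                   # the latent we optimize
score0 = cosine(embed_image(decode(z)), text_emb)

lr = 0.1
for _ in range(300):
    e = embed_image(decode(z))
    ne, nt = np.linalg.norm(e), np.linalg.norm(text_emb)
    # Gradient of cos(e, text_emb) w.r.t. e, chained through the linear maps.
    d_cos_de = text_emb / (ne * nt) - (e @ text_emb) * e / (ne**3 * nt)
    z += lr * (W_dec.T @ (W_img.T @ d_cos_de))   # gradient ascent on similarity

score = cosine(embed_image(decode(z)), text_emb)  # should beat score0
```

The real pipeline is this same loop with autodiff (e.g. JAX/PyTorch gradients) through the actual VQGAN decoder and CLIP image encoder, usually with image augmentations before scoring.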

Alternatively, we could do similar things for video+text, following the example of VideoBERT and other video transformers, maybe?

Please comment here if you are interested in collaboration. We are a team of 5 as of now (timezone GMT+5:30). Would also love any suggestions to improve these ideas.

Thanks,
Gunjan

7 Likes

@gchhablani very interesting ideas. I have experience in CV and NLP and would like to contribute to this project.

1 Like

@gchhablani
I am interested in the VQA project; I have intermediate domain experience in both vision and NLP.
What kind of expertise are you looking for, if the team is not complete yet?
Thanks

1 Like

Hi @knilakshan20
We will choose one of these topics after discussing. If you're up for the other projects too, it would be great to have you with us. Otherwise, once we have finalized a project, we will share another post on the forum, and you can decide then.

Hi @gchhablani ,

These projects (all three) sound great. I am interested in contributing and participating. I have expertise in both vision and NLP, although not directly with ViT/CLIP/VQGAN, and I am hoping that joining this project gives me an opportunity to learn by doing.

1 Like

Hi @Sasikanth
Is there a specific one you are interested in? Which ones do you think are nice? And what can be improved? Any other suggestions that you have?

We will pick one of these and add another post. If enough people reply here, we could pick two projects and work on them in separate teams, wdyt?

Hi @gchhablani ,

I think the following two are interesting ones to go with:

  • ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA)
  • modifications of CLIP/VQGAN: Scene generation - CLIP + VQGAN, trained on text-scene dataset for scene (movie/landscape) generation / video+text, following the example of VideoBERT

I am just wondering what the dataset for scene generation could be. Should we also look at VideoBERT for the datasets used there? Are there any references/materials we can start looking into? Please advise.

1 Like

Hi, I’m interested in these projects, especially replicating VideoBERT as that is my research topic. Can I join in with you guys?

1 Like

Hi @gchhablani

I am interested in the ViT + mBERT - Multilingual Visual Question Answering project. I have read a related paper (a bit old, published in 2019); their model is not limited to the VQA task, and we can fine-tune it. You can look at it here: GitHub - facebookresearch/vilbert-multi-task: Multi Task Vision and Language

2 Likes

I am comfortable with image captioning too.
I have some experience in these projects.
I am interested in learning about the VideoBERT model.

1 Like

Hey there! I am also very interested in multi-modal models; I liked the first two project ideas a lot.

  • ViT + mBART - Multilingual Image Captioning (WIT pre-train)
  • ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA)

I would love to join the team; if you want to know a little more about my background, check out my GitHub.

2 Likes

Hi @gchhablani
I find the ideas very interesting and would like to contribute to either of the projects, ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA) or CLIP/VQGAN for Image Synthesis. Please let me know if I can be part of the team.

1 Like

Hey Gunjan, I would like to be added to this project. It looks really interesting, and I would love to share my expertise too.

1 Like

Hey guys, these are great ideas! It would be nice if you could open different threads for the different projects, and have people comment there if they want to be part of that project, so that we can keep track of teams and projects. Would be nice to do this before Wednesday :slight_smile:

1 Like

Hi everyone!
Thanks for showing interest. Everyone has different interests/expertise, which could be very useful. However, since we only have one week, we can only pursue a few ideas.

I am thinking we can pick two of these. I am maintaining this tiny sheet based on which we might be able to divide ourselves into two teams:

Please fill it in by tomorrow so the teams can meet, decide on the specifics of the projects by Wednesday, and post them here.

Right now we are a total of 14 people, so I am thinking 7/7 will be ideal, unless some of you aren’t interested in this anymore and have found something better :slight_smile:

EDIT:
Please also join the Discord group here: Flax-HuggingFace-Community-Week

1 Like

I find the ideas very interesting and would like to contribute to either of the projects, ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA) or CLIP/VQGAN for Image Synthesis. I am really interested in collaborating and contributing to this project.

2 Likes

Hi, I would also like to join the image captioning or the VQA topic, if that is still possible. I have some experience with NLP and TensorFlow, and I am located in GMT+2.

1 Like

Hi @gchhablani

Please let me know when you have formed the teams and projects so I can add you to the sheet.

1 Like