Vision-Language Project Ideas

After some brainstorming, we came up with the following project ideas. We would love some feedback/opinions on them.

  • ViT + mBART - Multilingual Image Captioning (WIT pre-train)
  • ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA)
  • Use CLIP/VQGAN for Image Synthesis. A project for GIF generation has already been proposed; we could train on a different dataset/domain.
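
To make the encoder-decoder ideas above a bit more concrete, here is a toy numpy sketch of the mechanism the captioning/VQA ideas rely on: cross-attention from a text decoder (mBART/mBERT-style) over ViT patch features. All shapes, matrices, and names here are illustrative stand-ins, not the real pretrained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 196 ViT patch features of dim 32, single attention head.
num_patches, d = 196, 32

patch_feats = rng.normal(size=(num_patches, d))   # stand-in for ViT encoder output
dec_state = rng.normal(size=d)                    # one decoder position's hidden state

# Toy projection matrices (learned per layer/head in the real models).
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
W_v = rng.normal(size=(d, d)) / np.sqrt(d)

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Cross-attention: the text decoder queries the image patches.
q = W_q @ dec_state          # query from the decoder state
k = patch_feats @ W_k.T      # keys from the patch features
v = patch_feats @ W_v.T      # values from the patch features
weights = softmax(k @ q / np.sqrt(d))   # attention over the 196 patches
context = weights @ v        # image-conditioned context vector for this token
```

In the real models these projections are learned per layer and per head, `patch_feats` comes from a pretrained ViT, and the decoder mixes `context` into its next-token prediction; this sketch only shows the data flow.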

Possible modifications of CLIP/VQGAN:

  • Multilingual CLIP model for image/sentence matching + Image Generation using VQGAN for this dataset.
  • FashionCLIP - train CLIP for fashion image-text matching, plus a VQGAN trained to generate dresses/shirts/glasses from a description. A suitable dataset may be hard to find, I guess.
  • Scene generation - CLIP + VQGAN, trained on text-scene dataset for scene (movie/landscape) generation.
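
For anyone new to the CLIP + VQGAN generation loop behind these bullets, here is a toy numpy sketch of the core idea: keep CLIP and the VQGAN decoder fixed, and optimize the latent so that the decoded image's CLIP embedding matches the text embedding. The linear "models" here are random stand-ins purely for illustration, not the real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (NOT the real models):
W_dec = rng.normal(size=(64, 16)) * 0.1   # "VQGAN decoder": latent -> image
W_img = rng.normal(size=(8, 64)) * 0.1    # "CLIP image tower": image -> embedding
text_emb = rng.normal(size=8)             # "CLIP text embedding" of the prompt

def decode(z):
    return W_dec @ z

def embed_image(x):
    return W_img @ x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

z = rng.normal(size=16)                   # the latent we optimize
score0 = cosine(embed_image(decode(z)), text_emb)

lr = 0.1
for _ in range(300):
    e = embed_image(decode(z))
    ne, nt = np.linalg.norm(e), np.linalg.norm(text_emb)
    # Gradient of cos(e, text_emb) w.r.t. e, chained through the linear maps.
    d_cos_de = text_emb / (ne * nt) - (e @ text_emb) * e / (ne**3 * nt)
    z += lr * (W_dec.T @ (W_img.T @ d_cos_de))   # gradient ascent on similarity

score = cosine(embed_image(decode(z)), text_emb)  # should beat score0
```

The real pipeline is this same loop with autodiff (e.g. JAX/PyTorch gradients) through the actual VQGAN decoder and CLIP image encoder, usually with image augmentations before scoring.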

Alternatively, we could do similar things for video+text, following the example of VideoBERT and other video transformers, maybe?

Please comment here if you are interested in collaboration. We are a team of 5 as of now (timezone GMT+5:30). Would also love any suggestions to improve these ideas.

Thanks,
Gunjan

7 Likes

@gchhablani very interesting ideas. I have experience in CV and NLP and would like to contribute to this project.

1 Like

@gchhablani
I am interested in the VQA project; I have intermediate domain experience in both vision and NLP.
What kind of expertise are you looking for, if the team is not complete yet?
Thanks

1 Like

Hi @knilakshan20
We will choose one of these topics after discussing. If you're up for the other projects too, it would be great to have you with us. Otherwise, once we have finalized a project, we will share another post on the forum, and you can decide then.

Hi @gchhablani ,

These projects (all three) sound great. I am interested in contributing and participating. I have expertise in both vision and NLP, although not directly with ViT/CLIP/VQGAN, and I am hoping that joining this project gives me an opportunity to learn by doing.

1 Like

Hi @Sasikanth
Is there a specific one you are interested in? Which ones do you think are nice? And what can be improved? Any other suggestions that you have?

We will pick one of these and add another post. If enough people reply here, we could pick two projects and work on them in separate teams, wdyt?

Hi @gchhablani ,

I think the following two are interesting ones to go with:

  • ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA)
  • modifications of CLIP/VQGAN: Scene generation - CLIP + VQGAN, trained on text-scene dataset for scene (movie/landscape) generation / video+text, following the example of VideoBERT

I am just wondering what the dataset for scene generation could be. Should we also look at VideoBERT for the datasets used there? Are there any references/materials we can start looking into? Please advise.

1 Like

Hi, I’m interested in these projects, especially replicating VideoBERT as that is my research topic. Can I join in with you guys?

1 Like

Hi @gchhablani

I am interested in the ViT + mBERT - Multilingual Visual Question Answering project. I have read a related paper (a bit old, published in 2019); their model is not limited to the VQA task, and we can fine-tune it. You can look at it here: GitHub - facebookresearch/vilbert-multi-task: Multi Task Vision and Language

2 Likes

I am comfortable with image captioning too.
I have some experience in these projects.
I am interested in learning about the VideoBERT model.

1 Like

Hey there! I am also very interested in multi-modal models; I liked the first two project ideas a lot.

  • ViT + mBART - Multilingual Image Captioning (WIT pre-train)
  • ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA)

I would love to join the team; if you want to know a little more about my background, check out my GitHub.

2 Likes

Hi @gchhablani
I find the ideas very interesting and would like to contribute to either of the projects, ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA) or CLIP/VQGAN for Image Synthesis. Please let me know if I can be part of the team.

1 Like

Hey Gunjan, I would like to be added to this project. It looks really interesting, and I would love to share my expertise too.

1 Like

Hey guys, these are great ideas! It would be nice if you could open different threads for the different projects, and have people comment there if they want to be part of that project, so that we can keep track of teams and projects. Would be nice to do this before Wednesday :slight_smile:

1 Like

Hi everyone!
Thanks for showing interest. Everyone has different interests/expertise, which could be very useful. However, since we only have one week, we can only pursue a few ideas.

I am thinking we can pick two of these. I am maintaining this tiny sheet based on which we might be able to divide ourselves into two teams:

Please fill it in by tomorrow so the teams can meet, decide on the specifics of the projects by Wednesday, and post them here.

Right now we are a total of 14 people, so I am thinking 7/7 will be ideal, unless some of you aren’t interested in this anymore and have found something better :slight_smile:

EDIT:
Please also join the Discord group here: Flax-HuggingFace-Community-Week

1 Like

I find the ideas very interesting and would like to contribute to either of the projects, ViT + mBERT - Multilingual Visual Question Answering (WIT/COCO pre-train, test on VQA/GQA) or CLIP/VQGAN for Image Synthesis. I am really interested in collaborating and contributing to this project.

2 Likes

Hi, I would also like to join the image captioning or the VQA topic, if that is still possible. I have some experience with NLP and TensorFlow, and I am located in GMT+2.

1 Like

Hi @gchhablani

Please let me know when you have formed the teams and projects so I can add you to the sheet.

1 Like