Reproducing and Extending BEIT

greeneggsandyaml · July 2, 2021, 11:31am

Motivation

Generative representation learning has seen incredible success in natural language processing, but it remains under-explored in computer vision, where contrastive methods (e.g. SimCLR, MoCo, SwAV, DINO) are dominant. In this project, we will reproduce the state of the art in generative representation learning for vision (BEIT) and hopefully extend it.

Description

For this project, we will replicate and hopefully extend BEIT.

The first and most important part of the project will be reproducing the results of BEIT on image classification and semantic segmentation. This entails pre-training on ImageNet for 800 epochs and then fine-tuning for 100 epochs on ImageNet (classification) and ADE20K (segmentation).

The BEIT architecture is straightforward: it is a ViT that predicts visual tokens given by a VQ-VAE. We will use the Hugging Face transformers ViT implementation and the VQ-VAE from DALL-E (or possibly the VQ-GAN from the Taming Transformers paper; we shall see). Personally, I have a lot of experience with these types of models, but I do not have much experience with Jax/Flax, so this will be a learning experience.
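To make the objective concrete, here is a minimal JAX sketch of the loss this setup implies: cross-entropy between the ViT's per-patch predictions and the tokenizer's visual-token ids, averaged over masked patches only. This is an illustration under assumptions, not the paper's code. The function name `masked_token_loss`, the toy shapes (196 patches, 8192-entry codebook), and the independent 40% random mask are all placeholders of mine; in particular, the random mask stands in for whatever masking strategy we end up using.

```python
import jax
import jax.numpy as jnp


def masked_token_loss(logits, target_tokens, mask):
    """Cross-entropy over masked patches only (hypothetical helper).

    logits:        (batch, num_patches, vocab_size) predictions from the ViT head
    target_tokens: (batch, num_patches) integer visual-token ids from the tokenizer
    mask:          (batch, num_patches) 1.0 where the patch was masked, else 0.0
    """
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Pick out the log-probability assigned to the correct visual token.
    token_log_probs = jnp.take_along_axis(
        log_probs, target_tokens[..., None], axis=-1
    ).squeeze(-1)
    # Average the negative log-likelihood over masked positions only;
    # the maximum() guards against division by zero for an empty mask.
    return -(token_log_probs * mask).sum() / jnp.maximum(mask.sum(), 1.0)


# Toy inputs with assumed shapes: 2 images, 14 x 14 = 196 patches, 8192 tokens.
key = jax.random.PRNGKey(0)
k_logits, k_tokens, k_mask = jax.random.split(key, 3)
logits = jax.random.normal(k_logits, (2, 196, 8192))
target_tokens = jax.random.randint(k_tokens, (2, 196), 0, 8192)
# Assumption: independent random masking, used here purely as a stand-in.
mask = jax.random.bernoulli(k_mask, 0.4, (2, 196)).astype(jnp.float32)

loss = masked_token_loss(logits, target_tokens, mask)
```

In a real training step, `logits` would come from the ViT applied to the corrupted image and `target_tokens` from the frozen tokenizer applied to the original image; the random arrays above only exercise the shapes.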

The BEIT training procedure, data loading, and image augmentations are similar to those used in popular vision transformer papers (ViT, DeiT, etc.). All hyperparameters are listed in the appendix of the paper.

After reproducing the BEIT results, I have a number of ideas for extending the architecture. First, I’d like to explore ways of combining contrastive self-supervised objectives (e.g. SimCLR, MoCo, DINO) with BEIT. Appendix D in the BEIT paper explored one way of combining BEIT with DINO, but I think there could be better ways of combining the masked image modeling and contrastive objectives.

Notes on the Paper

Some short notes on the paper are here.

patrickvonplaten · July 4, 2021, 11:24am

Officially defining it!

greeneggsandyaml · July 4, 2021, 11:38pm

Great, welcome to the project! I’ve created a #beit channel on Discord for us – come join and we’ll chat there.

cishwarya · July 9, 2021, 9:19pm

@greeneggsandyaml I would like to join the discussion. Could you please send the link to the Discord channel?

unilm · July 24, 2021, 8:05am

The official implementation can be found at unilm/beit at master · microsoft/unilm (github.com)
