Generative representation learning has seen incredible success in natural language processing, but it remains under-explored in computer vision, where contrastive methods (e.g. SimCLR, MoCo, SwAV, DINO) are dominant. In this project, we will reproduce the state of the art in generative representation learning for vision, BEIT, and hopefully extend upon it.
The first and most important part of the project will be reproducing the results of BEIT on the tasks of image classification and semantic segmentation. This entails pre-training on ImageNet for 800 epochs, then fine-tuning on ImageNet (classification) and ADE20K (segmentation) for 100 epochs each.
The BEIT architecture is straightforward: it is a ViT that predicts visual tokens given by a VQ-VAE. We will use the Huggingface transformers ViT implementation and the VQ-VAE from DALL-E (or possibly the VQ-GAN from the Taming Transformers paper). Personally, I have a lot of experience with these types of models, but I do not have much experience with Jax/Flax, so this will be a learning experience.
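To make the objective concrete, here is a hedged NumPy sketch of the masked-image-modeling loss described above: masked patch embeddings are replaced by a learned mask embedding, and the model predicts the VQ-VAE token id at each masked position. All names, shapes, and the random "encoder" are illustrative stand-ins, not taken from the actual BEIT codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, vocab_size, dim = 196, 8192, 32   # 14x14 patches; DALL-E codebook size

patch_embeddings = rng.normal(size=(num_patches, dim))
mask_embedding = rng.normal(size=(dim,))                        # learned [MASK] embedding
visual_tokens = rng.integers(0, vocab_size, size=num_patches)   # VQ-VAE token targets

mask = rng.random(num_patches) < 0.4                            # ~40% of patches masked
inputs = np.where(mask[:, None], mask_embedding, patch_embeddings)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Stand-in for the ViT encoder plus softmax head: a random linear map here.
head = rng.normal(size=(dim, vocab_size)) / np.sqrt(dim)
log_probs = log_softmax(inputs @ head)

# Cross-entropy is computed only on the masked positions.
loss = -log_probs[mask, visual_tokens[mask]].mean()
```

In the real model the random linear map is replaced by the full transformer encoder, but the loss bookkeeping (mask, replace, predict token ids, average over masked positions) is the same.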
The BEIT training procedure, data loading, and image augmentations are similar to those used in popular vision transformer papers (ViT, DeiT, etc.). All hyperparameters are listed in the appendix of the paper.
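One detail of the training procedure worth calling out is BEIT's blockwise masking: rather than masking patches independently, rectangular blocks are sampled until roughly 40% of patches are covered. The sketch below is my reading of that procedure; the block-size bounds and aspect-ratio range are assumptions based on the paper, not exact hyperparameters.

```python
import math
import random

def blockwise_mask(h=14, w=14, target=75, min_block=16, seed=0):
    """Sketch of BEIT-style blockwise masking on an h x w patch grid.

    Repeatedly samples rectangular blocks of at least `min_block` patches
    (with a random aspect ratio) until at least `target` patches are masked.
    """
    rng = random.Random(seed)
    mask = [[False] * w for _ in range(h)]
    count = 0
    while count < target:
        area = rng.randint(min_block, max(min_block, target - count))
        aspect = math.exp(rng.uniform(math.log(0.3), math.log(1 / 0.3)))
        bh = min(h, max(1, round(math.sqrt(area * aspect))))
        bw = min(w, max(1, round(math.sqrt(area / aspect))))
        top, left = rng.randint(0, h - bh), rng.randint(0, w - bw)
        for i in range(top, top + bh):
            for j in range(left, left + bw):
                if not mask[i][j]:
                    mask[i][j] = True
                    count += 1
    return mask
```

Because blocks can overlap existing masked regions, only newly masked patches count toward the target, so the final mask slightly overshoots `target` at most by one block.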
After reproducing the BEIT results, I have a number of ideas for extending the architecture. First, I’d like to explore ways of combining contrastive self-supervised objectives (e.g. SimCLR, MoCo, DINO) with BEIT. Appendix D of the BEIT paper explored one way of combining BEIT with DINO, but I think there could be better ways of combining the masked image modeling and contrastive objectives.
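One simple baseline for the combination idea above would be a weighted sum of the masked-prediction loss and an InfoNCE loss on features from two augmented views. This is my own illustrative sketch of that idea, not the method from Appendix D; the weight and temperature values are guesses.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: matching rows of z1 and z2 are positive pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives on the diagonal

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 64))          # e.g. [CLS] features of view 1
z2 = rng.normal(size=(8, 64))          # e.g. [CLS] features of view 2

mim_loss = 1.0                          # placeholder for the BEIT loss
total_loss = mim_loss + 0.5 * info_nce(z1, z2)   # lambda = 0.5 is a guess
```

The interesting design questions are which features to contrast (patch tokens vs. [CLS]) and whether the two objectives should share the full encoder or only its lower layers.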
Some short notes on the paper are here.