Swin Transformer for segmentation

Hello, I have a couple of questions concerning the Swin Transformer model.
1- Other vision models like ViT and BEiT have a class for semantic segmentation tasks. Why isn't there a SwinForSemanticSegmentation? And if I wanted to attach a segmentation head to the model, how can I do that using your library?

2- There is also SwinForMaskedImageModeling, which does masked image modeling, which I believe is the whole premise behind the BEiT model. So theoretically, could I train BEiT with a Swin backbone using this class?

Thank you.

Hi,

1- Other vision models like ViT and BEiT have a class for semantic segmentation tasks. Why isn't there a SwinForSemanticSegmentation? And if I wanted to attach a segmentation head to the model, how can I do that using your library?

We still need to add SwinForSemanticSegmentation to the library. For now, I'd recommend using SegFormer as explained in this blog post. I'll open an issue to add it! If you'd rather attach a segmentation head to Swin yourself in the meantime, see the sketch below.
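Here's a minimal sketch of how that could look. It is not an official Transformers class: the wrapper name, the 1x1 convolutional decode head, and the default num_labels are my own illustration of where a head would plug into SwinModel.

```python
import torch
import torch.nn as nn
from transformers import SwinModel


class SwinWithSegHead(nn.Module):
    """Hypothetical wrapper: SwinModel backbone + a linear decode head.

    Not an official Transformers class, just an illustration of where a
    segmentation head would plug in.
    """

    def __init__(self, checkpoint="microsoft/swin-tiny-patch4-window7-224", num_labels=19):
        super().__init__()
        self.backbone = SwinModel.from_pretrained(checkpoint)
        # final stage channels: embed_dim * 2 ** (num_stages - 1), i.e. 768 for swin-tiny
        self.head = nn.Conv2d(self.backbone.config.hidden_size, num_labels, kernel_size=1)

    def forward(self, pixel_values):
        # last_hidden_state: (batch, seq_len, channels), seq_len = (H/32) * (W/32)
        hidden = self.backbone(pixel_values).last_hidden_state
        batch, seq_len, channels = hidden.shape
        side = int(seq_len ** 0.5)
        # un-flatten the token sequence back into a 2D feature map
        features = hidden.transpose(1, 2).reshape(batch, channels, side, side)
        # per-pixel class logits, upsampled to the input resolution
        logits = self.head(features)
        return nn.functional.interpolate(
            logits, size=pixel_values.shape[-2:], mode="bilinear", align_corners=False
        )


model = SwinWithSegHead()
logits = model(torch.randn(1, 3, 224, 224))  # shape: (1, 19, 224, 224)
```

For serious use you'd want a stronger decoder over the intermediate stage features (the Swin paper uses a UPerNet head for ADE20K), but the wiring into the library is the same.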

2- There is also SwinForMaskedImageModeling, which does masked image modeling, which I believe is the whole premise behind the BEiT model. So theoretically, could I train BEiT with a Swin backbone using this class?

Actually, Swin and BEiT have slightly different objectives for masked image modeling. Swin's is pretty simple: you mask out some patches of the input image, and the model has to predict the raw pixel values for them. This method is called SimMIM.
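To make that concrete, here's a minimal usage sketch of SwinForMaskedImageModeling with a random boolean patch mask (the image and mask are dummy values, purely for illustration):

```python
import torch
from transformers import SwinForMaskedImageModeling

model = SwinForMaskedImageModeling.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# one mask entry per input patch: (224 / 4) ** 2 = 3136 for this checkpoint
num_patches = (model.config.image_size // model.config.patch_size) ** 2
pixel_values = torch.randn(1, 3, 224, 224)                      # dummy image batch
bool_masked_pos = torch.randint(0, 2, (1, num_patches)).bool()  # True = masked patch

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)  # reconstruction loss computed on the masked pixels only
```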

BEiT, on the other hand, predicts token IDs from the codebook of a pre-trained VQ-VAE (namely, the VQ-VAE of DALL-E 1) for the masked patches. As BEiT has its own specific pre-training objective, it's not supported by the AutoModelForMaskedImageModeling class. The latter supports Swin, Swinv2, ViT and DeiT, all of which can be pre-trained using this example script.
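As a quick sanity check of that mapping, you can let the auto class resolve a Swin config (the model below is randomly initialized, so this only illustrates which class gets picked):

```python
from transformers import AutoConfig, AutoModelForMaskedImageModeling

config = AutoConfig.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
model = AutoModelForMaskedImageModeling.from_config(config)
print(type(model).__name__)  # SwinForMaskedImageModeling
```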
