Swin Transformer for segmentation

Hello, I have a couple of questions concerning the Swin Transformer model.
1- Other vision models like ViT and BEiT have a class for semantic segmentation tasks. Why isn't there a SwinForSemanticSegmentation? And if I wanted to attach a segmentation head to the model, how can I do that using your library?

2- There is also SwinForMaskedImageModeling, which does masked image modeling, which I believe is the whole premise behind the BEiT model. So theoretically, could I train BEiT with a Swin backbone using this class?

Thank you.

Hi,

1- Other vision models like ViT and BEiT have a class for semantic segmentation tasks. Why isn't there a SwinForSemanticSegmentation? And if I wanted to attach a segmentation head to the model, how can I do that using your library?

We still need to add SwinForSemanticSegmentation to the library. For now, I'd recommend using SegFormer as explained in this blog post. I'll open an issue to add it! If you'd rather attach a segmentation head to Swin yourself in the meantime, see the sketch below.
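Here's a minimal sketch of how that could look. It is not an official Transformers class: the wrapper name, the 1x1 convolutional decode head, and the default num_labels are my own illustration of where a head would plug into SwinModel.

```python
import torch
import torch.nn as nn
from transformers import SwinModel


class SwinWithSegHead(nn.Module):
    """Hypothetical wrapper: SwinModel backbone + a linear decode head.

    Not an official Transformers class, just an illustration of where a
    segmentation head would plug in.
    """

    def __init__(self, checkpoint="microsoft/swin-tiny-patch4-window7-224", num_labels=19):
        super().__init__()
        self.backbone = SwinModel.from_pretrained(checkpoint)
        # final stage channels: embed_dim * 2 ** (num_stages - 1), i.e. 768 for swin-tiny
        self.head = nn.Conv2d(self.backbone.config.hidden_size, num_labels, kernel_size=1)

    def forward(self, pixel_values):
        # last_hidden_state: (batch, seq_len, channels), seq_len = (H/32) * (W/32)
        hidden = self.backbone(pixel_values).last_hidden_state
        batch, seq_len, channels = hidden.shape
        side = int(seq_len ** 0.5)
        # un-flatten the token sequence back into a 2D feature map
        features = hidden.transpose(1, 2).reshape(batch, channels, side, side)
        # per-pixel class logits, upsampled to the input resolution
        logits = self.head(features)
        return nn.functional.interpolate(
            logits, size=pixel_values.shape[-2:], mode="bilinear", align_corners=False
        )


model = SwinWithSegHead()
logits = model(torch.randn(1, 3, 224, 224))  # shape: (1, 19, 224, 224)
```

For serious use you'd want a stronger decoder over the intermediate stage features (the Swin paper uses a UPerNet head for ADE20K), but the wiring into the library is the same.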

2- There is also SwinForMaskedImageModeling, which does masked image modeling, which I believe is the whole premise behind the BEiT model. So theoretically, could I train BEiT with a Swin backbone using this class?

Actually, Swin and BEiT have slightly different objectives for masked image modeling. Swin's is pretty simple: you mask out some patches of the input image, and the model has to predict the raw pixel values for them. This method is called SimMIM.
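To make that concrete, here's a minimal usage sketch of SwinForMaskedImageModeling with a random boolean patch mask (the image and mask are dummy values, purely for illustration):

```python
import torch
from transformers import SwinForMaskedImageModeling

model = SwinForMaskedImageModeling.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# one mask entry per input patch: (224 / 4) ** 2 = 3136 for this checkpoint
num_patches = (model.config.image_size // model.config.patch_size) ** 2
pixel_values = torch.randn(1, 3, 224, 224)                      # dummy image batch
bool_masked_pos = torch.randint(0, 2, (1, num_patches)).bool()  # True = masked patch

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)  # reconstruction loss computed on the masked pixels only
```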

BEiT, on the other hand, predicts token IDs from the codebook of a pre-trained VQ-VAE (namely, the VQ-VAE of DALL-E 1) for the masked patches. As BEiT has its own specific pre-training objective, it's not supported by the AutoModelForMaskedImageModeling class. The latter supports Swin, Swinv2, ViT and DeiT, all of which can be pre-trained using this example script.
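As a quick sanity check of that mapping, you can let the auto class resolve a Swin config (the model below is randomly initialized, so this only illustrates which class gets picked):

```python
from transformers import AutoConfig, AutoModelForMaskedImageModeling

config = AutoConfig.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
model = AutoModelForMaskedImageModeling.from_config(config)
print(type(model).__name__)  # SwinForMaskedImageModeling
```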
