Mask2Former not performing as expected


I am working on a project to see how different models/architectures perform on my custom dataset for semantic segmentation. I want to train the models from scratch, with no pre-trained weights. I am comparing models such as ResNet-50, SegFormer, and Mask2Former. I load all my images and masks with a DataLoader. For ResNet-50, I use the FCN with a ResNet-50 backbone provided by torchvision: fcn_resnet50 — Torchvision main documentation

For Mask2Former from scratch, I am doing the following, but the results are terrible, even worse than the ResNet-50 FCN. I am using only the Mask2Former model and nothing else, such as the image processor.

    configuration = Mask2FormerConfig(**dict(arch['args']))
    configuration.num_queries = data['num_classes']
    model = Mask2FormerForUniversalSegmentation(configuration)

    for _, images, masks in dataloader:
        images =, non_blocking=True)
        masks =, non_blocking=True)
        outputs = model(pixel_values=images)
        outputs = outputs.masks_queries_logits
        outputs = nn.functional.interpolate(outputs, size=masks.shape[-2:], mode="bilinear", align_corners=False)

I have also tried with pretrained weights, but I get the same results.

model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-small-ade-semantic", num_queries=data['num_classes'], ignore_mismatched_sizes=True)

I get this warning with the pretrained weights:

Some weights of  were not initialized from the model checkpoint at facebook/mask2former-swin-small-ade-semantic and are newly initialized because the shapes did not match:

Does anyone have an idea about this? Really stuck currently. Would appreciate any advice.


What’s the reason you’d like to train the model from scratch? Note that training from scratch requires quite some compute, e.g. the authors used 8 V100 GPUs for that. It might be beneficial to just fine-tune the head layers for your custom dataset.

Refer to these tutorial notebooks regarding fine-tuning: Transformers-Tutorials/MaskFormer at master · NielsRogge/Transformers-Tutorials · GitHub (Mask2Former fine-tuning is identical to MaskFormer fine-tuning).

1 Like

Hello @nielsr,

Thanks for the response.

As mentioned above, I am comparing several different architectures. I am training from scratch because not all of these models have pre-trained weights available. Also, I’ve read that pre-trained weights and fine-tuning are only really effective when the datasets are in the same domain, but my dataset is not similar to the ADE dataset the pretrained weights were trained on.

Regardless, I have tried to implement the fine-tuning tutorial that you linked with my dataset, but have a few questions.

  1. Loss function: Mask2Former returns a Mask2FormerForUniversalSegmentationOutput, which already contains its own loss. How is that loss calculated? I want to use my own loss function (Dice loss) to compute the loss and use it for backpropagation. How can I get the raw logits in the shape (batch, num_classes, height, width) to feed into my loss function?

  2. Metrics: One odd thing that happens with the metrics defined in the tutorial is that when the IoU per category is calculated, the IoU for my background class is always 0. I want the background class (label = 0) to be included in the IoU calculation.
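Regarding question 1: the model’s two heads can be combined into per-pixel class scores. A minimal sketch with dummy tensors (the shapes and values here are assumptions for illustration), mirroring the combination that the image processor’s semantic post-processing performs internally:

```python
import torch

# Dummy outputs with Mask2Former's shapes (assumed values for illustration)
batch, num_queries, num_classes, h, w = 2, 100, 5, 96, 96
class_queries_logits = torch.randn(batch, num_queries, num_classes + 1)  # +1 for "no object"
masks_queries_logits = torch.randn(batch, num_queries, h, w)

# Drop the "no object" class and turn both heads into probabilities
class_probs = class_queries_logits.softmax(dim=-1)[..., :-1]  # (batch, queries, classes)
mask_probs = masks_queries_logits.sigmoid()                   # (batch, queries, h, w)

# Weight each query's mask by its class probabilities and sum over queries
semantic_map = torch.einsum("bqc,bqhw->bchw", class_probs, mask_probs)
print(semantic_map.shape)  # (batch, num_classes, h, w)
```

The resulting (batch, num_classes, height, width) tensor can then be interpolated to the label size and fed into a Dice loss; note these are probabilities summed over queries rather than raw unnormalized logits, so the loss should be chosen accordingly.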

I don’t think num_queries should be num_classes; num_queries refers to the number of object proposals the model makes, not the number of classes. You can see these are separate things in the post_process_instance_segmentation function for Mask2Former.
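To make the distinction concrete, a small sketch (the specific values are assumptions, not from the thread) setting the two options independently on the config:

```python
from transformers import Mask2FormerConfig

# num_queries and num_labels are independent knobs (values assumed for illustration)
config = Mask2FormerConfig(
    num_queries=100,  # how many mask proposals the decoder emits (the library default)
    num_labels=5,     # how many semantic classes the classification head predicts
)
print(config.num_queries, config.num_labels)
```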

also, maybe check out the reduce_labels arg to handle the background correctly? I’m trying this out and am curious whether it solves your IoU problem: Train a MaskFormer Segmentation Model with Hugging Face Transformers - PyImageSearch

following the tutorial @nielsr linked, you also need to set ignore_index to 0 in the processor so that the background class isn’t picked up as an object class
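A toy sketch of what setting ignore_index to 0 effectively means when a semantic map is converted into per-class binary masks (the labels here are made up; this illustrates the idea, not the processor’s exact implementation):

```python
import numpy as np

# Toy semantic mask: 0 = background, 1 and 2 = real classes (assumed labels)
segmentation_map = np.array([[0, 1, 1],
                             [0, 2, 2],
                             [0, 0, 2]])

ignore_index = 0  # the background label should not become an object class

# Conceptually: one binary mask per class actually present, skipping the ignored label
class_labels = [c for c in np.unique(segmentation_map) if c != ignore_index]
mask_labels = [(segmentation_map == c).astype(np.float32) for c in class_labels]

print(class_labels)  # background never becomes a query target
```

With background excluded this way, the matcher only assigns queries to real classes, which is also why the background IoU ends up undefined or 0 if you later expect the model to predict it explicitly.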

Hi @rbavery @nielsr

Is it clear to you how to use the ignore_index and reduce_labels arguments?

I’m trying to do binary segmentation using Mask2FormerForUniversalSegmentation.from_pretrained, and I’m also following exactly the tutorial you already mentioned.

I have a single class and I’m not able to make the model converge. Does Mask2Former support binary segmentation, or should I treat this as 2-class segmentation (background and my class)? If so, should I consider anything special in the configuration, e.g. ignore_index or reduce_labels?

Appreciate your help!