I am working on a project to see how several models/architectures perform on my custom dataset for semantic segmentation. I want to train the models from scratch, with no pretrained weights. I am comparing models such as ResNet50, SegFormer and Mask2Former. I load all my images and masks using a DataLoader. For ResNet50, I use the ResNet50 FCN provided by torchvision (fcn_resnet50 in the Torchvision documentation).
For Mask2Former from scratch, I am doing the following, but the results are terrible, even worse than the ResNet50 FCN. I am using only the Mask2Former model and nothing else, such as the ImageProcessor.
configuration = Mask2FormerConfig(**dict(arch['args']))
configuration.num_queries = data['num_classes']
model = Mask2FormerForUniversalSegmentation(configuration)
I am also trying with pretrained weights, but get the same results.
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-small-ade-semantic", num_queries=data['num_classes'], ignore_mismatched_sizes=True)
I get this warning with the pretrained weights:
Some weights of Mask2FormerForUniversalSegmentation were not initialized from the model checkpoint at facebook/mask2former-swin-small-ade-semantic and are newly initialized because the shapes did not match:
What’s the reason you’d like to train the model from scratch? Note that training from scratch requires a lot of compute; the authors used 8 V100 GPUs, for example. It might be more beneficial to fine-tune just the head layers on your custom dataset.
As mentioned above, I am comparing several different architectures. I am training from scratch because not all of these models have pretrained weights available. Also, I’ve read that pretrained weights and fine-tuning are only very effective when the datasets are in the same domain, and my dataset is not similar to the ADE dataset used for pretraining.
Regardless, I have tried to implement the fine-tuning tutorial that you linked with my dataset, but have a few questions.
Loss function: Mask2Former returns the Mask2FormerForUniversalSegmentationOutput as output. It has its own loss with it. How does it calculate the loss? I want to use my own loss function (Dice Loss) to calculate the loss and use it for the backpropagation. How can I get the raw logits in the form of (batch, num_classes, height, width) to input into my loss function?
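On the loss question: as far as I know, the built-in loss matches queries to ground-truth masks (Hungarian matching) and combines a class cross-entropy term with mask cross-entropy and Dice terms, so the model never produces a single (batch, num_classes, height, width) logit tensor. A minimal sketch of how one could build such a map from the output's class_queries_logits and masks_queries_logits fields follows. The helper names queries_to_semantic_scores and dice_loss are hypothetical, the random tensors stand in for real model outputs, and the per-pixel normalisation inside the Dice loss is my own choice rather than anything the library prescribes. Note that the result is a per-class score map, not softmax logits.

```python
import torch
import torch.nn.functional as F


def queries_to_semantic_scores(class_queries_logits, masks_queries_logits, target_size):
    """Combine per-query outputs into a (batch, num_classes, H, W) score map."""
    # Softmax over classes, then drop the trailing "no object" column: (B, Q, C)
    class_probs = class_queries_logits.softmax(dim=-1)[..., :-1]
    # Mask logits come out at reduced resolution, so upsample them: (B, Q, H, W)
    masks = F.interpolate(
        masks_queries_logits, size=target_size, mode="bilinear", align_corners=False
    )
    mask_probs = masks.sigmoid()
    # Weight every query mask by its class probabilities and sum over queries
    return torch.einsum("bqc,bqhw->bchw", class_probs, mask_probs)


def dice_loss(scores, target, eps=1e-6):
    """Soft multi-class Dice loss on non-negative per-class scores."""
    num_classes = scores.shape[1]
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).to(scores.dtype)
    # Assumption: normalise scores so they sum to 1 over classes at each pixel
    probs = scores / scores.sum(dim=1, keepdim=True).clamp_min(eps)
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    denominator = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()


# Stand-ins for outputs.class_queries_logits and outputs.masks_queries_logits
B, Q, C, H, W = 2, 100, 3, 64, 64
class_queries_logits = torch.randn(B, Q, C + 1, requires_grad=True)
masks_queries_logits = torch.randn(B, Q, H // 4, W // 4, requires_grad=True)
target = torch.randint(0, C, (B, H, W))  # integer class ids per pixel

scores = queries_to_semantic_scores(class_queries_logits, masks_queries_logits, (H, W))
loss = dice_loss(scores, target)
loss.backward()  # gradients flow back to both query tensors
```

Whether training only on this loss works as well as the built-in matched loss is something I have not verified; it skips the query matching step entirely.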
Metrics: One odd thing about the metrics defined in the tutorial is that when the per-category IoU is calculated, the IoU for my background class is always 0. I want the background class (label = 0) to be included in the IoU calculation.
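To check whether the zero background IoU comes from the metric settings rather than from the model, it can help to compute per-class IoU by hand. The sketch below is a plain confusion-matrix IoU in PyTorch, not the tutorial's metric; the function name per_class_iou is hypothetical, and classes that appear in neither prediction nor target are reported as 0 here, which is a simplification.

```python
import torch


def per_class_iou(pred, target, num_classes, ignore_index=None):
    """Per-class IoU from a confusion matrix; pred and target hold class ids."""
    pred = pred.flatten()
    target = target.flatten()
    if ignore_index is not None:
        # Pixels whose ground truth equals ignore_index are dropped entirely
        keep = target != ignore_index
        pred, target = pred[keep], target[keep]
    # Rows are ground-truth classes, columns are predicted classes
    confusion = torch.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = confusion.diag()
    union = confusion.sum(dim=0) + confusion.sum(dim=1) - intersection
    return intersection.float() / union.clamp_min(1).float()


pred = torch.tensor([[0, 0, 1], [1, 1, 1]])
target = torch.tensor([[0, 0, 1], [0, 1, 1]])

iou_all = per_class_iou(pred, target, num_classes=2)
iou_ignoring_background = per_class_iou(pred, target, num_classes=2, ignore_index=0)
```

In this toy case the background IoU is 2/3 when nothing is ignored and drops to 0 once ignore_index=0 removes every background pixel from the ground truth, which looks like the symptom described above.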
I don’t think num_queries should be num_classes. I think num_queries refers to the number of objects to detect. You can see that these are separate things in the post_process_instance_segmentation function for Mask2Former.
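A minimal sketch of keeping the two settings apart, assuming a hypothetical dataset with 3 classes: the class count goes into num_labels, while num_queries is left at its default.

```python
from transformers import Mask2FormerConfig

# Assumption: 3 classes in the custom dataset (hypothetical value)
config = Mask2FormerConfig(num_labels=3)

# num_queries keeps its default and is independent of the class count
print(config.num_queries, config.num_labels)

# For the pretrained route, the same idea would be to pass num_labels instead
# of num_queries (not executed here, since it downloads the checkpoint):
# model = Mask2FormerForUniversalSegmentation.from_pretrained(
#     "facebook/mask2former-swin-small-ade-semantic",
#     num_labels=3,
#     ignore_mismatched_sizes=True,
# )
```

With this setup the class head should output num_labels + 1 scores per query (the extra one being "no object"), so only the classification layer changes shape relative to the checkpoint.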
Following the tutorial @nielsr linked, you also need to set ignore_index to 0 in the processor so that the background class isn’t picked up as an object class.
Is it clear to you how to use the ignore_index and reduce_labels arguments?
I’m trying to do binary segmentation using Mask2FormerForUniversalSegmentation.from_pretrained, and I’m also following exactly the tutorial you already mentioned.
I have a single class and I’m not able to make the model converge. Does Mask2Former support binary segmentation, or should I treat this as 2-class segmentation (background and my class)? If so, should I consider something special in the configuration, e.g. ignore_index or reduce_labels?
Hello @manuCeron96 ,
have you figured out the correct setup for binary segmentation?
I have my image_processor set to num_labels = 2 and ignore_index = 0, since 0 is my background label and 1 is my object label.
My model is set to num_labels = 2 and ignore_mismatched_sizes = True
The fine-tuning starts out promising but every time the accuracy starts converging to 0 and the model then only predicts null-vectors. I’d love to have a clearer tutorial about how to set up the different parameters for the processor and model for specific use-cases.
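For the binary case discussed above, it may help to see what ignore_index and reduce_labels do to a label map before the masks reach the model. The sketch below is a simplified NumPy re-implementation based on my reading of the image processor's documented behaviour, not the library's actual code, so please verify it against your installed version. The function name remap_labels is hypothetical.

```python
import numpy as np


def remap_labels(seg_map, ignore_index=None, reduce_labels=False):
    """Return the class ids and binary masks that would become training targets."""
    seg_map = np.asarray(seg_map)
    if reduce_labels:
        if ignore_index is None:
            raise ValueError("reduce_labels needs an ignore_index for background")
        # Background (0) becomes ignore_index; every other label shifts down by 1
        seg_map = np.where(seg_map == 0, ignore_index, seg_map - 1)
    labels = np.unique(seg_map)
    if ignore_index is not None:
        # Pixels equal to ignore_index never produce a target mask
        labels = labels[labels != ignore_index]
    masks = [seg_map == label for label in labels]
    return labels, masks


seg = [[0, 0, 1], [0, 1, 1]]  # 0 = background, 1 = object

labels_plain, _ = remap_labels(seg)
labels_ignore0, _ = remap_labels(seg, ignore_index=0)
labels_reduced, _ = remap_labels(seg, ignore_index=255, reduce_labels=True)
```

Under these assumptions, ignore_index = 0 means the background never appears as a target class, so only label 1 is trained, while reduce_labels with ignore_index = 255 turns the object into class 0 and would call for num_labels = 1. If the background should be scored as a class of its own, leaving ignore_index unset (or at a value that does not occur in the masks) keeps label 0 as a target.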