Mask2Former setup for binary segmentation

Hello,

I am trying to fine-tune a Mask2Former model for a binary segmentation task where 0 is the background and 1 is my object. I am initializing the processor and model as follows:

IMAGE_PROCESSOR = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    do_rescale=False,
    do_normalize=True,
    do_resize=False,
    num_labels=2,
    ignore_index=0,
)

MODEL = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    num_labels=2,
    ignore_mismatched_sizes=True,
)

My dataset randomly rotates and flips the training and validation data, and it also takes a RandomCrop of the image and mask. This setup has already worked for training an LRASPP model from scratch.
However, every time I train the Mask2Former model, the accuracy converges to 0 after a few epochs and it starts outputting only null tensors.
Is something wrong with my initialization? I have already looked for other threads about this, but there aren't many. On Hugging Face there is one from a year ago, but it was never answered.
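For reference, the augmentation is roughly equivalent to the following sketch (written here with torchvision's v2 transforms for illustration; the exact parameters are illustrative):

import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

# joint image/mask augmentation: v2 transforms apply the same random
# parameters to both inputs (requires torchvision >= 0.16 for tv_tensors)
augment = v2.Compose([
    v2.RandomRotation(degrees=90),
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomVerticalFlip(p=0.5),
    v2.RandomCrop(size=(512, 512)),
])

image = torch.rand(3, 1024, 1024)
mask = tv_tensors.Mask(torch.zeros(1024, 1024, dtype=torch.long))
image_aug, mask_aug = augment(image, mask)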

I will try implementing something that guarantees the input will never be a null tensor and see whether that improves things, but at this point I am not very hopeful.

I posted this problem with more information in the “Models” category since it fits better there. This thread can be closed.

Hi @beschmitt, did you try training with do_reduce_labels=True + ignore_index=255 instead?
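That would look roughly like this (a sketch reusing the checkpoint from your original post; do_reduce_labels=True remaps label 0 to 255 and shifts the remaining labels down by one):

from transformers import Mask2FormerImageProcessor

# sketch: do_reduce_labels=True maps label 0 (background) to 255 and
# decrements the remaining labels, so ignore_index=255 skips the background
processor = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    do_reduce_labels=True,
    ignore_index=255,
)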

I tried using do_reduce_labels at one point, but that led to it always returning a null tensor.
I set ignore_index=0 because my background is labeled 0 and the object I want to detect is labeled 1. When I read the masks in, the background is 0 and the object is 255, but I divide the mask tensor by 255 to map it to 0 and 1, because I had trouble with other models before when I didn't do this.
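Concretely, the remapping looks roughly like this (variable names are illustrative):

import numpy as np

# masks load with background = 0 and object = 255; integer division by
# 255 maps the values to {0, 1}
raw_mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)
binary_mask = raw_mask // 255  # background -> 0, object -> 1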

I will try your suggestion, though, and post the result.

Try following this code snippet.

Please note that your binary mask should use 255 for the background and that do_reduce_labels is set to False.

import requests
import torch
import numpy as np
from PIL import Image
from transformers import Mask2FormerForUniversalSegmentation, Mask2FormerImageProcessor


# configure the image processor: keep label ids as-is, treat 255 as the ignore index
processor: Mask2FormerImageProcessor = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-tiny-coco-instance",
    do_reduce_labels=False,
    ignore_index=255,
)

id2label = {
    0: "cat",  # relevant classes ids must start from 0
    255: "background",
}
label2id = {v: k for k, v in id2label.items()}

# load Mask2Former fine-tuned on Mapillary Vistas semantic segmentation
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-large-mapillary-vistas-semantic",
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
w, h = image.size

# create a dummy instance map with two instances, instance ids are "5" and "10"
instances_map = np.full((h, w), 255, dtype=np.uint8)
instances_map[: h // 2, :] = 5
instances_map[h // 2:, :] = 10

# map both instance ids to semantic class id 0 ("cat")
instance_id_to_semantic_id = {5: 0, 10: 0}

# prepare inputs; the image processor will create two binary masks, one for each instance
inputs = processor(
    images=[image],
    segmentation_maps=[instances_map],
    instance_id_to_semantic_id=instance_id_to_semantic_id,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# the outputs can be passed back to the processor for post-processing
predicted_semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]

Instance segmentation example:

In case you just need semantic segmentation (not instance segmentation), I would recommend looking at the Segformer example.

I have set it up so that the background is 255 and the object is 0. ignore_index is set to 255 and do_reduce_labels is set to False. I even implemented it so that every random crop contains at least several thousand pixels of the object, to prevent the model from learning from empty tensors, and it is still not working.
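The crop logic is roughly the following (a sketch; the helper name and the exact threshold are illustrative):

import torchvision.transforms.functional as F
from torchvision.transforms import RandomCrop

# sketch of the crop-rejection idea: resample crops until one contains
# enough object pixels (object = 0, background = 255 in this setup)
def crop_containing_object(image, mask, size=(512, 512),
                           min_object_pixels=5000, max_tries=50):
    # image: (C, H, W) tensor, mask: (H, W) tensor
    for _ in range(max_tries):
        top, left, h, w = RandomCrop.get_params(image, output_size=size)
        mask_crop = F.crop(mask, top, left, h, w)
        if (mask_crop == 0).sum() >= min_object_pixels:
            break
    # falls back to the last sampled crop if no crop met the threshold
    return F.crop(image, top, left, h, w), mask_crop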

Hi @beschmitt,
The example I provided above seems to run without errors. However, you can avoid the image processor entirely if it doesn't work for you and prepare the model input yourself; just make sure it is in the same format the model expects.

First, take a look at the officially provided example (see the links above), then prepare your input in the same format.
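For the binary setup in this thread, that would look roughly like the sketch below (the dummy data is illustrative; shapes follow the Mask2Former docs, where mask_labels is a list of (num_target_masks, height, width) float tensors and class_labels is a list with one class id per mask):

import torch
from transformers import Mask2FormerForUniversalSegmentation

model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    num_labels=2,
    ignore_mismatched_sizes=True,
)

# dummy normalized image and a binary object mask
pixel_values = torch.rand(1, 3, 512, 512)   # (batch, channels, H, W)
object_mask = torch.zeros(512, 512)
object_mask[100:200, 150:300] = 1.0         # 1.0 where the object is

# one binary mask per target object, plus its class id
mask_labels = [object_mask.unsqueeze(0)]    # list of (num_masks, H, W)
class_labels = [torch.tensor([1])]          # class id 1 = object, as in this thread

outputs = model(
    pixel_values=pixel_values,
    mask_labels=mask_labels,
    class_labels=class_labels,
)
print(outputs.loss)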