The documentation script for fine-tuning Mask2Former with Trainer does not support instance segmentation with overlapping instances

I have a use case for Mask2Former where image annotations must overlap.

One example: I have to detect “cracks” and “reparations”, as well as “cracks” within “reparations”, so “crack” instances will overlap “reparation” instances.

This is part of the instance segmentation paradigm, and models such as Mask2Former support it natively.

The documentation suggests storing the masks in a single image:

  • The Red channel is used to store the Label ID
  • The Green channel is used to store the Instance ID

Ref: https://github.com/huggingface/transformers/tree/main/examples/pytorch/instance-segmentation

This does not allow instances to overlap, as each pixel can only carry one class and one instance.
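To make the limitation concrete, here is a minimal sketch of that encoding (shapes, class IDs and the file name are hypothetical): writing the “crack” instance overwrites the “reparation” pixels underneath it, so the overlap cannot be represented.

```python
import numpy as np
from PIL import Image

# One RGB annotation image: red channel = class (label) ID, green channel = instance ID.
height, width = 256, 256
annotation = np.zeros((height, width, 3), dtype=np.uint8)

# Instance 1: a "reparation" region (class ID 1, instance ID 1).
annotation[50:150, 50:150, 0] = 1   # red   = class ID
annotation[50:150, 50:150, 1] = 1   # green = instance ID

# Instance 2: a "crack" inside the reparation (class ID 2, instance ID 2).
# These assignments overwrite the reparation pixels underneath: a pixel can
# only hold one (class, instance) pair, so the overlap is lost.
annotation[80:120, 80:120, 0] = 2
annotation[80:120, 80:120, 1] = 2

Image.fromarray(annotation).save("0001_annotation.png")
```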

This image is then decoded by AutoImageProcessor into mask_labels and class_labels, where mask_labels is (if I am not mistaken) an N×H×W binary tensor, N being the number of instances, and class_labels is an integer tensor of length N storing the Label ID of each instance.
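For reference, a sketch of how such an annotation image can be handed to the image processor (the checkpoint and file names are placeholders, and passing ignore_index=0 to mark background pixels is my assumption, not necessarily what the example script does):

```python
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-tiny-coco-instance")

image = Image.open("0001.png").convert("RGB")
annotation = np.array(Image.open("0001_annotation.png"))

class_id_map = annotation[..., 0]   # red channel  = class (label) IDs
instance_seg = annotation[..., 1]   # green channel = instance IDs

# Map each instance ID to its class ID; instance ID 0 is treated as background.
instance_to_class = {
    int(i): int(class_id_map[instance_seg == i][0])
    for i in np.unique(instance_seg)
    if i != 0
}

inputs = processor(
    images=image,
    segmentation_maps=instance_seg,
    instance_id_to_semantic_id=instance_to_class,
    ignore_index=0,          # pixels with instance ID 0 belong to no instance
    return_tensors="pt",
)

print(inputs["mask_labels"][0].shape)   # (num_instances, height, width) binary masks
print(inputs["class_labels"][0])        # (num_instances,) class IDs
```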

I can see a workaround. In my Hugging Face dataset, I would create (sketched after the list):

  • A column for the image
  • A column for the list of instance masks (i.e. an N×H×W tensor)
  • A column for the list of Label IDs, one per instance
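A minimal sketch of such a schema with the datasets library; the column names, and the idea of storing each instance mask as a separate binary image, are assumptions rather than an established convention.

```python
from datasets import Dataset, Features, Image, Sequence, Value

# Hypothetical schema: one row per image, with one binary mask and one class ID
# per instance.
features = Features(
    {
        "image": Image(),
        "instance_masks": Sequence(Image()),           # N binary masks per row
        "instance_classes": Sequence(Value("int32")),  # N class IDs per row
    }
)

dataset = Dataset.from_dict(
    {
        "image": ["0001.png"],
        "instance_masks": [["0001_mask_0.png", "0001_mask_1.png"]],
        "instance_classes": [[1, 2]],
    },
    features=features,
)
```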

Like many other users, I can only export instance segmentation annotations from my annotation software in standard formats such as COCO.

This means that my process is now…

  • Export my annotations in COCO or a similar format
  • Convert the COCO .json structure into the Hugging Face .jsonl structure
  • Store that in the Hugging Face dataset
  • Load the dataset in the training script
  • Decode the masks’ RLE into N×H×W and length-N tensors (see the sketch after this list)
  • Apply my transformations on the tensors
  • Feed the Trainer class
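For the RLE-decoding step, a sketch with pycocotools (the helper name is hypothetical, and the “counts” values are assumed to come from a standard COCO export):

```python
import numpy as np
from pycocotools import mask as mask_utils

def rle_to_tensors(rle_counts, class_ids, height, width):
    """Decode N COCO RLE masks into an (N, H, W) binary mask stack
    and a length-N class ID array."""
    masks = []
    for counts in rle_counts:
        rle = {"size": [height, width], "counts": counts}
        if isinstance(counts, list):
            # Uncompressed RLE: convert it to the compressed form first.
            rle = mask_utils.frPyObjects(rle, height, width)
        masks.append(mask_utils.decode(rle))            # (H, W) uint8
    return np.stack(masks, axis=0), np.asarray(class_ids, dtype=np.int64)
```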

I’m a bit surprised because the initial implementation of Mask2Former in Detectron2 natively supports loading COCO datasets.

Has this been dropped in the Hugging Face implementation, or am I missing something?


Hmm… it seems that your understanding is correct.


I have adapted the dataset and the run_instance_segmentation.py script, but it is not straightforward…

In the dataset I store the data like this:

  • Image
  • List of mask RLE strings
  • List of the corresponding class for each mask

Then I modified how the data is loaded in the augment_and_transform_batch function so that it decodes the RLE and creates a stack of masks.

There, I could not find a way to feed AutoImageProcessor a stack of masks, so I removed it and wrote my own preprocessing.
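A stripped-down sketch of that kind of replacement (column names, normalization constants and the missing augmentations are placeholders): it builds pixel_values, mask_labels and class_labels directly, which Mask2FormerForUniversalSegmentation accepts as labels in its forward pass.

```python
import numpy as np
import torch
from pycocotools import mask as mask_utils

IMAGE_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGE_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def transform_batch(examples):
    """Replace the image-processor call: decode per-instance RLE masks and
    build the model inputs by hand."""
    batch = {"pixel_values": [], "mask_labels": [], "class_labels": []}
    for image, rles, classes in zip(
        examples["image"], examples["mask_rles"], examples["mask_classes"]
    ):
        image = np.asarray(image.convert("RGB"), dtype=np.float32) / 255.0
        height, width = image.shape[:2]

        # Decode each RLE string into an (H, W) binary mask and stack them.
        masks = np.stack(
            [mask_utils.decode({"size": [height, width], "counts": c}) for c in rles]
        )

        # (Augmentations on the image and the mask stack would go here.)

        pixel_values = torch.from_numpy(
            ((image - IMAGE_MEAN) / IMAGE_STD).transpose(2, 0, 1)
        )
        batch["pixel_values"].append(pixel_values)
        batch["mask_labels"].append(torch.from_numpy(masks).float())
        batch["class_labels"].append(torch.tensor(classes, dtype=torch.long))
    return batch
```

With a collate_fn that stacks pixel_values and passes mask_labels and class_labels through as lists of per-image tensors, such a batch can be fed to the Trainer directly.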

While this seems simple, there are lots of traps.

If someone has a better solution…


The big drawback of this solution is that, by removing the image processor, I end up with a non-standard Hugging Face model. As I want to use the Hugging Face integration with CVAT afterwards, it means I also need to adapt the integration itself…
