Hi,
As per the doc: transformers/examples/pytorch/instance-segmentation at main · huggingface/transformers · GitHub
The instance segmentation image processor for models such as Mask2Former works as follows:
In the dataset we store:
- The image
- A dual-channel mask (see the sketch below):
  - Channel 1: the ID of the label
  - Channel 2: the index of the instance
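To make the format concrete, here is a rough sketch of such an annotation (the sizes and IDs below are made up for illustration):

```python
import numpy as np

# Toy (H, W, 2) annotation: channel 1 = label ID, channel 2 = instance index.
# Values are made up; a real sample would come from the dataset.
H, W = 4, 6
annotation = np.zeros((H, W, 2), dtype=np.uint8)

# Instance 1: a "cat" (label ID 1) in the top half of the image
annotation[:2, :, 0] = 1  # label ID
annotation[:2, :, 1] = 1  # instance index

# Instance 2: a "dog" (label ID 2) in the bottom half
annotation[2:, :, 0] = 2  # label ID
annotation[2:, :, 1] = 2  # instance index
```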
The image processor will convert this into:
- A stack of binary masks of shape (N, H, W), where N is the number of instances
- A list of N integers holding the label ID of each mask
This is what models such as Mask2Former need.
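For reference, here is roughly how I understand that conversion with `Mask2FormerImageProcessor`, using the toy annotation above, a dummy image, and settings I picked myself (`do_resize=False`, `ignore_index=255`); the real settings would come from the checkpoint config:

```python
import numpy as np
from PIL import Image
from transformers import Mask2FormerImageProcessor

# Toy two-channel annotation (same layout as the sketch above)
H, W = 4, 6
annotation = np.zeros((H, W, 2), dtype=np.uint8)
annotation[:2, :, 0], annotation[:2, :, 1] = 1, 1  # instance 1 -> label 1
annotation[2:, :, 0], annotation[2:, :, 1] = 2, 2  # instance 2 -> label 2

image = Image.new("RGB", (W, H))  # dummy image standing in for the real photo

semantic_map = annotation[..., 0]  # channel 1: label ID
instance_map = annotation[..., 1]  # channel 2: instance index
# Map each instance index to its label ID
instance_id_to_semantic_id = {
    int(i): int(semantic_map[instance_map == i][0]) for i in np.unique(instance_map)
}

processor = Mask2FormerImageProcessor(do_resize=False, ignore_index=255)
inputs = processor(
    images=image,
    segmentation_maps=instance_map,
    instance_id_to_semantic_id=instance_id_to_semantic_id,
    return_tensors="pt",
)

print(inputs["mask_labels"][0].shape)  # torch.Size([2, 4, 6]) -> (N, H, W)
print(inputs["class_labels"][0])       # tensor([1, 2]) -> one label ID per mask
```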
Then, after inference, the inverse operation is performed: the stack of individual masks is converted back into a two-channel mask. This is a destructive operation: instance segmentation is supposed to allow instances to overlap, but with the two-channel mask this is no longer possible, since each pixel can carry only one instance and one label.
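For context, this is the post-processing step I am referring to, sketched with an off-the-shelf COCO instance checkpoint (my choice for illustration) and a dummy image:

```python
import torch
from PIL import Image
from transformers import Mask2FormerForUniversalSegmentation, Mask2FormerImageProcessor

checkpoint = "facebook/mask2former-swin-tiny-coco-instance"  # example checkpoint
processor = Mask2FormerImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.new("RGB", (64, 64))  # dummy image just to run the pipeline
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

result = processor.post_process_instance_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # (height, width)
)[0]

# "segmentation" is a single (H, W) map of segment IDs: each pixel belongs to
# at most one segment, so overlapping instances can no longer be represented.
print(result["segmentation"].shape)
# "segments_info" maps each segment ID back to its label ID and score.
print(result["segments_info"])
```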
My first attempt was to adapt the pipeline by removing the image processor and writing my own transformations. However, I also need to use the CVAT integration for Hugging Face, and if I change the outputs I would have to adapt that integration as well.
Do you know if the instance segmentation image processor can accept a stack of individual instance masks instead of this two-channel mask?
Best regards