I’ve managed to make it work that way myself:
https://discuss.huggingface.co/t/encoding-masks-for-mask2former-and-panopic-segmentation/110245?u=nlassaux