Mask2Former with low-rank adaptation (LoRA) setup for binary segmentation

I previously posted this topic in the Beginner category, but I feel it fits better here as a general model question. There is also new information, and I am not yet able to edit my questions.

Hello,

I am trying to fine-tune a Mask2Former model for a binary segmentation task where 0 is the background and 1 is my object. I initialize the processor and model as follows:

from transformers import Mask2FormerImageProcessor, Mask2FormerForUniversalSegmentation

IMAGE_PROCESSOR = Mask2FormerImageProcessor.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    do_rescale=False,   # inputs are already scaled to [0, 1]
    do_normalize=True,
    do_resize=False,    # my dataset handles cropping
    num_labels=2,
    ignore_index=0,
)

MODEL = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    num_labels=2,
    ignore_mismatched_sizes=True,  # the pretrained ADE20K class head has a different shape
)
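
For reference, the fine-tuning examples I have seen pass an explicit id2label mapping instead of num_labels; the equivalent initialization would be (the label names here are placeholders for my task):

id2label = {0: "background", 1: "object"}
label2id = {v: k for k, v in id2label.items()}

MODEL = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-base-IN21k-ade-semantic",
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)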

I have also added low-rank adaptation (LoRA) to mitigate catastrophic forgetting:

from peft import LoraConfig, get_peft_model

# Attention projections in the transformer decoder plus the classification head
target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "class_predictor"]

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=target_modules,
    lora_dropout=0.3,
    bias="lora_only",            # only train the biases of the LoRA layers
    init_lora_weights="pissa",   # PiSSA initialization of the adapter weights
    use_rslora=True,             # rank-stabilized scaling: alpha / sqrt(r)
    modules_to_save=["decode_head"],  # train this module fully, outside of LoRA
)

LORA_MODEL = get_peft_model(MODEL, config)

import torch

OPTIMIZER = torch.optim.AdamW(LORA_MODEL.parameters(), lr=1e-5)
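
A quick way to verify that the adapters attached to the intended modules is PEFT's built-in summary:

LORA_MODEL.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...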

My dataset randomly rotates and flips the training and validation data, and it also takes a RandomCrop of the image and mask (a simplified sketch of the transform is below). This setup has already worked for training an LRASPP model from scratch.
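
Roughly, the augmentation does the following (a simplified sketch using torchvision's v2 transforms; the crop size of 384 is only an example, and image / mask stand in for one training pair):

from torchvision import tv_tensors
from torchvision.transforms import v2

# Wrapping the mask as tv_tensors.Mask makes v2 apply the same random
# crop/flip/rotation to image and mask (masks get nearest-neighbor interpolation)
AUGMENT = v2.Compose([
    v2.RandomCrop(384),
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomVerticalFlip(p=0.5),
    v2.RandomRotation(degrees=90),
])

# image: (C, H, W) float tensor in [0, 1]; mask: (H, W) integer tensor with values {0, 1}
image_aug, mask_aug = AUGMENT(tv_tensors.Image(image), tv_tensors.Mask(mask))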
However, every time I try to train the Mask2Former model, the accuracy converges to 0 after a few epochs and the model starts outputting only null tensors.
Is something wrong with my initialization? I have already looked for other threads about this, but there are not many; there is one on Hugging Face from a year ago that was never answered.
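
In case it helps, this is a minimal check of what the processor actually produces for one sample (image and mask as in the sketch above):

batch = IMAGE_PROCESSOR(images=[image], segmentation_maps=[mask], return_tensors="pt")

# If I understand ignore_index correctly, label 0 is dropped entirely here,
# so class_labels should only ever contain tensor([1]) and the background
# class never appears as a training target
print(batch["class_labels"])
print(batch["mask_labels"][0].shape)  # one binary mask per remaining class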