How to build a ViTMAE encoder with a U-Net decoder for semantic segmentation

I’m using a pretrained ViTMAE and want to use its encoder to train a model on a downstream segmentation task with a single class. I don’t understand how to turn the encoder’s hidden states into inputs for a U-Net decoder. Any guidance would be much appreciated.
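To make the question concrete, here is a rough sketch of one way the pieces could fit together. It is not a reference implementation. It assumes the hidden states arrive as tensors of shape (batch, 1 + num_patches, embed_dim) with a leading CLS token, that masking is disabled so every patch is present, and that a permutation like MAE's `ids_restore` is available to put the patches back in image order. With Hugging Face `ViTMAEModel`, I believe this means loading with `mask_ratio=0.0`, calling the model with `output_hidden_states=True`, and reading `outputs.hidden_states` and `outputs.ids_restore`, but please check that against your installed version. The names `tokens_to_feature_map`, `ConvBlock` and `ViTUNetDecoder` are hypothetical, and the channel widths and the choice of four tapped layers are arbitrary. Random tensors stand in for the encoder outputs.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def tokens_to_feature_map(tokens, ids_restore=None):
    """Turn (B, 1 + N, D) tokens (CLS first) into a (B, D, H, W) feature map."""
    patches = tokens[:, 1:, :]  # drop the CLS token
    if ids_restore is not None:
        # Undo the MAE shuffle so patches are back in row-major image order.
        idx = ids_restore.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        patches = torch.gather(patches, 1, idx)
    b, n, d = patches.shape
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("expected a square patch grid; is masking still enabled?")
    return patches.transpose(1, 2).reshape(b, d, side, side)


class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class ViTUNetDecoder(nn.Module):
    """U-Net-style decoder over several same-resolution ViT feature maps."""

    def __init__(self, embed_dim=768, channels=(256, 128, 64, 32), num_classes=1):
        super().__init__()
        # One 1x1 projection per tapped encoder layer, deepest layer first.
        self.proj = nn.ModuleList(nn.Conv2d(embed_dim, c, 1) for c in channels)
        self.fuse = nn.ModuleList(
            ConvBlock(channels[i - 1] + channels[i], channels[i])
            for i in range(1, len(channels))
        )
        self.head = nn.Conv2d(channels[-1], num_classes, 1)

    def forward(self, feature_maps, out_size):
        # feature_maps: list of (B, D, h, w), deepest layer first.
        x = self.proj[0](feature_maps[0])
        for i, fuse in enumerate(self.fuse, start=1):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            # A ViT has one resolution, so the skip is upsampled to match x.
            skip = F.interpolate(self.proj[i](feature_maps[i]), size=x.shape[-2:],
                                 mode="bilinear", align_corners=False)
            x = fuse(torch.cat([x, skip], dim=1))
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return self.head(x)  # raw logits, one channel per class


# Stand-ins for encoder outputs: 4 tapped layers, 14x14 patches, dim 768.
hidden = [torch.randn(1, 197, 768) for _ in range(4)]
ids_restore = torch.randperm(196).unsqueeze(0)
maps = [tokens_to_feature_map(h, ids_restore) for h in hidden]

decoder = ViTUNetDecoder()
logits = decoder(maps, out_size=(224, 224))
target = torch.randint(0, 2, (1, 1, 224, 224)).float()
loss = F.binary_cross_entropy_with_logits(logits, target)  # single-class mask
```

For a base-size encoder, one plausible choice is to tap the hidden states after blocks 12, 9, 6 and 3 and pass them deepest first. Since there is only one foreground class, the decoder emits a single logit channel and a binary cross-entropy loss (optionally combined with Dice) is a reasonable starting point. Whether to freeze the encoder or fine-tune it with a small learning rate depends on how much labeled data you have.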