Hey, I used the BEiT model pretrained and fine-tuned for semantic segmentation from Microsoft (`microsoft/beit-base-finetuned-ade-640-640`) and tested it on a few pictures from the ADE20K test set. I looked at individual examples as well as metrics, and compared the results against the pretrained model available in the original unilm/beit repository on GitHub. The model from the Hugging Face Hub gives much worse results than the unilm/beit one: 32 vs. 53 mIoU. Why could this be the case? My implementation is very simple: I take the pretrained model as-is, and since this is the inference phase, preprocessing only includes resizing (640x640), rescaling to the [0, 1] range, and normalizing with the ImageNet mean/std values.
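For reference, this is roughly what my preprocessing looks like. It's a minimal sketch with PIL/NumPy rather than my exact script; the `preprocess` helper and the bilinear resize choice are just for illustration:

```python
import numpy as np
from PIL import Image

# Standard ImageNet normalization constants used at inference
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: Image.Image, size: int = 640) -> np.ndarray:
    """Resize to size x size, rescale to [0, 1], normalize with ImageNet stats."""
    image = image.convert("RGB").resize((size, size), Image.BILINEAR)
    x = np.asarray(image).astype(np.float32) / 255.0   # rescale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD             # ImageNet normalization
    return x.transpose(2, 0, 1)                        # HWC -> CHW for the model

# Dummy 480x640 image just to show the resulting tensor shape
img = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))
print(preprocess(img).shape)  # (3, 640, 640)
```

The resulting array is what I feed to the model (after adding a batch dimension).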