Inference with ViTMAE by providing a mask

Hi,

I am trying to use ViTMAE (https://huggingface.co/docs/transformers/model_doc/vit_mae). More specifically, I have an image and a mask that specifies the parts of the image I'd like to reconstruct.

As I understand the paper, the model is designed for exactly this task, but looking into the code and demos I always find that the mask is generated randomly inside the forward method of the MAE model.

  1. Is my understanding correct, or am I missing something essential?
  2. Is there a way to achieve my goal without changing too much of the original code? (A sketch of what I have in mind is below.)
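
For reference, here is a minimal sketch of what I imagine this could look like. If I'm reading modeling_vit_mae.py correctly, ViTMAEForPreTraining.forward() accepts an optional noise tensor of shape (batch_size, num_patches), and random_masking() argsorts it ascending and keeps the patches with the lowest values, so giving the patches I want reconstructed strictly larger noise should force exactly those patches out. The checkpoint is the standard facebook/vit-mae-base; the image path and the example patch indices are just placeholders:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

# Placeholder: a binary patch-level mask, 1 = patch I want reconstructed.
# facebook/vit-mae-base takes 224x224 images in 16x16 patches -> 196 patches.
num_patches = 196
patch_mask = torch.zeros(num_patches)
patch_mask[50:80] = 1.0  # arbitrary example region

# The number of masked patches is fixed by config.mask_ratio via
# len_keep = int(num_patches * (1 - mask_ratio)), so set it to match my mask.
# Subtracting half a patch guards against float rounding in that int().
mask_ratio = (patch_mask.sum().item() - 0.5) / num_patches

model = ViTMAEForPreTraining.from_pretrained(
    "facebook/vit-mae-base", mask_ratio=mask_ratio
)
processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")

image = Image.open("my_image.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

# random_masking keeps the patches with the smallest noise values, so the
# patches to reconstruct get noise in [1, 1.01) and the rest in [0, 0.01).
noise = patch_mask.unsqueeze(0) + 0.01 * torch.rand(1, num_patches)

with torch.no_grad():
    outputs = model(**inputs, noise=noise)

print(outputs.mask)          # hopefully equals patch_mask (1 = masked)
print(outputs.logits.shape)  # (1, 196, 16*16*3) per-patch pixel predictions
```

I haven't verified that this reproduces my mask exactly, so if passing noise like this isn't the intended mechanism, I'd be glad to hear about a cleaner way.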

Thanks for your help!