Combining an encoder from one model with a decoder from another for image reconstruction

Hi! I’m learning my way around the Hugging Face code, specifically by reading the code behind LayoutLMv3 and ViT-MAE.

If my objective is to pre-train with masked image reconstruction, much like what the ViT-MAE decoder does, but with the encoder from another model such as LayoutLMv3, can I modify ViTMAEForPreTraining like this:

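from transformers.models.vit_mae.modeling_vit_mae import ViTMAEDecoder, ViTMAEPreTrainedModel

class ViTMAEForPreTraining(ViTMAEPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config

        # Change the original line
        #     self.vit = ViTMAEModel(config)
        # to the following, where Encoder is a (not yet written) class derived from,
        # for example, class LayoutLMv3Model(LayoutLMv3PreTrainedModel):
        self.encoder = Encoder(config)
        # self.vit no longer exists, so num_patches cannot be read from self.vit.embeddings.
        # Here it is computed from the config, assuming square images and integer sizes.
        num_patches = (config.image_size // config.patch_size) ** 2
        self.decoder = ViTMAEDecoder(config, num_patches=num_patches)

        # Initialize weights and apply final processing
        self.post_init()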

My initial hunch is that this won’t work as is: LayoutLMv3 returns a BaseModelOutput, while the encoder of MAE returns a ViTMAEModelOutput, which also includes ids_restore, and the MAE decoder needs ids_restore to put the tokens back in their original order. Does this mean we will also have to create a new ModelOutput class?
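To make concrete what I mean by ids_restore, here is a minimal sketch of MAE-style random masking as I understand it. It is written in plain Python with strings standing in for patch embeddings so that it runs on its own; the names MaskedEncoderOutput, random_masking and restore_order are made up for illustration and are not part of the library.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MaskedEncoderOutput:
    """Hypothetical container with the extra fields an MAE-style decoder needs."""
    kept_patches: List[str]  # stand-in for hidden states of the visible patches
    mask: List[int]          # original order: 1 = masked, 0 = kept
    ids_restore: List[int]   # original position -> index in the shuffled sequence


def random_masking(patches, mask_ratio, rng):
    """Shuffle patch indices by random noise and keep only the first part."""
    n = len(patches)
    len_keep = int(n * (1 - mask_ratio))
    noise = [rng.random() for _ in range(n)]
    # argsort of the noise: ids_shuffle[k] is the original index at shuffled slot k
    ids_shuffle = sorted(range(n), key=lambda i: noise[i])
    # argsort of ids_shuffle is the inverse permutation
    ids_restore = sorted(range(n), key=lambda i: ids_shuffle[i])
    kept = [patches[i] for i in ids_shuffle[:len_keep]]
    mask = [0 if ids_restore[i] < len_keep else 1 for i in range(n)]
    return MaskedEncoderOutput(kept, mask, ids_restore)


def restore_order(out, mask_token="[MASK]"):
    """What the decoder does first: append mask tokens, then unshuffle."""
    n = len(out.ids_restore)
    shuffled = out.kept_patches + [mask_token] * (n - len(out.kept_patches))
    return [shuffled[j] for j in out.ids_restore]


if __name__ == "__main__":
    patches = [f"p{i}" for i in range(8)]
    out = random_masking(patches, mask_ratio=0.75, rng=random.Random(0))
    print(out.kept_patches, out.mask)
    print(restore_order(out))
```

An encoder that only returns last_hidden_state gives the decoder no way to perform the restore_order step, which is why I suspect the output type matters here.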

If it is not as simple as this, do I instead need to create a new embedding class for ViTMAE that accepts the same inputs as LayoutLMv3? Any help would be greatly appreciated, thanks!