Hi! I’m learning the lay of the land around the Hugging Face code, specifically reading the code behind LayoutLMv3 and MAE.
If my objective is to pre-train with masked image reconstruction, much like what the ViT-MAE decoder does, but with the encoder from another model such as LayoutLMv3, can I modify:
```python
class ViTMAEForPreTraining(ViTMAEPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config

        # change the following original line
        # self.vit = ViTMAEModel(config)
        # to instead
        self.encoder = Encoder(config)
        # where Encoder is derived from
        # LayoutLMv3Model(LayoutLMv3PreTrainedModel), for example

        # note: self.vit no longer exists after the swap, so this reference
        # has to change too (assuming the new encoder also exposes num_patches)
        self.decoder = ViTMAEDecoder(config, num_patches=self.encoder.embeddings.num_patches)

        # Initialize weights and apply final processing
        self.post_init()
```
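For context, this is my (simplified) reading of how the original forward pass in modeling_vit_mae.py wires the encoder to the decoder; the decoder needs ids_restore to unshuffle the visible tokens back into their original positions before reconstruction:

```python
# simplified from my reading of ViTMAEForPreTraining.forward;
# optional arguments and the output wrapper are omitted
def forward(self, pixel_values):
    outputs = self.vit(pixel_values)          # returns a ViTMAEModelOutput
    latent = outputs.last_hidden_state        # visible tokens only
    decoder_outputs = self.decoder(latent, outputs.ids_restore)
    logits = decoder_outputs.logits           # per-patch pixel predictions
    loss = self.forward_loss(pixel_values, logits, outputs.mask)
    return loss, logits
```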
My initial hunch is that it won’t work, because LayoutLMv3 returns a BaseModelOutput while the encoder block of MAE returns a ViTMAEModelOutput, which includes ids_restore. Does this mean we will also have to create a new ModelOutput class?
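If so, a minimal sketch of what I imagine, mirroring the fields of ViTMAEModelOutput (the class name HybridEncoderOutput is made up):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from transformers.utils import ModelOutput


@dataclass
class HybridEncoderOutput(ModelOutput):
    # hypothetical output class: BaseModelOutput's fields plus the two
    # extras that ViTMAEDecoder needs downstream
    last_hidden_state: torch.FloatTensor = None
    mask: Optional[torch.LongTensor] = None
    ids_restore: Optional[torch.LongTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
```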
If it is not as simple as this, then do I need to create a new embeddings class for ViTMAE that takes inputs the way LayoutLMv3 does? Something along the lines of the sketch below is what I have in mind.
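Here every name (MaskedPatchEmbeddings, the patch_embed argument) is my own invention, not HF API; the masking logic just mirrors what ViTMAEEmbeddings.random_masking does, with mask_ratio=0.75 as in the MAE paper:

```python
import torch
from torch import nn


class MaskedPatchEmbeddings(nn.Module):
    """Hypothetical embeddings module: patch-embed the image (e.g. with a
    LayoutLMv3-style patch embedding module passed in), then apply
    ViT-MAE-style random masking so the encoder only sees visible patches."""

    def __init__(self, patch_embed, mask_ratio=0.75):
        super().__init__()
        self.patch_embed = patch_embed  # any module: pixel_values -> (B, N, D)
        self.mask_ratio = mask_ratio    # 0.75 is the MAE paper default

    def random_masking(self, sequence):
        # same shuffle/gather trick as ViTMAEEmbeddings.random_masking
        batch_size, seq_length, dim = sequence.shape
        len_keep = int(seq_length * (1 - self.mask_ratio))

        noise = torch.rand(batch_size, seq_length, device=sequence.device)
        ids_shuffle = torch.argsort(noise, dim=1)      # low noise = keep
        ids_restore = torch.argsort(ids_shuffle, dim=1)

        ids_keep = ids_shuffle[:, :len_keep]
        sequence_unmasked = torch.gather(
            sequence, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, dim)
        )

        # binary mask with the ViT-MAE convention: 0 = keep, 1 = masked
        mask = torch.ones(batch_size, seq_length, device=sequence.device)
        mask[:, :len_keep] = 0
        mask = torch.gather(mask, dim=1, index=ids_restore)

        return sequence_unmasked, mask, ids_restore

    def forward(self, pixel_values):
        embeddings = self.patch_embed(pixel_values)
        return self.random_masking(embeddings)
```

Any help would be greatly appreciated, thanks!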