Pre-training with a masked image modeling (MIM) objective, where the model must predict the discrete VAE tokens of masked patches, is quite powerful, since the VAE's codebook captures useful high-level visual information. BEiT was the first work to do this, achieving state-of-the-art performance on ImageNet after fine-tuning. This was later improved in BEiTv2.
LayoutLMv3 adopts the same pre-training objective (among other objectives), but with a VAE trained specifically to reconstruct document images.
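A minimal numpy sketch of this token-prediction objective: cross-entropy over the codebook, computed only on masked patches. The shapes, mask ratio, and random arrays standing in for the tokenizer and model outputs are all illustrative assumptions, not BEiT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, codebook_size = 196, 8192  # e.g. 14x14 patches, dVAE vocabulary

# Stand-in for the tokenizer output: one discrete code per patch.
visual_tokens = rng.integers(0, codebook_size, size=num_patches)

# Mask a subset of patches (BEiT uses blockwise masking; uniform here for simplicity).
mask = rng.random(num_patches) < 0.4

# Stand-in for the model output: logits over the codebook for every patch.
logits = rng.normal(size=(num_patches, codebook_size))

def mim_loss(logits, targets, mask):
    """Cross-entropy over codebook tokens, averaged over masked patches only."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

loss = mim_loss(logits, visual_tokens, mask)
```

The key point is that unmasked patches contribute nothing to the loss: the model is graded only on how well it infers the discrete codes of the patches it could not see.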
Other models, like MAE and SimMIM, also do masked image modeling, but instead of predicting VAE tokens for masked patches, they directly regress the raw pixel values. This has the benefit of not requiring a separately trained tokenizer.
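The pixel-regression variant can be sketched the same way: a mean squared error on masked patches, here with per-patch normalized pixel targets (an option MAE reports helps). Again the shapes, the 75% mask ratio, and the random stand-in arrays are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim = 196, 768  # e.g. 16x16x3 pixels flattened per patch

target_pixels = rng.normal(size=(num_patches, patch_dim))  # stand-in for the image
pred_pixels = rng.normal(size=(num_patches, patch_dim))    # stand-in for the decoder output
mask = rng.random(num_patches) < 0.75  # MAE masks a large fraction of patches

# Normalize targets per patch (mean 0, std 1), then take the MSE on masked patches only.
mu = target_pixels.mean(axis=-1, keepdims=True)
sigma = target_pixels.std(axis=-1, keepdims=True)
norm_targets = (target_pixels - mu) / (sigma + 1e-6)

loss = ((pred_pixels - norm_targets) ** 2).mean(axis=-1)[mask].mean()
```

Compared with the token-prediction sketch, the target here comes straight from the image itself, which is exactly why no separate tokenizer is needed.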
BEiTv2 currently seems to outperform MAE (as shown in the figure below).