Pre-training with a masked image modeling (MIM) objective, where the model must predict the discrete VAE tokens of masked patches, is quite powerful, since the VAE's codebook captures useful high-level visual information. BEiT was the first work to do this, achieving state-of-the-art performance on ImageNet after fine-tuning. This was later improved in BEiTv2.
LayoutLMv3 adopts the same pre-training objective (among other objectives), but with a VAE trained specifically to reconstruct document images.
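A minimal numpy sketch of this token-prediction objective: cross-entropy over the codebook, computed only on masked patches. The shapes, mask ratio, and random arrays standing in for the tokenizer and model outputs are all illustrative assumptions, not BEiT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, codebook_size = 196, 8192  # e.g. 14x14 patches, dVAE vocabulary

# Stand-in for the tokenizer output: one discrete code per patch.
visual_tokens = rng.integers(0, codebook_size, size=num_patches)

# Mask a subset of patches (BEiT uses blockwise masking; uniform here for simplicity).
mask = rng.random(num_patches) < 0.4

# Stand-in for the model output: logits over the codebook for every patch.
logits = rng.normal(size=(num_patches, codebook_size))

def mim_loss(logits, targets, mask):
    """Cross-entropy over codebook tokens, averaged over masked patches only."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

loss = mim_loss(logits, visual_tokens, mask)
```

The key point is that unmasked patches contribute nothing to the loss: the model is graded only on how well it infers the discrete codes of the patches it could not see.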
Other models, like MAE and SimMIM, also do masked image modeling, but instead of predicting VAE tokens for masked patches, they directly regress the raw pixel values. This has the benefit of not requiring a separately trained tokenizer.
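The pixel-regression variant can be sketched the same way: a mean squared error on masked patches, here with per-patch normalized pixel targets (an option MAE reports helps). Again the shapes, the 75% mask ratio, and the random stand-in arrays are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim = 196, 768  # e.g. 16x16x3 pixels flattened per patch

target_pixels = rng.normal(size=(num_patches, patch_dim))  # stand-in for the image
pred_pixels = rng.normal(size=(num_patches, patch_dim))    # stand-in for the decoder output
mask = rng.random(num_patches) < 0.75  # MAE masks a large fraction of patches

# Normalize targets per patch (mean 0, std 1), then take the MSE on masked patches only.
mu = target_pixels.mean(axis=-1, keepdims=True)
sigma = target_pixels.std(axis=-1, keepdims=True)
norm_targets = (target_pixels - mu) / (sigma + 1e-6)

loss = ((pred_pixels - norm_targets) ** 2).mean(axis=-1)[mask].mean()
```

Compared with the token-prediction sketch, the target here comes straight from the image itself, which is exactly why no separate tokenizer is needed.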
BEiTv2 currently seems to outperform MAE (as shown in the figure below).