LayoutLMv3 missing visual tokenizer?


How come LayoutLMv3 does not have a visual tokenizer in its code? The image goes straight from pixel values to embeddings.


The “tokenization” happens inside the model, using a 2D convolutional layer. One just needs to provide pixel_values to the model, and they will be turned into embedded patches.
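For illustration, here's a minimal sketch of how such a patch-embedding convolution works. The layer name, patch size, and hidden size below are illustrative assumptions, not LayoutLMv3's exact internals:

```python
import torch
from torch import nn

# Minimal sketch of patch embedding via a 2D convolution, assuming
# 224x224 RGB input, 16x16 patches, and a 768-dim hidden size.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

pixel_values = torch.randn(1, 3, 224, 224)       # (batch, channels, H, W)
patches = patch_embed(pixel_values)              # (1, 768, 14, 14)
embeddings = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): one vector per patch
```

So each 16x16 patch is mapped directly to a continuous embedding, without ever passing through a discrete vocabulary.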


Hey @nielsr, first, thanks for the reply.

So you’re saying that we do not need discrete integers to serve as ground-truth labels? Reading the paper, the authors seem to use a discrete VAE trained on document data. How come HF decided not to do the same? What’s the theoretical reasoning behind it?

Any help is appreciated

If possible, can you also explain why text needs to be tokenized into discrete units while images do not?

Some relevant literature for the discussion:

From BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Existing MIM approaches can be coarsely categorized into three according to the reconstruction targets:
low-level image elements (e.g., raw pixels; He et al. 2022; Fang et al. 2022; Liu et al. 2022), hand-crafted features (e.g., HOG features; Wei et al. 2021), and visual tokens (Bao et al. 2022; Wang et al. 2022; Dong et al. 2021; El-Nouby et al. 2021; Chen et al. 2022). However, all the reconstruction targets are about, explicitly or implicitly, low-level image elements while underestimating high-level semantics. In comparison, the masked words in language modeling (Devlin et al., 2019) are all about high-level semantics, which motivates us to tap the potential of MIM by exploiting semantic-aware supervision during pretraining.

I’d still love to know the reasoning behind HF’s departure from the original paper.


We haven’t deviated from the original paper; Microsoft just didn’t open-source the visual tokenizer used during pre-training (for the masked image modeling objective).

Note that after pre-training, the masked image modeling head is thrown away, and replaced by a classification head for downstream tasks.
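In transformers, this is the standard pattern: loading a task-specific class on top of the pre-trained backbone initializes a fresh head (the num_labels value below is just an example, e.g. for FUNSD-style token labels):

```python
from transformers import LayoutLMv3ForTokenClassification

# Loads the pre-trained backbone; the MIM head from pre-training is gone,
# and a randomly initialized token-classification head is added instead.
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7
)
```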


Oh, I see. For my purposes, I’ve been using an off-the-shelf VAE from DALL-E to learn the inner workings of pre-training a model. Can you shed some light on the effects of not using a visual tokenizer?

Also, thanks btw. I’ve been using a lot of your stuff to wrap my head around LayoutLMv3.

Pre-training with a masked image modeling (MIM) objective, where the model predicts the VAE’s tokens for masked patches, is pretty powerful, since the VAE’s codebook encodes useful information. BEiT was the first work to do so, and achieved state-of-the-art performance on ImageNet after fine-tuning. This was then improved in BEiT v2.
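Schematically, the BEiT-style objective looks something like this. This is a hedged sketch: `visual_tokenizer` stands in for the dVAE encoder and `model` for the backbone plus MIM head; neither is a concrete API.

```python
import torch
import torch.nn.functional as F

def beit_style_mim_loss(model, visual_tokenizer, pixel_values, mask):
    """Sketch of a BEiT-style MIM loss. `model` returns per-patch logits
    over the dVAE codebook; `visual_tokenizer` maps the image to discrete
    codebook indices; `mask` is a boolean (batch, num_patches) tensor."""
    with torch.no_grad():
        # Ground-truth targets: discrete codebook ids from the visual tokenizer.
        target_ids = visual_tokenizer(pixel_values)  # (batch, num_patches)

    logits = model(pixel_values, mask)               # (batch, num_patches, codebook_size)

    # Cross-entropy only on the masked patches, exactly like masked
    # language modeling over a text vocabulary.
    return F.cross_entropy(logits[mask], target_ids[mask])
```

This is why the discrete integers matter for this objective: they turn image reconstruction into a classification problem over a learned "visual vocabulary".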

LayoutLMv3 adopts the same pre-training objective (among other objectives), but with a VAE trained specifically to reconstruct document images.

Other models, like MAE and SimMIM, also do masked image modeling, but instead of predicting VAE tokens for masked patches, they directly predict the raw pixel values. This has the benefit of not requiring a separate VAE.
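The pixel-regression variant is even simpler to sketch, since the target is the image itself. Again, this is illustrative only; MAE in particular also uses per-patch normalization and an asymmetric encoder/decoder, which are omitted here:

```python
import torch
import torch.nn.functional as F

def pixel_mim_loss(model, pixel_values, mask, patch_size=16):
    """Sketch of a MAE/SimMIM-style loss: regress raw pixels of masked patches.
    `model` predicts one flattened patch per position; `mask` is boolean
    with shape (batch, num_patches)."""
    b, c, h, w = pixel_values.shape
    # Split the image into flattened patches: (batch, num_patches, patch_dim).
    targets = (
        pixel_values.unfold(2, patch_size, patch_size)
        .unfold(3, patch_size, patch_size)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(b, -1, c * patch_size * patch_size)
    )

    preds = model(pixel_values, mask)  # same shape as `targets`

    # Regression loss on masked patches only; no dVAE or codebook needed.
    return F.mse_loss(preds[mask], targets[mask])
```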

BEiT v2 seems to perform better than MAE at the moment (as shown in the figure below).

[figure: BEiT v2 vs. MAE comparison]


Thank you for contextualizing the answer. 🙌