LayoutLMv3 missing visual tokenizer?

Hi,

How come LayoutLMv3 does not have a visual tokenizer in its code? The image goes straight from pixel values to embeddings.

Hi,

The “tokenization” happens inside the model, using a 2D convolutional layer. You just need to provide pixel_values to the model, and they will be turned into embedded patches.
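For reference, here is a minimal sketch of what that means in practice (dummy inputs, just to show that the image path is pixel_values in, patch embeddings out):

```python
# Minimal sketch: LayoutLMv3 embeds the image with a Conv2d patch projection,
# so you only pass pixel_values -- no discrete visual tokens at inference time.
import torch
from transformers import LayoutLMv3Model

model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")

input_ids = torch.tensor([[0]])                   # just the <s> token as a stand-in for OCR text
bbox = torch.zeros((1, 1, 4), dtype=torch.long)   # one (dummy) bounding box per text token
pixel_values = torch.randn(1, 3, 224, 224)        # raw (normalized) pixels, not tokens

outputs = model(input_ids=input_ids, bbox=bbox, pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # text tokens + visual CLS + 14x14 image patches

# The "visual tokenizer" at inference time is just this projection
# (attribute names may vary slightly across transformers versions):
print(model.patch_embed.proj)  # Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
```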


Hey @nielsr, first, thanks for the reply.

So you’re saying that we do not need discrete integers to serve as ground-truth labels? Reading the paper, it seems the authors use a VAE trained on document data. Why did HF decide not to do the same? What’s the theoretical reasoning behind that choice?

Any help is appreciated

EDIT:
If possible, can you explain why text needs to be tokenized into discrete units while images do not?

Some relevant literature for the discussion:

From BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Existing MIM approaches can be coarsely categorized to three according to the reconstruction targets: low-level image elements (e.g., raw pixels; He et al., 2022; Fang et al., 2022; Liu et al., 2022), hand-crafted features (e.g., HOG features; Wei et al., 2021), and visual tokens (Bao et al., 2022; Wang et al., 2022; Dong et al., 2021; El-Nouby et al., 2021; Chen et al., 2022). However, all the reconstruction targets are about, explicitly or implicitly, low-level image elements while underestimating high-level semantics. In comparison, the masked words in language modeling (Devlin et al., 2019) are all about high-level semantics, which motivates us to tap the potential of MIM by exploiting semantic-aware supervision during pretraining.

I’d still love to know the reasoning behind HF’s departure from the OG paper.

Hi,

We haven’t deviated from the original paper; Microsoft just didn’t open-source the visual tokenizer used for pre-training (for the masked image modeling objective).

Note that after pre-training, the masked image modeling head is thrown away, and replaced by a classification head for downstream tasks.
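In code, fine-tuning therefore starts from the base checkpoint and attaches a freshly initialized head, roughly like this (the label count here is just a hypothetical example):

```python
# Sketch: for downstream tasks, the pre-training heads (including MIM) are
# discarded and a fresh task head is attached on top of the pre-trained encoder.
from transformers import LayoutLMv3ForTokenClassification

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",  # encoder weights from pre-training
    num_labels=7,                 # hypothetical, e.g. FUNSD-style BIO tags
)
# The classifier weights are newly initialized and learned during fine-tuning;
# nothing from the masked image modeling head is needed (or released).
```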


Oh, I see. For my purposes, I’ve been using a random VAE from DALL-E to learn the inner workings of pre-training a model. Can you shed some light on the effects of not using a visual tokenizer?
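Concretely, what I’ve been doing is roughly this (a minimal sketch assuming OpenAI’s dall_e package and its publicly hosted encoder weights; the shapes are for a 256x256 input):

```python
# Rough sketch of using the DALL-E dVAE as an off-the-shelf visual tokenizer
# (assumes OpenAI's dall_e package: pip install dall-e).
import torch
from dall_e import load_model, map_pixels

device = torch.device("cpu")
encoder = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)

# A dummy 256x256 "document" image with values in [0, 1].
image = torch.rand(1, 3, 256, 256)
z_logits = encoder(map_pixels(image))           # (1, 8192, 32, 32) logits over the codebook
visual_tokens = torch.argmax(z_logits, dim=1)   # (1, 32, 32) discrete token ids

print(visual_tokens.shape)  # these ids would serve as MIM targets for masked patches
```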

Also, thanks btw. I’ve been using a lot of your stuff to wrap my head around LayoutLMv3.

Pre-training using a masked image modeling (MIM) objective where the model needs to predict tokens of a VAE for masked patches is pretty powerful, since the VAE contains useful information in its codebook. BEiT was the first work to do so, and got state-of-the-art performance on ImageNet after fine-tuning. This was then improved in BEiTv2.
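To make the objective concrete, a BEiT-style MIM loss looks roughly like this (a conceptual sketch, not LayoutLMv3’s actual code; visual_tokenizer, encoder, and mim_head are hypothetical stand-ins for the real modules):

```python
# Conceptual sketch of BEiT-style masked image modeling: a frozen image
# tokenizer (e.g. a dVAE) maps each patch to a discrete codebook index, and the
# model is trained with cross-entropy to predict those indices for masked patches.
import torch
import torch.nn.functional as F

def mim_loss(pixel_values, encoder, mim_head, visual_tokenizer, mask):
    # mask: bool tensor (batch, num_patches), True where a patch is masked
    with torch.no_grad():
        target_ids = visual_tokenizer(pixel_values)   # (batch, num_patches) codebook ids

    patch_states = encoder(pixel_values, mask)        # (batch, num_patches, hidden_size)
    logits = mim_head(patch_states)                   # (batch, num_patches, codebook_size)

    # Only the masked positions contribute to the loss, as in BERT's MLM.
    return F.cross_entropy(logits[mask], target_ids[mask])
```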

LayoutLMv3 adopts the same pre-training objective (among other objectives), but with a VAE especially trained to reconstruct document images.

Other models, like MAE and SimMIM, also do masked image modeling, but instead of predicting tokens of a VAE for masked patches, they directly predict the raw pixel values. This has the benefit of not requiring a separate VAE.
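For contrast, the pixel-level variant looks roughly like this (again a conceptual sketch with hypothetical encoder and pixel_head modules; SimMIM uses an L1 loss on the masked patches, MAE an MSE on per-patch-normalized pixels):

```python
# Conceptual sketch of pixel-level MIM (MAE/SimMIM-style): instead of codebook
# ids, the model regresses the raw pixels of the masked patches.
import torch
import torch.nn.functional as F

def pixel_mim_loss(pixel_values, encoder, pixel_head, mask, patch_size=16):
    # Flatten the image into per-patch pixel targets: (batch, num_patches, patch_dim)
    b, c, h, w = pixel_values.shape
    targets = pixel_values.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    targets = targets.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)

    patch_states = encoder(pixel_values, mask)    # (batch, num_patches, hidden_size)
    predictions = pixel_head(patch_states)        # (batch, num_patches, patch_dim)

    # L1 reconstruction loss on the masked patches only (SimMIM-style).
    return F.l1_loss(predictions[mask], targets[mask])
```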

BEiT v2 seems to be better than MAE at the moment (as shown in the figure below).


Thank you for contextualizing the answer. :raised_hands: