Visual Tokenization / Masking in BEiT & LayoutLMv3

tldr questions:

  1. Why do papers like LayoutLMv3 mention visual tokenizers that are absent from most open-source implementations?
  2. Why do most BEiT implementations lack the visual tokenizer?
  3. Why do pixel values go through `PatchEmbed` straight into the encoder without any tokenization?
  4. What is the relationship between `PatchEmbed`, `self.mask_token = nn.Parameter()`, and the visual tokenizer?

Background:

I’m coding the heads for LayoutLMv3 pretraining. The paper says it uses the visual tokenizer from DiT to tokenize image patches for the masked image modeling objective. DiT is essentially BEiT pretrained on document images, with a dVAE tokenizer trained on documents.
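To make question 1 concrete, here is a minimal sketch of what I understand such a visual tokenizer to be. Everything in it is a hypothetical stand-in (the real dVAE is a pretrained model, e.g. DALL-E's for BEiT or a document-trained one for DiT, and `get_codebook_indices` just mirrors the method name I saw in Microsoft's code):

```python
import torch
import torch.nn as nn

class DVAETokenizer(nn.Module):
    """Hypothetical stand-in for a dVAE visual tokenizer.
    Maps an image to a grid of discrete codebook indices, one per patch."""
    def __init__(self, codebook_size=8192, hidden=128):
        super().__init__()
        # Two stride-4 convs downsample 224x224 -> 14x14, i.e. one spatial
        # position per 16x16 patch, producing logits over the codebook.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(hidden, codebook_size, kernel_size=4, stride=4),
        )

    @torch.no_grad()
    def get_codebook_indices(self, images):
        logits = self.encoder(images)            # (B, codebook_size, 14, 14)
        return logits.argmax(dim=1).flatten(1)   # (B, 196) discrete token ids

tokenizer = DVAETokenizer()
visual_token_ids = tokenizer.get_codebook_indices(torch.randn(2, 3, 224, 224))
print(visual_token_ids.shape)  # torch.Size([2, 196])
```

So "visual tokenization" here means assigning each patch a discrete id from a codebook, analogous to a text token id.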

But when I look at the BEiT code in Hugging Face and timm, there’s no tokenization of image patches.

The pixel values get transformed into patch embeddings that go straight into the encoder.
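For question 3, this is essentially all that happens before the encoder: a strided convolution turns each patch into a continuous embedding vector, with no discrete tokenization anywhere. A simplified sketch of what timm's `PatchEmbed` does:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: a strided conv that turns each
    patch_size x patch_size patch into one continuous embedding vector.
    No discrete tokenization happens here."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):
        # (B, 3, 224, 224) -> (B, 768, 14, 14) -> (B, 196, 768)
        x = self.proj(pixel_values)
        return x.flatten(2).transpose(1, 2)

embeds = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(embeds.shape)  # torch.Size([2, 196, 768])
```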

In Microsoft’s BEiT implementation I managed to find an example that makes use of the visual tokenizer.

After successfully using the visual tokenizer, my confusion only increased, which led me to question 4.
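To show where my confusion sits, here is my current mental model of a BEiT-style MIM training step (a sketch under my own assumptions, reusing the hypothetical `DVAETokenizer` and `PatchEmbed` from above plus stand-in `encoder` and `mim_head` modules; please correct me if this is wrong):

```python
import torch
import torch.nn.functional as F

def mim_step(images, bool_masked_pos, patch_embed, tokenizer,
             encoder, mim_head, mask_token):
    # 1) Targets: the frozen visual tokenizer assigns each patch a
    #    discrete codebook id. This is the only place it is used.
    with torch.no_grad():
        target_ids = tokenizer.get_codebook_indices(images)      # (B, N)

    # 2) Inputs: continuous patch embeddings, with masked positions
    #    overwritten by the learnable mask_token.
    x = patch_embed(images)                                      # (B, N, D)
    w = bool_masked_pos.unsqueeze(-1).type_as(x)
    x = x * (1 - w) + mask_token.expand_as(x) * w

    # 3) Loss: predict the codebook id of each masked patch.
    logits = mim_head(encoder(x))                   # (B, N, codebook_size)
    return F.cross_entropy(logits[bool_masked_pos],
                           target_ids[bool_masked_pos])
```

If that is right, it would explain why inference-only implementations ship without the tokenizer: it only produces training targets and never touches the forward pass.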

Thanks for reading. Any light shed on this subject would be amazing.

💡 moment:

Is `self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))` what the to-be-masked visual embeddings get replaced with? And is that mask token itself a learnable parameter?
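Concretely, this is roughly the masking step I see in BEiT-style code (a sketch; shapes and names are illustrative):

```python
import torch
import torch.nn as nn

B, N, D = 2, 196, 768
patch_embeds = torch.randn(B, N, D)              # output of the patch embedding
bool_masked_pos = torch.rand(B, N) < 0.4         # True where a patch is masked
mask_token = nn.Parameter(torch.zeros(1, 1, D))  # learnable, trained end to end

# Overwrite the embedding of every masked patch with the broadcast mask token.
w = bool_masked_pos.unsqueeze(-1).type_as(patch_embeds)
masked_embeds = patch_embeds * (1 - w) + mask_token.expand(B, N, -1) * w
print(masked_embeds.shape)  # torch.Size([2, 196, 768])
```

If that is right, the mask token is a learnable input placeholder and has nothing to do with the tokenizer's discrete visual tokens, which only ever appear as prediction targets. Confirmation either way would be great.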