Visual Tokenization / Masking in BEiT & LayoutLMv3

tldr questions:

  1. Why do papers like LayoutLMv3 mention visual tokenizers that are absent from most open-source implementations?
  2. Why do most BEiT implementations lack the visual tokenizer?
  3. Why do pixel values go through `PatchEmbed` straight into the encoder without any tokenization?
  4. What is the relationship between `PatchEmbed`, `self.mask_token = nn.Parameter()`, and the visual tokenizer?

Background:

I’m coding the heads for LayoutLMv3 pretraining. The paper says it uses the visual tokenizer from DiT to tokenize image patches for the masked image modeling objective. DiT is essentially BEiT pretrained on document images, with a dVAE tokenizer trained on documents.
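To make question 1 concrete, here is a minimal sketch of what I understand such a visual tokenizer to be. Everything in it is a hypothetical stand-in (the real dVAE is a pretrained model, e.g. DALL-E's for BEiT or a document-trained one for DiT, and `get_codebook_indices` just mirrors the method name I saw in Microsoft's code):

```python
import torch
import torch.nn as nn

class DVAETokenizer(nn.Module):
    """Hypothetical stand-in for a dVAE visual tokenizer.
    Maps an image to a grid of discrete codebook indices, one per patch."""
    def __init__(self, codebook_size=8192, hidden=128):
        super().__init__()
        # Two stride-4 convs downsample 224x224 -> 14x14, i.e. one spatial
        # position per 16x16 patch, producing logits over the codebook.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(hidden, codebook_size, kernel_size=4, stride=4),
        )

    @torch.no_grad()
    def get_codebook_indices(self, images):
        logits = self.encoder(images)            # (B, codebook_size, 14, 14)
        return logits.argmax(dim=1).flatten(1)   # (B, 196) discrete token ids

tokenizer = DVAETokenizer()
visual_token_ids = tokenizer.get_codebook_indices(torch.randn(2, 3, 224, 224))
print(visual_token_ids.shape)  # torch.Size([2, 196])
```

So "visual tokenization" here means assigning each patch a discrete id from a codebook, analogous to a text token id.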

But when I look at the BEiT code in Hugging Face and timm, there’s no tokenization of image patches.

The pixel values get transformed into patch embeddings that go straight into the encoder.
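For question 3, this is essentially all that happens before the encoder: a strided convolution turns each patch into a continuous embedding vector, with no discrete tokenization anywhere. A simplified sketch of what timm's `PatchEmbed` does:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: a strided conv that turns each
    patch_size x patch_size patch into one continuous embedding vector.
    No discrete tokenization happens here."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):
        # (B, 3, 224, 224) -> (B, 768, 14, 14) -> (B, 196, 768)
        x = self.proj(pixel_values)
        return x.flatten(2).transpose(1, 2)

embeds = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(embeds.shape)  # torch.Size([2, 196, 768])
```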

In Microsoft’s BEiT implementation I managed to find an example that makes use of the visual tokenizer.

After successfully using the visual tokenizer, my confusion only increased, which led me to question 4.
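To show where my confusion sits, here is my current mental model of a BEiT-style MIM training step (a sketch under my own assumptions, reusing the hypothetical `DVAETokenizer` and `PatchEmbed` from above plus stand-in `encoder` and `mim_head` modules; please correct me if this is wrong):

```python
import torch
import torch.nn.functional as F

def mim_step(images, bool_masked_pos, patch_embed, tokenizer,
             encoder, mim_head, mask_token):
    # 1) Targets: the frozen visual tokenizer assigns each patch a
    #    discrete codebook id. This is the only place it is used.
    with torch.no_grad():
        target_ids = tokenizer.get_codebook_indices(images)      # (B, N)

    # 2) Inputs: continuous patch embeddings, with masked positions
    #    overwritten by the learnable mask_token.
    x = patch_embed(images)                                      # (B, N, D)
    w = bool_masked_pos.unsqueeze(-1).type_as(x)
    x = x * (1 - w) + mask_token.expand_as(x) * w

    # 3) Loss: predict the codebook id of each masked patch.
    logits = mim_head(encoder(x))                   # (B, N, codebook_size)
    return F.cross_entropy(logits[bool_masked_pos],
                           target_ids[bool_masked_pos])
```

If that is right, it would explain why inference-only implementations ship without the tokenizer: it only produces training targets and never touches the forward pass.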

Thanks for reading. Any light shed on this subject would be amazing.

💡 moment:

Is `self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))` what the to-be-masked visual embeddings get replaced with? And is that mask token itself a learnable parameter?
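Concretely, this is roughly the masking step I see in BEiT-style code (a sketch; shapes and names are illustrative):

```python
import torch
import torch.nn as nn

B, N, D = 2, 196, 768
patch_embeds = torch.randn(B, N, D)              # output of the patch embedding
bool_masked_pos = torch.rand(B, N) < 0.4         # True where a patch is masked
mask_token = nn.Parameter(torch.zeros(1, 1, D))  # learnable, trained end to end

# Overwrite the embedding of every masked patch with the broadcast mask token.
w = bool_masked_pos.unsqueeze(-1).type_as(patch_embeds)
masked_embeds = patch_embeds * (1 - w) + mask_token.expand(B, N, -1) * w
print(masked_embeds.shape)  # torch.Size([2, 196, 768])
```

If that is right, the mask token is a learnable input placeholder and has nothing to do with the tokenizer's discrete visual tokens, which only ever appear as prediction targets. Confirmation either way would be great.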