Basic questions about padding in the Original ViT

FatemehBehrad · April 25, 2024, 6:46am

Hello,

I have three questions about using padding. When our patches don’t have the same length, we add padding to make fixed-size inputs. However, we need to identify which patches are padding so that the model doesn’t attend to those parts.

1. Based on the code, the interpolated positional embedding is also added to the padded patches, so the position embedding treats padded patches like any other patch. This is problematic because different images have different amounts of padding, and each time the padding is treated as part of the input. How does ViT handle this?
2. To apply the mask to the patches (bool_masked_pos), the code uses a parameter called mask_tokens. In the original ViT this parameter is None; in DINO it is all zeros. Now that I do need this parameter, should I also set it to all zeros?
3. Based on online resources (here), in addition to bool_masked_pos we need to use the head_mask parameter. But I noticed that the size of the head mask should be number_heads, which means it only allows me to keep or ignore a head completely. So how can it affect padding? And how should I choose which heads to mask?
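
To make the granularity issue in the questions above concrete, here is a minimal plain-Python sketch. It assumes bool_masked_pos carries one flag per patch and head_mask one multiplier per attention head, as described above. The helper names apply_patch_mask and apply_head_mask, the toy shapes, and the all-zero mask token are hypothetical and are not taken from the Transformers source.

```python
def apply_patch_mask(patch_embeddings, bool_masked_pos, mask_token):
    """Replace the embedding of every flagged patch with the shared mask token."""
    return [list(mask_token) if flagged else list(emb)
            for emb, flagged in zip(patch_embeddings, bool_masked_pos)]


def apply_head_mask(attn_probs, head_mask):
    """Scale each head's attention probabilities by that head's multiplier.

    attn_probs[h][q][k] is the weight query patch q gives key patch k in head h.
    """
    return [[[p * keep for p in row] for row in head]
            for head, keep in zip(attn_probs, head_mask)]


# Toy input (hypothetical): 3 patches with hidden size 2; the last patch is padding.
patches = [[0.3, 0.7], [0.1, 0.9], [5.0, 5.0]]
bool_masked_pos = [False, False, True]
mask_token = [0.0, 0.0]  # assumption: an all-zero mask token
masked_patches = apply_patch_mask(patches, bool_masked_pos, mask_token)

# Two heads, each with uniform attention over the 3 patches.
uniform = [[1 / 3] * 3 for _ in range(3)]
attn_probs = [uniform, uniform]
head_mask = [1.0, 0.0]  # keep head 0, drop head 1 entirely
masked_attn = apply_head_mask(attn_probs, head_mask)
```

In this toy case the head mask removes head 1 for every patch, real or padded, while head 0 still gives weight to the padded patch. The patch mask only changes what the padded patch contains; it does not stop other patches from attending to it.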

Overall, what should I do when different images have different amounts of padding and I don’t want the model to attend to the padded tokens?
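
For reference, the behaviour being asked for is what a generic key-padding mask provides in attention: padded positions get a score of negative infinity before the softmax, so they receive zero weight. The sketch below shows that idea in plain Python for a single query. It is an illustration of the general technique under assumed toy values, not a description of what the ViT code does, and the function name masked_attention_weights is hypothetical.

```python
import math


def masked_attention_weights(scores, is_padding):
    """Softmax over one query's attention scores, ignoring padded key positions.

    scores: raw attention logits, one per patch token.
    is_padding: booleans, True where the patch is padding.
    """
    if all(is_padding):
        raise ValueError("at least one patch must be a real (non-padded) patch")
    # Padded keys get -inf so that exp(-inf) == 0 in the softmax.
    masked = [float("-inf") if pad else s for s, pad in zip(scores, is_padding)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical image with 3 real patches followed by 2 padded ones.
weights = masked_attention_weights(
    [0.5, 1.0, 0.2, 3.0, 3.0],
    [False, False, False, True, True],
)
```

Because the mask is built per image, each image in a batch can have a different number of padded patches while the input size stays fixed.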

Thank you so much for your time and answer.
