Basic questions about padding in the original ViT

Hello,

I have three questions about using padding. When the patch sequences of different images don’t have the same length, we add padding to get fixed-size inputs. However, we then need to identify which patches are padding so that the model doesn’t attend to them.
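To make the setup concrete, here is a minimal sketch of the kind of padding I mean, written in plain Python rather than taken from the actual ViT code. The helper name `pad_patches`, the zero pad value, and the "True = real patch" convention are my own assumptions for illustration.

```python
def pad_patches(patch_seqs, pad_value=0.0):
    """Pad variable-length patch sequences to a common length.

    Returns the padded batch and a boolean mask where True marks a real
    patch and False marks padding. Assumes the first sequence is non-empty
    so the patch dimension can be read from it.
    """
    max_len = max(len(seq) for seq in patch_seqs)
    dim = len(patch_seqs[0][0])
    padded, is_real = [], []
    for seq in patch_seqs:
        n_pad = max_len - len(seq)
        # Copy the real patches, then append n_pad constant-valued patches.
        padded.append([list(p) for p in seq] + [[pad_value] * dim for _ in range(n_pad)])
        is_real.append([True] * len(seq) + [False] * n_pad)
    return padded, is_real


# Two "images": one with 3 patches, one with 1 patch (patch dim = 2).
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
]
padded, is_real = pad_patches(batch)
```

The `is_real` mask is the information I would like the model to respect, since the amount of padding differs from image to image.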

  1. Based on the code, the interpolated positional embedding is added to the padded patches as well, so the position embedding treats padded patches like any other patch. This seems problematic: different images have different amounts of padding, yet the padding is always treated as part of the input. How does ViT handle this?

  2. To apply the mask to the patches (bool_masked_pos), we use a parameter called mask_tokens. In the original ViT this parameter is None; in DINO it is all zeros. Since I do need this parameter, should I also set it to all zeros?

  3. Based on online resources (here), in addition to bool_masked_pos we need to use the head_mask parameter. But I noticed that the size of head_mask should be number_heads, which means it only lets me keep or ignore a head completely. So how can it affect padding? And how should I choose which heads to mask?
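To show why question 3 confuses me, here is a toy sketch in plain Python, not the library code. It assumes that head_mask simply scales all attention weights of a head by one number, which is my reading of the code and may be wrong. `masked_softmax` and `key_is_real` are hypothetical names for the kind of per-token mask I am actually looking for.

```python
import math


def masked_softmax(scores, key_is_real):
    """Softmax over keys where padded keys get exactly zero weight.

    Assumes at least one key is real; otherwise the result is undefined.
    """
    masked = [s if real else float("-inf") for s, real in zip(scores, key_is_real)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Attention scores of one query over 3 keys, for 2 heads.
scores_per_head = [[2.0, 1.0, 0.5], [0.1, 0.3, 3.0]]

# Per-token mask: the third key is padding, so every head ignores it.
key_is_real = [True, True, False]
per_token = [masked_softmax(s, key_is_real) for s in scores_per_head]

# Per-head mask (assumed behaviour): head 1 is switched off entirely,
# but head 0 still attends to the padded third key.
head_mask = [1.0, 0.0]
per_head = [
    [w * h for w in masked_softmax(s, [True, True, True])]
    for s, h in zip(scores_per_head, head_mask)
]
```

In this sketch the per-token mask removes the padded key from every head, while the per-head mask cannot single out a token at all, which is why I don’t see how head_mask helps with padding.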

Overall, what should I do when different images have different amounts of padding and I don’t want the model to attend to the padded tokens?

Thank you so much for your time and answer. :pray: