For some reason, I thought that when no attention mask is given to a model, one is automatically generated that accounts for padding tokens. Out of habit, I always pass the full encoding from the tokenizer (input IDs, attention mask, token type IDs), but then I saw this example, which only passes the input IDs. I would have expected that to work fine, yet if you pass the attention mask explicitly, you get different results than without it.
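A minimal sketch of what I mean (bert-base-uncased is just a stand-in for the model from the example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Two sentences of different lengths, so the batch contains padding
encoding = tokenizer(
    ["a short sentence", "a noticeably longer input sentence than the first one"],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Full encoding: input_ids, attention_mask, token_type_ids
    with_mask = model(**encoding).last_hidden_state
    # Input IDs only: no attention mask reaches the model
    without_mask = model(encoding["input_ids"]).last_hidden_state

# The hidden states of the padded sequence differ,
# since the padding tokens are attended to in the second call
print(torch.allclose(with_mask, without_mask))  # -> False
```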
To confirm, I looked at the source code of BERT, and indeed, when no attention mask is given, a mask of shape batch_size x sequence_length filled with all 1's is created (no special treatment of padding).
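The relevant lines in modeling_bert.py look roughly like this (paraphrased; the exact code varies between versions):

```python
# In BertModel.forward (paraphrased): with no mask given, every position,
# including padding, is treated as a real token
if attention_mask is None:
    attention_mask = torch.ones(input_shape, device=device)
```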
So can anyone confirm: relying on the automatically generated attention mask is not enough, because it does not block attention to padding tokens, correct? That would imply you should always pass the full encoding from the tokenizer, i.e. model(**encoding), to ensure the attention mask is included.
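And if for some reason you only have the padded input IDs, I assume you could rebuild the mask from the pad token yourself, something like:

```python
# Sketch: reconstruct the attention mask from the pad token ID
# (assumes input_ids is a padded LongTensor and the tokenizer has a pad token)
input_ids = encoding["input_ids"]
attention_mask = (input_ids != tokenizer.pad_token_id).long()
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```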