Do automatically generated attention masks ignore padding?

For some reason, I thought that when no attention mask was given to a model, one was automatically generated that accounts for padding tokens. Out of habit I always pass the full encoding from the tokenizer (input IDs, attention mask, token type IDs), but then I saw this example, which does not do that: it only passes the input IDs. I would have expected that to work fine, but if you include the attention mask explicitly you get different results than without it.

To confirm, I looked at the source code of BERT and found that, indeed, when no attention mask is given, a mask of shape batch_size x sequence_length filled with all 1s is created, with no special treatment for padding.
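As an illustration of the fallback described above (a simplified sketch, not the actual transformers source code), the model effectively does something like this when no mask is supplied:

```python
def default_attention_mask(input_ids, attention_mask=None):
    """Mimic the fallback: a batch_size x seq_len mask of all 1s.

    Padding positions receive a 1 just like real tokens, so nothing
    stops the model from attending to them.
    """
    if attention_mask is None:
        attention_mask = [[1] * len(seq) for seq in input_ids]
    return attention_mask

# 0 stands in for a hypothetical padding token ID
batch = [[101, 2023, 102, 0, 0],
         [101, 2023, 2003, 2146, 102]]
print(default_attention_mask(batch))  # all 1s, padding included
```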

So can anyone confirm? Relying on the automatically generated attention mask is not enough, because it does not block attention to padding tokens, is that correct? This implies that you should always pass the full encoding of a tokenizer to model(**encoding) to ensure that the attention mask is included.
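A small self-contained example of why the results differ: in softmax attention, an unmasked padding position still receives nonzero weight and so changes every other weight. This is a minimal single-query sketch, not the transformers implementation:

```python
import math

def attn_weights(scores, mask):
    """Softmax over attention scores; masked-out positions (mask == 0)
    are pushed to -inf so they get exactly zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    exps = [math.exp(s) if s != float("-inf") else 0.0 for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]                      # last position is padding
no_mask = attn_weights(scores, [1, 1, 1])     # all-ones fallback mask
with_mask = attn_weights(scores, [1, 1, 0])   # proper padding mask
print(no_mask, with_mask)  # the distributions differ
```

With the all-ones fallback, the padding position soaks up probability mass, which is exactly why explicit masks give different model outputs.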


Yes, you need to pass the attention mask returned by the tokenizer. Most models don’t know the padding token ID, so they can’t generate an attention mask that ignores it.
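This is why the mask has to come from the tokenizer, which does know the pad token ID. A sketch of how a tokenizer can derive the mask during padding (hypothetical helper, not the real tokenizers API):

```python
def pad_and_mask(sequences, pad_id=0):
    """Pad a batch to equal length and build the matching attention mask:
    1 for real tokens, 0 for padding, based on the known pad token ID."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoding = pad_and_mask([[101, 2023, 102], [101, 102]])
# then: model(**encoding) so the mask is actually used
```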

Thanks for confirming! I felt like I was going crazy.

Has this ever been different, as far as you remember? I have vague memories of the attention mask being created based on the pad token ID (as the tokenizer currently does), but I might be confusing the tokenizers source code with the model's forward.

Not in the past two years at least. Can't speak for older than this 🙂

I went back to 0.6.2 of pytorch_pretrained_bert, and there the mask is also simply created as all 1s when none is passed.

So I must have been mixing things up. Thanks for giving me some peace of mind. 😉
