Do automatically generated attention masks ignore padding?

For some reason, I thought that when no attention mask was given to a model, one was automatically generated that accounts for padding tokens. Out of habit I always pass the full encoding from the tokenizer (input IDs, attention mask, token type IDs), but then I saw this example, which does not do that: it only passes the input IDs. I would have expected that to work fine, but if you include the attention mask explicitly you get different results than without it.

To confirm, I looked at the source code of BERT and found that, indeed, when no attention mask is given, a mask of shape batch_size x sequence_length filled with all 1s is created, with no special treatment for padding.
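As an illustration of the fallback described above (a simplified sketch, not the actual transformers source code), the model effectively does something like this when no mask is supplied:

```python
def default_attention_mask(input_ids, attention_mask=None):
    """Mimic the fallback: a batch_size x seq_len mask of all 1s.

    Padding positions receive a 1 just like real tokens, so nothing
    stops the model from attending to them.
    """
    if attention_mask is None:
        attention_mask = [[1] * len(seq) for seq in input_ids]
    return attention_mask

# 0 stands in for a hypothetical padding token ID
batch = [[101, 2023, 102, 0, 0],
         [101, 2023, 2003, 2146, 102]]
print(default_attention_mask(batch))  # all 1s, padding included
```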

So can anyone confirm? Relying on the automatically generated attention mask is not enough, because it does not block attention to padding tokens, is that correct? This implies that you should always pass the full encoding of a tokenizer to model(**encoding) to ensure that the attention mask is included.
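A small self-contained example of why the results differ: in softmax attention, an unmasked padding position still receives nonzero weight and so changes every other weight. This is a minimal single-query sketch, not the transformers implementation:

```python
import math

def attn_weights(scores, mask):
    """Softmax over attention scores; masked-out positions (mask == 0)
    are pushed to -inf so they get exactly zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    exps = [math.exp(s) if s != float("-inf") else 0.0 for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]                      # last position is padding
no_mask = attn_weights(scores, [1, 1, 1])     # all-ones fallback mask
with_mask = attn_weights(scores, [1, 1, 0])   # proper padding mask
print(no_mask, with_mask)  # the distributions differ
```

With the all-ones fallback, the padding position soaks up probability mass, which is exactly why explicit masks give different model outputs.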


Yes, you need to pass the attention mask returned by the tokenizer. Most models don’t know the padding token ID, so they can’t generate an attention mask that ignores it.
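This is why the mask has to come from the tokenizer, which does know the pad token ID. A sketch of how a tokenizer can derive the mask during padding (hypothetical helper, not the real tokenizers API):

```python
def pad_and_mask(sequences, pad_id=0):
    """Pad a batch to equal length and build the matching attention mask:
    1 for real tokens, 0 for padding, based on the known pad token ID."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoding = pad_and_mask([[101, 2023, 102], [101, 102]])
# then: model(**encoding) so the mask is actually used
```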

Thanks for confirming! I felt like I was going crazy.

Has this ever been different, as far as you remember? I have vague memories of the attention mask being created based on the pad token ID (as the tokenizer currently does), but I might be confusing the tokenizers source code with the model's forward.

Not in the past two years at least. Can't speak for older than this 🙂

I went back to 0.6.2 of pytorch_pretrained_bert, and there the mask is also simply created as all 1s when none is passed.

So I must have been mixing things up. Thanks for giving me some peace of mind. 😉
