Why can padding tokens attend to other tokens in masked self attention?

Traditionally, to create the self-attention mask, we zero out the columns for padding tokens so that they aren't attended to, and we also zero out the rows for padding tokens so that they don't attend to other tokens.
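
For concreteness, here is a minimal sketch of what I mean by that "traditional" mask, using a made-up 4-token example in PyTorch where both the padding rows and the padding columns end up zeroed:

```python
import torch

# pad_mask: (batch_size, seq_len) with 1 = real token, 0 = padding.
pad_mask = torch.tensor([[1, 1, 0, 0]])

# Outer product per batch element zeroes both rows and columns for padding.
full_mask = pad_mask[:, :, None] * pad_mask[:, None, :]  # (batch_size, seq_len, seq_len)
print(full_mask[0])
# tensor([[1, 1, 0, 0],
#         [1, 1, 0, 0],
#         [0, 0, 0, 0],
#         [0, 0, 0, 0]])
```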

However, in the implementation of _prepare_4d_attention_mask in transformers.modeling_attn_mask_utils, the original mask of shape (batch_size, seq_len) is broadcast to shape (batch_size, 1, tgt_len, seq_len).

As a result, when tgt_len is the same as seq_len, the rows corresponding to padding tokens (in the last two dimensions) are not zeroed out, so padding tokens are still allowed to attend to other tokens.
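
Below is a rough sketch of the broadcasting behavior I'm describing (not the actual library code, just equivalent logic as I understand it): the 2D mask is only expanded along the key dimension, so every query row, including the rows for padding queries, gets the same key mask.

```python
import torch

def expand_mask_sketch(mask, dtype, tgt_len=None):
    # mask: (batch_size, seq_len) with 1 = real token, 0 = padding.
    bsz, src_len = mask.shape
    tgt_len = tgt_len if tgt_len is not None else src_len

    # Broadcast only along the query (tgt_len) dimension:
    # every query row receives the same key mask.
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)

    # Invert to an additive mask: 0 where attention is allowed,
    # a large negative value where the key is a padding token.
    inverted = 1.0 - expanded
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

mask = torch.tensor([[1, 1, 0, 0]])            # last two tokens are padding
out = expand_mask_sketch(mask, torch.float32)  # (1, 1, 4, 4)
print(out[0, 0])
# Rows 2 and 3 (the padding queries) look the same as rows 0 and 1:
# only the padding *columns* are masked, so padding tokens can still
# attend to the real tokens.
```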

Why is it done this way?
