Why can padding tokens attend to other tokens in masked self attention?

Traditionally, to create the self-attention mask, we zero out the columns for padding tokens so that they aren't attended to, and we also zero out the rows for padding tokens so that they don't attend to other tokens.
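
For concreteness, here is a minimal sketch of what I mean by that "traditional" mask, using a made-up 4-token example in PyTorch where both the padding rows and the padding columns end up zeroed:

```python
import torch

# pad_mask: (batch_size, seq_len) with 1 = real token, 0 = padding.
pad_mask = torch.tensor([[1, 1, 0, 0]])

# Outer product per batch element zeroes both rows and columns for padding.
full_mask = pad_mask[:, :, None] * pad_mask[:, None, :]  # (batch_size, seq_len, seq_len)
print(full_mask[0])
# tensor([[1, 1, 0, 0],
#         [1, 1, 0, 0],
#         [0, 0, 0, 0],
#         [0, 0, 0, 0]])
```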

However, in the implementation of _prepare_4d_attention_mask in transformers.modeling_attn_mask_utils, the original mask of shape (batch_size, seq_len) is broadcast to shape (batch_size, 1, tgt_len, seq_len).

As a result, when tgt_len is the same as seq_len, the rows corresponding to padding tokens (in the last two dimensions) are not zeroed out, so padding tokens are still allowed to attend to other tokens.
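
Below is a rough sketch of the broadcasting behavior I'm describing (not the actual library code, just equivalent logic as I understand it): the 2D mask is only expanded along the key dimension, so every query row, including the rows for padding queries, gets the same key mask.

```python
import torch

def expand_mask_sketch(mask, dtype, tgt_len=None):
    # mask: (batch_size, seq_len) with 1 = real token, 0 = padding.
    bsz, src_len = mask.shape
    tgt_len = tgt_len if tgt_len is not None else src_len

    # Broadcast only along the query (tgt_len) dimension:
    # every query row receives the same key mask.
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)

    # Invert to an additive mask: 0 where attention is allowed,
    # a large negative value where the key is a padding token.
    inverted = 1.0 - expanded
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

mask = torch.tensor([[1, 1, 0, 0]])            # last two tokens are padding
out = expand_mask_sketch(mask, torch.float32)  # (1, 1, 4, 4)
print(out[0, 0])
# Rows 2 and 3 (the padding queries) look the same as rows 0 and 1:
# only the padding *columns* are masked, so padding tokens can still
# attend to the real tokens.
```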

Why is it done this way?
