Traditionally, to create the self-attention mask, we zero out the columns for padding tokens so they aren't attended to, and we also zero out the rows for padding tokens to prevent them from attending to other tokens.
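For concreteness, here is a minimal sketch of that "traditional" symmetric mask (my own illustrative example, not code from the library), where both the rows and the columns of padding positions end up zeroed:

```python
import torch

# One sequence of length 4; the last token is padding.
attention_mask = torch.tensor([[1, 1, 1, 0]])  # (batch_size, seq_len)

# Outer product zeros out both the padding column (not attended to)
# and the padding row (does not attend to others).
symmetric = attention_mask[:, :, None] * attention_mask[:, None, :]  # (batch, seq, seq)
print(symmetric[0])
# tensor([[1, 1, 1, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0],
#         [0, 0, 0, 0]])   <- padding row is zeroed as well
```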
However, in the implementation of `_prepare_4d_attention_mask` in `transformers.modeling_attn_mask_utils`, the original mask of shape `(batch_size, seq_len)` is broadcast to a shape of `(batch_size, 1, tgt_len, seq_len)`. As a result, even if we set `tgt_len` equal to `seq_len`, the rows corresponding to padding tokens (in the last two dimensions) are not zeroed out, and thus padding tokens are allowed to attend to other tokens.
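A rough sketch of the broadcast described above (not the library's exact code, and ignoring the dtype conversion and additive-mask inversion the real function performs): the 2D mask is only expanded along the key/source dimension, so every query row, including the padding rows, gets the same column mask.

```python
import torch

mask = torch.tensor([[1, 1, 1, 0]])  # (batch_size, seq_len), last token is padding
bsz, src_len = mask.shape
tgt_len = src_len  # self-attention case: tgt_len == seq_len

# Expand only along the target/query dimension; no row is zeroed out.
expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len)  # (bsz, 1, tgt, src)
print(expanded[0, 0])
# tensor([[1, 1, 1, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 0]])   <- the padding query row can still attend to real tokens
```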
Why is it done this way?