Hello everyone,
I can’t understand the role of the attention_mask argument in transformers.BertModel. I mean, when we pad a sequence, the tokenizer adds zeros to the end of the sentence (and zero is specific to padding). Why do we have to pass attention_mask explicitly when the zeros obviously mark the padding (and will even be cancelled out during the matrix multiplications in the encoder layers)?
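Here is a minimal sketch of the situation I mean, assuming the standard bert-base-uncased checkpoint (where the pad token id happens to be 0):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Pad two sentences of different lengths to the same length.
batch = tokenizer(
    ["Hello world", "A somewhat longer sentence that forces padding"],
    padding=True,
    return_tensors="pt",
)

print(batch["input_ids"])       # the short sentence is padded with 0s at the end
print(batch["attention_mask"])  # 1 for real tokens, 0 for the padding positions

# Why pass attention_mask explicitly, if the 0s in input_ids already mark padding?
outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
```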