Is attention_mask implemented correctly in BERT?

I was browsing through the Bert model code and noticed that the attention_mask is implemented as a simple addition:
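For context, this is the addition I mean, from BertSelfAttention.forward in modeling_bert.py (paraphrased; the exact comments and line numbers vary between versions):

```python
if attention_mask is not None:
    # attention_mask here is the extended mask precomputed elsewhere in the model
    attention_scores = attention_scores + attention_mask
```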

This looks strange to me because in the original implementation they map 0’s to -10000 (pre-softmax):
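The lines I mean from the original TensorFlow implementation (google-research/bert, modeling.py) look roughly like this (quoted from memory, so not necessarily verbatim):

```python
# attention_mask is 1.0 for positions to attend to and 0.0 for masked positions,
# so adder becomes 0.0 for attended positions and -10000.0 for masked ones.
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

# Adding this to the raw scores before the softmax is effectively
# the same as removing the masked positions entirely.
attention_scores += adder
```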

I searched through the file but couldn’t find an equivalent mapping, so it looks like the attention mask is just being added to the logits directly. Am I missing something?

BertModel.forward calls into ModuleUtilsMixin.get_extended_attention_mask to build the extended attention mask, which inverts the values in this line:

extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min
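A minimal sketch of what that produces, assuming a batch of one sequence with a single padding token (illustrative only):

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0]])  # 1 = attend, 0 = padding
dtype = torch.float32

# Broadcastable shape [batch, 1, 1, seq_len], as get_extended_attention_mask produces
extended_attention_mask = attention_mask[:, None, None, :].to(dtype)
extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min

# -> 0 for the attended tokens, torch.finfo(torch.float32).min (~ -3.4e38)
#    for the padded token, so its softmax weight is driven to zero.
print(extended_attention_mask)
```

So the mapping to a large negative value is still there; it just happens once in get_extended_attention_mask rather than inside the attention sub-layer, and the sub-layer only has to do the addition.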

Thanks! I didn’t realize I was just looking at the sub-layer and not the entire model.

For anyone finding this in the future, here are the two relevant lines:
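Roughly (exact line numbers shift between transformers versions):

```python
# modeling_utils.py, ModuleUtilsMixin.get_extended_attention_mask:
# turns the 0/1 mask into 0.0 for tokens to attend to and a huge negative
# number for padding tokens
extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min

# modeling_bert.py, BertSelfAttention.forward:
# the huge negative values push the masked positions to ~0 after the softmax
attention_scores = attention_scores + attention_mask
```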