Why does Bart decoder's attention mask mark relevant indices with 0 instead of 1?


When we don’t pass decoder_attention_mask to BartModel, the model automatically creates a causal decoder mask with _make_causal_mask.

I’ve noticed that the method inserts 0 at mask positions corresponding to indices the model should attend to, and -inf at positions corresponding to indices to be ignored. Below is the link to the aforementioned code:
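To make the shape of that mask concrete, here is a minimal sketch of how such a causal mask can be built (the function name make_causal_mask is my own; the actual implementation in transformers differs in details such as dtype and batch handling):

```python
import torch

def make_causal_mask(seq_len: int) -> torch.Tensor:
    # 0.0 where key position j <= query position i (attend),
    # -inf where j > i (future positions, to be ignored)
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)  # keep -inf strictly above the diagonal

mask = make_causal_mask(4)
# row i has 0.0 up to and including column i, -inf afterwards
```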

As far as I know, attention masks should have 1 at the indices we want to attend to. Could anyone shed some light on this?

Further investigation shows this behavior is intended: the attention mask is added to the attention weights, so a mask value of 0 leaves the corresponding weights unchanged while a mask value of -inf “masks out” those positions. (related code pasted below)

However, shouldn’t the encoder attention mask be initialized the same way (0 for relevant inputs, -inf for padding inputs)?

Currently, the documentation says encoder attention mask values should be 1 for relevant inputs and 0 for padding inputs.
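My current understanding of how the two conventions are reconciled: the user-facing 1/0 mask is converted internally to the additive 0/-large form before being added to the attention weights. A sketch of that conversion (to_additive_mask is my own name for illustration; transformers has its own internal helper for this):

```python
import torch

def to_additive_mask(attention_mask: torch.Tensor,
                     dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # user-facing convention: 1 = attend, 0 = padding
    # internal convention:    0.0 = attend, large negative = padding
    inverted = 1.0 - attention_mask.to(dtype)
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

pad_mask = torch.tensor([[1, 1, 0]])   # last token is padding
additive = to_additive_mask(pad_mask)  # [[0.0, 0.0, <large negative>]]
```

So the documented 1/0 convention and the 0/-inf values seen inside the model are two views of the same mask, one before and one after this conversion.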