When we don’t pass decoder_attention_mask to BartModel, the model automatically creates a decoder attention mask with _make_causal_mask.
I’ve noticed that the method puts 0 in mask positions corresponding to indices the model should attend to, and -inf in positions corresponding to indices to be ignored. Below is the link to the aforementioned code:
As far as I know, attention masks should have 1 at the indices we want to attend to. Could anyone shed some light on this?
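To make the confusion concrete, here is a minimal sketch of the kind of causal mask I mean (modeled on what _make_causal_mask produces, with shapes simplified and using torch.finfo(dtype).min as the “-inf” value):

```python
import torch

def make_causal_mask(tgt_len: int, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # Start with every position "masked out" (large negative value, effectively -inf).
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min)
    # Unmask (set to 0) each position's own token and everything before it.
    cond = torch.arange(tgt_len)
    mask.masked_fill_(cond < (cond + 1).view(tgt_len, 1), 0.0)
    return mask

print(make_causal_mask(4))
# tensor([[ 0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38],
#         [ 0.0000e+00,  0.0000e+00, -3.4028e+38, -3.4028e+38],
#         [ 0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38],
#         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]])
```

So positions to attend to get 0, not 1, which is the opposite of the 1/0 convention I expected.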
Further investigation shows this behavior is intended: the attention mask is added to the attention scores before the softmax, so a mask value of 0 leaves a score unchanged while a value of -inf “masks out” that position. (related code pasted below)
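A tiny self-contained demo of why the additive convention works (the score and mask values here are made up for illustration):

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.5]])  # raw attention scores for one query
mask = torch.tensor([[0.0, 0.0, torch.finfo(torch.float32).min]])  # last position masked

# Mask is added to the scores before softmax: 0 preserves a score,
# -inf drives the corresponding softmax weight to 0.
weights = torch.softmax(scores + mask, dim=-1)
print(weights)  # tensor([[0.7311, 0.2689, 0.0000]])
```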
However, shouldn’t the encoder attention mask then be constructed the same way (0 for relevant inputs, -inf for padding inputs)?
Currently, the documentation says encoder attention mask values should be 1 for relevant inputs and 0 for padding inputs.
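For reference, a 1/0 mask of the documented kind can be turned into the additive 0/-inf form with a small conversion. This is just a sketch of that conversion (along the lines of what a helper like _expand_mask might do; I’m not claiming this is the exact internal implementation):

```python
import torch

def expand_mask(mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # mask: [batch, seq_len] with 1 for real tokens, 0 for padding (documented convention).
    inverted = 1.0 - mask.to(dtype)  # padding positions become 1, real tokens become 0
    # Fill the padding positions with a large negative value (effectively -inf).
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

attention_mask = torch.tensor([[1, 1, 1, 0]])  # last token is padding
print(expand_mask(attention_mask))
# tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38]])
```

So is the idea that the user-facing 1/0 mask is converted internally into the additive form, and that’s why the documentation and _make_causal_mask use different conventions?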