When we don’t pass decoder_attention_mask to BartModel, the model automatically creates a decoder attention mask with _make_causal_mask.
I’ve noticed that the method puts 0 in mask positions corresponding to indices the model should attend to, and -inf in positions corresponding to indices to be ignored. Below is the link to the aforementioned code:
As far as I know, attention masks should have 1 at the indices we want to attend to. Could anyone shed some light on this?
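To make the confusion concrete, here is a minimal sketch of the kind of causal mask I mean (modeled on what _make_causal_mask produces, with shapes simplified and using torch.finfo(dtype).min as the “-inf” value):

```python
import torch

def make_causal_mask(tgt_len: int, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # Start with every position "masked out" (large negative value, effectively -inf).
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min)
    # Unmask (set to 0) each position's own token and everything before it.
    cond = torch.arange(tgt_len)
    mask.masked_fill_(cond < (cond + 1).view(tgt_len, 1), 0.0)
    return mask

print(make_causal_mask(4))
# tensor([[ 0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38],
#         [ 0.0000e+00,  0.0000e+00, -3.4028e+38, -3.4028e+38],
#         [ 0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38],
#         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]])
```

So positions to attend to get 0, not 1, which is the opposite of the 1/0 convention I expected.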
Further investigation shows this behavior is intended: the attention mask is added to the attention scores before the softmax, so a mask value of 0 leaves a score unchanged while a value of -inf “masks out” that position. (related code pasted below)
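A tiny self-contained demo of why the additive convention works (the score and mask values here are made up for illustration):

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.5]])  # raw attention scores for one query
mask = torch.tensor([[0.0, 0.0, torch.finfo(torch.float32).min]])  # last position masked

# Mask is added to the scores before softmax: 0 preserves a score,
# -inf drives the corresponding softmax weight to 0.
weights = torch.softmax(scores + mask, dim=-1)
print(weights)  # tensor([[0.7311, 0.2689, 0.0000]])
```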
However, shouldn’t the encoder attention mask then be constructed the same way (0 for relevant inputs, -inf for padding inputs)?
Currently, the documentation says encoder attention mask values should be 1 for relevant inputs and 0 for padding inputs.
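For reference, a 1/0 mask of the documented kind can be turned into the additive 0/-inf form with a small conversion. This is just a sketch of that conversion (along the lines of what a helper like _expand_mask might do; I’m not claiming this is the exact internal implementation):

```python
import torch

def expand_mask(mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # mask: [batch, seq_len] with 1 for real tokens, 0 for padding (documented convention).
    inverted = 1.0 - mask.to(dtype)  # padding positions become 1, real tokens become 0
    # Fill the padding positions with a large negative value (effectively -inf).
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

attention_mask = torch.tensor([[1, 1, 1, 0]])  # last token is padding
print(expand_mask(attention_mask))
# tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00, -3.4028e+38]])
```

So is the idea that the user-facing 1/0 mask is converted internally into the additive form, and that’s why the documentation and _make_causal_mask use different conventions?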