Is attention_mask implemented correctly in BERT?

bjlkeng · March 19, 2023, 3:00am

I was browsing through the Bert model code and noticed that the attention_mask is implemented as a simple addition:

huggingface/transformers/blob/v4.27.1/src/transformers/models/bert/modeling_bert.py#L350


      
                  relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                  attention_scores = attention_scores + relative_position_scores
              elif self.position_embedding_type == "relative_key_query":
                  relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                  relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                  attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

          attention_scores = attention_scores / math.sqrt(self.attention_head_size)
          if attention_mask is not None:
              # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
              attention_scores = attention_scores + attention_mask

          # Normalize the attention scores to probabilities.
          attention_probs = nn.functional.softmax(attention_scores, dim=-1)

          # This is actually dropping out entire tokens to attend to, which might
          # seem a bit unusual, but is taken from the original Transformer paper.
          attention_probs = self.dropout(attention_probs)

          # Mask heads if we want to
          if head_mask is not None:

This looks strange to me because in the original implementation they map 0’s to -10000 (pre-softmax):

github.com

google-research/bert/blob/master/modeling.py#L712


      
          attention_scores = tf.multiply(attention_scores,
                                         1.0 / math.sqrt(float(size_per_head)))

          if attention_mask is not None:
            # `attention_mask` = [B, 1, F, T]
            attention_mask = tf.expand_dims(attention_mask, axis=[1])

            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
            # masked positions, this operation will create a tensor which is 0.0 for
            # positions we want to attend and -10000.0 for masked positions.
            adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

            # Since we are adding it to the raw scores before the softmax, this is
            # effectively the same as removing these entirely.
            attention_scores += adder

          # Normalize the attention scores to probabilities.
          # `attention_probs` = [B, N, F, T]
          attention_probs = tf.nn.softmax(attention_scores)

          # This is actually dropping out entire tokens to attend to, which might

I searched through the file but couldn’t find an equivalent mapping so it kind of looks like it’s just adding the attention mask to the logits. Am I missing something?
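
To make the concern concrete, here is a small pure-Python sketch. Plain lists stand in for tensors, and the `softmax` helper and the sample scores are illustrative assumptions, not code from either repository. It compares adding the raw 1/0 mask to the logits with adding the 0 / -10000 adder from the original implementation:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]   # illustrative attention logits for one query
mask = [1.0, 1.0, 0.0]     # 1 = attend, 0 = padding

# Adding the raw 1/0 mask only shifts the logits; the padded position
# still receives probability mass.
raw_added = softmax([s + m for s, m in zip(scores, mask)])

# Original BERT style: 0.0 where attending, -10000.0 where masked.
adder = [(1.0 - m) * -10000.0 for m in mask]
masked = softmax([s + a for s, a in zip(scores, adder)])

print(raw_added)
print(masked)
```

If the layer really received the raw 1/0 mask, the last position would keep a nonzero weight, so the simple addition only makes sense if the mask has already been converted somewhere upstream.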

kakahw · November 12, 2023, 12:55am

BertModel.forward calls ModuleUtilsMixin.get_extended_attention_mask to convert the attention mask before it reaches the attention layers. The values are inverted on this line:

extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min
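
As a rough illustration of why the later addition is enough, here is a pure-Python sketch of the same conversion. Lists stand in for tensors, `-sys.float_info.max` stands in for `torch.finfo(dtype).min`, and `extend_mask`, `softmax`, and the sample scores are hypothetical names and values chosen for this example, not the library's API:

```python
import math
import sys

# Stand-in for torch.finfo(dtype).min: the most negative finite Python float.
DTYPE_MIN = -sys.float_info.max

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def extend_mask(mask):
    """Map 1.0 (attend) to 0.0 and 0.0 (masked) to DTYPE_MIN."""
    return [(1.0 - m) * DTYPE_MIN for m in mask]

scores = [2.0, 1.0, 0.5]   # illustrative attention logits for one query
mask = [1.0, 1.0, 0.0]     # 1 = attend, 0 = padding

extended = extend_mask(mask)

# This mirrors the plain addition seen inside the attention layer.
probs = softmax([s + e for s, e in zip(scores, extended)])

# Reference: softmax over only the positions that were not masked.
reference = softmax(scores[:2])

print(probs)
print(reference)
```

In this sketch the masked position ends up with zero weight and the remaining weights match a softmax over the unmasked positions alone, which is what the comment in the source means by "effectively the same as removing these entirely".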

bjlkeng · November 12, 2023, 4:05pm

Thanks! I didn’t realize I was just looking at the sub-layer and not the entire model.

For those in the future, here are the two lines that are relevant:

github.com

huggingface/transformers/blob/v4.27.1/src/transformers/models/bert/modeling_bert.py#L993


      
          if token_type_ids is None:
              if hasattr(self.embeddings, "token_type_ids"):
                  buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
                  buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
                  token_type_ids = buffered_token_type_ids_expanded
              else:
                  token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
          
          # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
          # ourselves in which case we just need to make it broadcastable to all heads.
          extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
          
          # If a 2D or 3D attention mask is provided for the cross-attention
          # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
          if self.config.is_decoder and encoder_hidden_states is not None:
              encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
              encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
              if encoder_attention_mask is None:
                  encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
              encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
          else:

github.com

huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L935


      
                  raise ValueError(
                      f"Wrong shape for input_ids (shape {input_shape}) or attention_mask (shape {attention_mask.shape})"
                  )
          
              # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
              # masked positions, this operation will create a tensor which is 0.0 for
              # positions we want to attend and the dtype's smallest value for masked positions.
              # Since we are adding it to the raw scores before the softmax, this is
              # effectively the same as removing these entirely.
              extended_attention_mask = extended_attention_mask.to(dtype=dtype)  # fp16 compatibility
              extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min
              return extended_attention_mask
          
          def get_head_mask(
              self, head_mask: Optional[Tensor], num_hidden_layers: int, is_attention_chunked: bool = False
          ) -> Tensor:
              """
              Prepare the head mask if needed.
          
              Args:
                  head_mask (`torch.Tensor` with shape `[num_heads]` or `[num_hidden_layers x num_heads]`, *optional*):
