ronen
February 27, 2022, 6:00pm
#1
Hi,
I’m interested in applying self-attention masking in the T5 encoder, i.e. passing a tensor of shape (batch, max_input_len, max_input_len): a binary matrix for each example in the batch specifying which token pairs (i, j) in the input may attend to each other.
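For concreteness, here is a toy sketch (PyTorch, with made-up segment boundaries) of the kind of mask I mean:

import torch

# Toy example: batch of 1, sequence of 6 tokens split into two "segments"
# (the segment ids are made up purely for illustration).
# Goal: a binary mask of shape (batch, seq_len, seq_len) where
# mask[b, i, j] == 1 means token i may attend to token j.
segment_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])  # (batch, seq_len)
attn_mask_3d = (segment_ids.unsqueeze(2) == segment_ids.unsqueeze(1)).long()
print(attn_mask_3d.shape)  # torch.Size([1, 6, 6])
print(attn_mask_3d[0])     # block-diagonal: tokens attend only within their own segment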
This idea is explored for a different Transformer architecture in the paper “Improving Compositional Generalization in Classification Tasks via Structure Annotations”, where it is implemented in TensorFlow (the paper includes a figure illustrating the masking scheme).
In contrast, I’d like to use the PyTorch T5 implementation in Transformers if possible. The option seems to be available for BERT (this excerpt is from BertSelfAttention in modeling_bert.py):
    self.is_decoder = config.is_decoder

def transpose_for_scores(self, x):
    new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    x = x.view(new_x_shape)
    return x.permute(0, 2, 1, 3)

def forward(
    self,
    hidden_states,
    attention_mask=None,
    head_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_value=None,
    output_attentions=False,
):
    mixed_query_layer = self.query(hidden_states)

    # If this is instantiated as a cross-attention module, the keys
    # and values come from an encoder; the attention mask needs to be
    # such that the encoder's padding tokens are not attended to.
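If I read the BERT code correctly, get_extended_attention_mask already accepts a 3D mask of shape (batch, from_seq_len, to_seq_len), so something like this untested sketch looks possible for BERT:

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a short example", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# All-ones here just to show the shape; in my case this would encode
# which token pairs are allowed to attend to each other.
attn_mask_3d = torch.ones(1, seq_len, seq_len, dtype=torch.long)

outputs = model(input_ids=inputs["input_ids"], attention_mask=attn_mask_3d)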
But for T5 I haven’t been able to find an equivalent; the encoder_attention_mask there has shape (batch_size, encoder_seq_len). This excerpt is from T5Stack.forward in modeling_t5.py:
# required mask seq length can be calculated via length of past
mask_seq_length = past_key_values[0][0].shape[2] + seq_length if past_key_values is not None else seq_length

if use_cache is True:
    assert self.is_decoder, f"`use_cache` can only be set to `True` if {self} is used as a decoder"

if attention_mask is None:
    attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)
if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:
    encoder_seq_length = encoder_hidden_states.shape[1]
    encoder_attention_mask = torch.ones(
        batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long
    )

# initialize past_key_values with `None` if past does not exist
if past_key_values is None:
    past_key_values = [None] * len(self.block)

# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)
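Concretely, what I’d like to end up with is something along these lines (hypothetical sketch; I don’t know whether the T5 encoder actually accepts a 3D attention_mask like this):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
seq_len = enc["input_ids"].shape[1]

# Hypothetical: a (batch, seq_len, seq_len) binary self-attention mask
# for the encoder instead of the usual (batch, seq_len) padding mask.
attn_mask_3d = torch.ones(1, seq_len, seq_len, dtype=torch.long)

labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")["input_ids"]
outputs = model(input_ids=enc["input_ids"], attention_mask=attn_mask_3d, labels=labels)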
Anyone have experience with this?
Thanks!