Copying mechanism for transformer

Hello, HF community members.
I wonder what you think about the copying mechanism for transformers.

I can find very few papers/tech reports that implement a copying mechanism for transformers.

Also, I couldn’t find anyone discussing the copying mechanism in this forum.

Personally, I am stuck on computing the ‘generating-copying switch’, since the transformer does not have an explicit ‘context vector’ the way an RNN decoder does.
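
For reference, in the RNN-based pointer-generator of See et al. (2017) the switch is computed from the attention context vector h_t*, the decoder state s_t, and the decoder input x_t:

p_gen = sigmoid(w_h*^T h_t* + w_s^T s_t + w_x^T x_t + b_ptr)

so it is not obvious to me what should play the role of h_t* in a transformer decoder.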

Do you have any thoughts on why there is so little reference/discussion of the copying mechanism?
Would it be worth implementing a copying mechanism and contributing it to the HF community?


Hi,
I have tried a copy mechanism with the BART model. I directly use the cross-attention weights as the attention scores over the source tokens, following the idea of the OpenNMT CopyGenerator.
My implementation looks like this:

def copy_mechanism_v3(self, logits, cross_attentions, decoder_hidden_states, encoder_input_ids):
    # last decoder hidden state, shape: (batch_size, decoder_length, hidden_size)
    last_hidden_state = decoder_hidden_states[-1]
    # cross-attention of the last decoder layer, shape: (batch_size, num_heads, decoder_length, encoder_length)
    last_attention_weight = cross_attentions[-1]

    # generating-copying switch, shape: (batch_size, decoder_length, 1)
    # self.linear_copy is a nn.Linear(hidden_size, 1)
    p_copy = torch.sigmoid(self.linear_copy(last_hidden_state))

    # probability of generating from the vocabulary
    previous_word_pro = torch.softmax(logits, dim=-1) * (1 - p_copy)
    # probability of copying each source token: cross-attention averaged over heads
    encoder_word_attention = p_copy * torch.mean(last_attention_weight, dim=1)

    # do not copy the pad token (BART's pad token id is 1)
    mask = torch.where(encoder_input_ids == 1,
                       encoder_word_attention.new_zeros(encoder_input_ids.shape),
                       encoder_word_attention.new_ones(encoder_input_ids.shape))
    encoder_word_attention = encoder_word_attention * mask.unsqueeze(1)

    # scatter the copy probabilities back onto the vocabulary dimension
    personal_words = encoder_input_ids.unsqueeze(1).repeat(1, encoder_word_attention.shape[1], 1)
    word_pro = torch.scatter_add(previous_word_pro, 2, personal_words, encoder_word_attention)
    return word_pro
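
For anyone who wants to try this end to end, here is a minimal sketch of how the method could be wired into a BartForConditionalGeneration forward pass. The wrapper name CopyBart, the checkpoint name, and the loss handling are my own assumptions, not part of the snippet above; linear_copy is assumed to be a single nn.Linear(d_model, 1), and labels are assumed to use -100 for padded positions as usual in HF.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BartForConditionalGeneration

class CopyBart(nn.Module):  # hypothetical wrapper class
    def __init__(self, checkpoint="facebook/bart-base"):
        super().__init__()
        self.model = BartForConditionalGeneration.from_pretrained(checkpoint)
        # gate that decides how much probability mass goes to copying
        self.linear_copy = nn.Linear(self.model.config.d_model, 1)

    # copy_mechanism_v3 from the snippet above is assumed to be defined
    # as a method of this class.

    def forward(self, input_ids, attention_mask, decoder_input_ids, labels=None):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            output_attentions=True,        # needed for cross_attentions
            output_hidden_states=True,     # needed for decoder_hidden_states
            return_dict=True,
        )
        word_pro = self.copy_mechanism_v3(
            outputs.logits,
            outputs.cross_attentions,
            outputs.decoder_hidden_states,
            input_ids,
        )
        loss = None
        if labels is not None:
            # word_pro is already a probability distribution, so train with NLL on its log
            log_probs = torch.log(word_pro.clamp_min(1e-9))
            loss = F.nll_loss(
                log_probs.view(-1, log_probs.size(-1)),
                labels.view(-1),
                ignore_index=-100,
            )
        return loss, word_pro

Note that model.generate() would still decode from the plain LM head logits, so copy-aware decoding needs a custom decoding loop that calls this forward instead.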