Copying mechanism for transformer

Hello, HF community members,
I wonder what you think about the copying mechanism for transformers.

I can find very few papers/tech reports that implement a copying mechanism for transformers.

Also, I couldn’t find anyone discussing the copying mechanism in this forum.

Personally, I am stuck on computing the ‘generating-copying switch’, since a transformer does not have the explicit ‘context vector’ that an RNN has.

Do you have any thoughts on why there is so little reference/discussion of the copying mechanism?
Is it worth implementing a copying mechanism and contributing it to the HF community?


Hi,
I have tried a copy mechanism with the BART model. I directly use the cross-attention as the attention scores over the source tokens. The idea comes from the OpenNMT CopyGenerator.
My implementation looks like this:

def copy_mechanism_v3(self, logits, cross_attentions, decoder_hidden_states, encoder_input_ids):
    # decoder_hidden_states[-1]: (batch_size, target_len, hidden_size)
    last_hidden_state = decoder_hidden_states[-1]
    # cross_attentions[-1]: (batch_size, num_heads, target_len, source_len)
    last_attention_weight = cross_attentions[-1]

    # copy gate per target position: (batch_size, target_len, 1)
    p_copy = torch.sigmoid(self.linear_copy(last_hidden_state))

    # generation distribution, scaled by (1 - p_copy): (batch_size, target_len, vocab_size)
    previous_word_pro = torch.softmax(logits, dim=-1) * (1 - p_copy)

    # copy distribution over source positions: average the cross-attention over
    # heads and scale by p_copy -> (batch_size, target_len, source_len)
    encoder_word_attention = p_copy * torch.mean(last_attention_weight, dim=1)

    # do not copy the pad token (id 1 for BART)
    mask = torch.where(encoder_input_ids == 1,
                       encoder_word_attention.new_zeros(encoder_input_ids.shape),
                       encoder_word_attention.new_ones(encoder_input_ids.shape))
    encoder_word_attention = encoder_word_attention * mask.unsqueeze(1)

    # scatter the copy probabilities onto the vocabulary:
    # word_pro[b, t, encoder_input_ids[b, s]] += encoder_word_attention[b, t, s]
    personal_words = encoder_input_ids.unsqueeze(1).repeat(1, encoder_word_attention.shape[1], 1)
    word_pro = torch.scatter_add(previous_word_pro, 2, personal_words, encoder_word_attention)
    return word_pro
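
For context, the inputs all come from a normal forward pass with output_attentions=True and output_hidden_states=True. A simplified sketch of the wiring (not my exact training code; self.linear_copy is assumed to be an nn.Linear(config.d_model, 1) defined in __init__, and the loss shown is just one option):

import torch
import torch.nn.functional as F

# forward pass of a BartForConditionalGeneration wrapped in self.model
outputs = self.model(
    input_ids=encoder_input_ids,
    attention_mask=attention_mask,
    labels=labels,
    output_attentions=True,        # needed for outputs.cross_attentions
    output_hidden_states=True,     # needed for outputs.decoder_hidden_states
)
word_pro = self.copy_mechanism_v3(
    outputs.logits,
    outputs.cross_attentions,
    outputs.decoder_hidden_states,
    encoder_input_ids,
)
# word_pro is already a probability distribution over the vocabulary, so one
# option is NLL on its log instead of cross-entropy on the raw logits
loss = F.nll_loss(
    torch.log(word_pro + 1e-9).view(-1, word_pro.size(-1)),
    labels.view(-1),
    ignore_index=-100,
)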

Hi, this looks interesting! Can you share more about where exactly you use this function during the training process? For example, with reference to this file: transformers/run_summarization.py at master · huggingface/transformers · GitHub

Thank you! @bigheiniu

Hi, this is possibly a bit late, but I was working on adding the copy mechanism to MBart and released a gist: https://gist.github.com/jogonba2/ff9233023a406a45c655bbe090e3b05b

I never get better results using the copy mechanism. Most of the time, using only the pretrained model without the copy mechanism works slightly better. I’m trying to further pretrain MBartHez along with the copy mechanism to see what happens. Also, there are some weird things:

  1. In my experiments, p_gen is almost always between 0.97 and 0.99, so the final distribution (copy+gen) is very similar to the decoder’s distribution (gen), even in extractive tasks (see the toy numbers after this list).
  2. During inference, the generate method gives a different output than trainer.predict.
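
Just to illustrate point 1 with toy numbers, this is why the mixture barely moves when p_gen is that high:

import torch

# toy numbers only: with p_gen ~ 0.98 the copy distribution contributes about
# 2% of the probability mass, so the mixture stays very close to the
# generation distribution even when the copy distribution strongly disagrees
p_gen = 0.98
p_vocab = torch.tensor([0.70, 0.20, 0.10])  # generation distribution (toy)
p_copy = torch.tensor([0.00, 0.00, 1.00])   # copy distribution (toy)
p_final = p_gen * p_vocab + (1 - p_gen) * p_copy
print(p_final)  # tensor([0.6860, 0.1960, 0.1180]) -> argmax unchanged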

The background for the implementation is this paper: https://aclanthology.org/2020.acl-main.125.pdf. There is more information in the code comments.

Hope it helps!

Hey @jogonba2, have you tried to verify the implementation of the copy mechanism? For example, by using only the copy distribution (forcing p_gen to 0) and training and testing the model on the simple task of just copying the complete input to the output?

I’m currently trying to add the copy mechanism to T5, and my model is not able to do this yet.
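
What I mean by the p_gen=0 check, as an isolated toy example (no model involved, just the scatter step that builds the copy distribution):

import torch

# toy check of the pure-copy path (p_gen forced to 0): if the cross-attention
# for target step t points exactly at source position t, scattering it onto
# the vocabulary should put all probability mass on the source token at t
encoder_input_ids = torch.tensor([[5, 2, 7, 3]])  # (batch=1, source_len=4)
batch, src_len = encoder_input_ids.shape
vocab_size = 10
attention = torch.eye(src_len).unsqueeze(0)       # (batch, target_len, source_len), target_len == source_len here
index = encoder_input_ids.unsqueeze(1).repeat(1, src_len, 1)
copy_dist = torch.zeros(batch, src_len, vocab_size).scatter_add_(2, index, attention)
assert torch.equal(copy_dist.argmax(-1), encoder_input_ids)  # greedy "decoding" reproduces the input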

Hi @tobigue.
I tested the p_gen=0 and p_gen=1 cases, and the final distribution is the copy or the generation distribution respectively, as expected. But I haven’t tested it on “fully extractive” tasks.

Also, I did a few experiments on my downstream task (keyword extraction) fixing p_gen to the percentage of novel words, and it seems to work better than learning the p_gen value. For some reason p_gen is almost always very close to 1, but I’m not sure whether that is a problem.

I think the implementation could be very similar for T5 models.
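
The main T5-specific detail I would watch for is the pad token id: T5 pads with 0, while BART/mBART pad with 1, so it is safer to take it from the config instead of hard-coding it. A rough sketch (reusing the variable names from the BART snippet earlier in the thread, which are assumed to already exist):

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
# same flags as for BART to get cross-attentions and decoder hidden states
outputs = model(
    input_ids=encoder_input_ids,
    attention_mask=attention_mask,
    labels=labels,
    output_attentions=True,
    output_hidden_states=True,
)
# do not hard-code `encoder_input_ids == 1` for the pad mask; read it from the config
pad_mask = encoder_input_ids.eq(model.config.pad_token_id)  # pad_token_id is 0 for T5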


Hi @jogonba2, your GitHub gist URL is not found (or was deleted). Can you check it or upload it again? Thank you very much.

Hi @hoangftran,

the gist was moved to another URL; this is the new one: https://gist.github.com/jogonba2/f67d129e254054a918bf428d2e35aca4

Thanks for letting me know!


I have studied your implementation.
It’s great. Thanks a lot.

After trying to re-implement it with an encoder-decoder model, I found there is a slicing (or indexing?) problem at line 144: the tensor e gets assigned -100 for almost all of its values.
I am not sure whether this happens with BART models; I use BERT models instead of BART ones.

I fixed it by:

        # put the source-token dimension right after the batch dimension so the
        # (batch, source_len) pad mask can be used for boolean indexing
        e = e.permute(0, 2, 1)
        e[(encoder_input_ids == self.config.pad_token_id),] = -100
        # permute back to the original layout
        e = e.permute(0, 2, 1)

Because I am not very familiar with the slicing methods, it looks a little dirty; maybe masked_fill would avoid the two permutes (sketched below), but I have not verified it.
Please let me know if there is any other better way to do it.
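
A sketch of the masked_fill version (assuming e has shape (batch, target_len, source_len), which is what the permutes above suggest; not verified end-to-end):

        # the (batch, 1, source_len) mask broadcasts over the target dimension,
        # so no permutes are needed
        pad_mask = (encoder_input_ids == self.config.pad_token_id).unsqueeze(1)
        e = e.masked_fill(pad_mask, -100)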

Besides, it may give better results when using next-token prediction.

How to elegantly add a copy mechanism (PGN) to Huggingface models? (如何在Huggingface模型中优雅地加入Copy机制(PGN)?) - 知乎 (zhihu.com)