Copying mechanism for transformer

Hello, HF community members,
I wonder what you think about the copying mechanism for transformers.

I can find very few papers/tech reports that implement a copying mechanism for transformers.

Also, I couldn’t find anyone discussing the copying mechanism in this forum.

Personally, I am stuck on computing the ‘generating-copying switch’, since a transformer does not have the explicit ‘context vector’ that an RNN has.

Do you have any thoughts on why there is so little reference/discussion of the copying mechanism?
Is it worth implementing a copying mechanism and contributing it to the HF community?


Hi,
I have tried a copy mechanism with the BART model. I directly use the cross-attention as the attention scores over the source tokens. The idea comes from the OpenNMT CopyGenerator.
My implementation looks like this:

def copy_mechanism_v3(self, logits, cross_attentions, decoder_hidden_states, encoder_input_ids):
    # decoder_hidden_states[-1]: (batch_size, target_len, hidden_size)
    last_hidden_state = decoder_hidden_states[-1]
    # cross_attentions[-1]: (batch_size, num_heads, target_len, source_len)
    last_attention_weight = cross_attentions[-1]

    # copy gate per target position: (batch_size, target_len, 1)
    p_copy = torch.sigmoid(self.linear_copy(last_hidden_state))

    # generation distribution, scaled by (1 - p_copy): (batch_size, target_len, vocab_size)
    previous_word_pro = torch.softmax(logits, dim=-1) * (1 - p_copy)

    # copy distribution over source positions: average the cross-attention over
    # heads and scale by p_copy -> (batch_size, target_len, source_len)
    encoder_word_attention = p_copy * torch.mean(last_attention_weight, dim=1)

    # do not copy the pad token (id 1 for BART)
    mask = torch.where(encoder_input_ids == 1,
                       encoder_word_attention.new_zeros(encoder_input_ids.shape),
                       encoder_word_attention.new_ones(encoder_input_ids.shape))
    encoder_word_attention = encoder_word_attention * mask.unsqueeze(1)

    # scatter the copy probabilities onto the vocabulary:
    # word_pro[b, t, encoder_input_ids[b, s]] += encoder_word_attention[b, t, s]
    personal_words = encoder_input_ids.unsqueeze(1).repeat(1, encoder_word_attention.shape[1], 1)
    word_pro = torch.scatter_add(previous_word_pro, 2, personal_words, encoder_word_attention)
    return word_pro
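
For context, the inputs all come from a normal forward pass with output_attentions=True and output_hidden_states=True. A simplified sketch of the wiring (not my exact training code; self.linear_copy is assumed to be an nn.Linear(config.d_model, 1) defined in __init__, and the loss shown is just one option):

import torch
import torch.nn.functional as F

# forward pass of a BartForConditionalGeneration wrapped in self.model
outputs = self.model(
    input_ids=encoder_input_ids,
    attention_mask=attention_mask,
    labels=labels,
    output_attentions=True,        # needed for outputs.cross_attentions
    output_hidden_states=True,     # needed for outputs.decoder_hidden_states
)
word_pro = self.copy_mechanism_v3(
    outputs.logits,
    outputs.cross_attentions,
    outputs.decoder_hidden_states,
    encoder_input_ids,
)
# word_pro is already a probability distribution over the vocabulary, so one
# option is NLL on its log instead of cross-entropy on the raw logits
loss = F.nll_loss(
    torch.log(word_pro + 1e-9).view(-1, word_pro.size(-1)),
    labels.view(-1),
    ignore_index=-100,
)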

Hi, this looks interesting! Can you share more about where exactly you use this function during the training process? For example, with reference to this file: transformers/run_summarization.py at master · huggingface/transformers · GitHub

Thank you! @bigheiniu

Hi, this is possibly a bit late, but I was working on adding the copy mechanism to MBart and released a gist: https://gist.github.com/jogonba2/ff9233023a406a45c655bbe090e3b05b

I never get better results using the copy mechanism. Most of the time, using only the pretrained model without the copy mechanism works slightly better. I’m trying to further pretrain MBartHez along with the copy mechanism to see what happens. Also, there are some weird things:

  1. In my experiments, p_gen is almost always between 0.97 and 0.99, so the final distribution (copy+gen) is very similar to the decoder’s distribution (gen), even in extractive tasks (see the toy numbers after this list).
  2. During inference, the generate method gives a different output than trainer.predict.
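
Just to illustrate point 1 with toy numbers, this is why the mixture barely moves when p_gen is that high:

import torch

# toy numbers only: with p_gen ~ 0.98 the copy distribution contributes about
# 2% of the probability mass, so the mixture stays very close to the
# generation distribution even when the copy distribution strongly disagrees
p_gen = 0.98
p_vocab = torch.tensor([0.70, 0.20, 0.10])  # generation distribution (toy)
p_copy = torch.tensor([0.00, 0.00, 1.00])   # copy distribution (toy)
p_final = p_gen * p_vocab + (1 - p_gen) * p_copy
print(p_final)  # tensor([0.6860, 0.1960, 0.1180]) -> argmax unchanged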

The background for the implementation is this paper: https://aclanthology.org/2020.acl-main.125.pdf. There is more information in the code comments.

Hope it helps!

Hey @jogonba2, have you tried to verify the implementation of the copy mechanism? For example, by using only the copy distribution (forcing p_gen to 0) and training and testing the model on the simple task of just copying the complete input to the output?

I’m currently trying to add the copy mechanism to T5, and my model is not able to do this yet.
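
What I mean by the p_gen=0 check, as an isolated toy example (no model involved, just the scatter step that builds the copy distribution):

import torch

# toy check of the pure-copy path (p_gen forced to 0): if the cross-attention
# for target step t points exactly at source position t, scattering it onto
# the vocabulary should put all probability mass on the source token at t
encoder_input_ids = torch.tensor([[5, 2, 7, 3]])  # (batch=1, source_len=4)
batch, src_len = encoder_input_ids.shape
vocab_size = 10
attention = torch.eye(src_len).unsqueeze(0)       # (batch, target_len, source_len), target_len == source_len here
index = encoder_input_ids.unsqueeze(1).repeat(1, src_len, 1)
copy_dist = torch.zeros(batch, src_len, vocab_size).scatter_add_(2, index, attention)
assert torch.equal(copy_dist.argmax(-1), encoder_input_ids)  # greedy "decoding" reproduces the input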

Hi @tobigue.
I tested the p_gen=0 and p_gen=1 cases, and the final distribution is the copy or the generation distribution respectively, as expected. But I haven’t tested it on “fully extractive” tasks.

Also, I did a few experiments on my downstream task (keyword extraction) fixing p_gen to the percentage of novel words, and it seems to work better than learning the p_gen value. For some reason p_gen is almost always very close to 1, but I’m not sure whether that is a problem.

I think the implementation could be very similar for T5 models.
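
The main T5-specific detail I would watch for is the pad token id: T5 pads with 0, while BART/mBART pad with 1, so it is safer to take it from the config instead of hard-coding it. A rough sketch (reusing the variable names from the BART snippet earlier in the thread, which are assumed to already exist):

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
# same flags as for BART to get cross-attentions and decoder hidden states
outputs = model(
    input_ids=encoder_input_ids,
    attention_mask=attention_mask,
    labels=labels,
    output_attentions=True,
    output_hidden_states=True,
)
# do not hard-code `encoder_input_ids == 1` for the pad mask; read it from the config
pad_mask = encoder_input_ids.eq(model.config.pad_token_id)  # pad_token_id is 0 for T5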


Hi @jogonba2, your GitHub gist URL is not found (or was deleted). Can you check it or upload it again? Thank you very much.

Hi @hoangftran,

the gist was moved to another URL; this is the new one: https://gist.github.com/jogonba2/f67d129e254054a918bf428d2e35aca4

Thanks for letting me know!


I have studied your implementation.
It’s great. Thanks a lot.

After trying to re-implement it with an encoder-decoder model, I found there is a slicing (or indexing?) problem at line 144: the tensor e gets assigned -100 for almost all of its values.
I am not sure whether this happens with BART models; I use BERT models instead of BART ones.

I fixed it by:

        # put the source-token dimension right after the batch dimension so the
        # (batch, source_len) pad mask can be used for boolean indexing
        e = e.permute(0, 2, 1)
        e[(encoder_input_ids == self.config.pad_token_id),] = -100
        # permute back to the original layout
        e = e.permute(0, 2, 1)

Because I am not very familiar with the slicing methods, it looks a little dirty; maybe masked_fill would avoid the two permutes (sketched below), but I have not verified it.
Please let me know if there is any other better way to do it.
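
A sketch of the masked_fill version (assuming e has shape (batch, target_len, source_len), which is what the permutes above suggest; not verified end-to-end):

        # the (batch, 1, source_len) mask broadcasts over the target dimension,
        # so no permutes are needed
        pad_mask = (encoder_input_ids == self.config.pad_token_id).unsqueeze(1)
        e = e.masked_fill(pad_mask, -100)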

Besides, it may give better results when using next-token prediction.

How to elegantly add a copy mechanism (PGN) to Huggingface models? (如何在Huggingface模型中优雅地加入Copy机制(PGN)?) - 知乎 (zhihu.com)