Hello, HF community members.
I wonder what you think about the copying mechanism for transformers.
I can find very few papers/tech reports implementing a copying mechanism for transformers.
Also, I couldn’t find anyone discussing the copying mechanism on this forum.
Personally, I am stuck on computing the ‘generating-copying switch’, since a transformer does not have the explicit ‘context vector’ that an RNN has.
Do you have any thoughts on why references/discussions of the copying mechanism are so scarce?
Would it be worth implementing a copying mechanism and contributing it to the HF community?
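To make the question concrete: the switch I have in mind would look roughly like the sketch below. It rebuilds a pseudo context vector as the cross-attention-weighted sum of encoder states and feeds it into a learned sigmoid gate, following pointer-generator networks. All names and the choice of gate inputs are my own assumptions, not an established recipe for transformers.

```python
import torch
import torch.nn as nn

class PGenSwitch(nn.Module):
    """Hypothetical generate/copy switch for a transformer decoder.

    A transformer has no single RNN-style context vector, so we rebuild
    one as the cross-attention-weighted sum of encoder hidden states.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(3 * d_model, 1)

    def forward(self, cross_attn, encoder_hidden, decoder_hidden, decoder_input_emb):
        # cross_attn:        (batch, tgt_len, src_len), e.g. averaged over heads
        # encoder_hidden:    (batch, src_len, d_model)
        # decoder_hidden:    (batch, tgt_len, d_model)
        # decoder_input_emb: (batch, tgt_len, d_model)
        context = torch.bmm(cross_attn, encoder_hidden)  # (batch, tgt_len, d_model)
        features = torch.cat([context, decoder_hidden, decoder_input_emb], dim=-1)
        return torch.sigmoid(self.gate(features))        # p_gen in (0, 1)
```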
Hi,
I have tried a copy mechanism in the BART model. I directly use the cross-attention weights as the copy scores over the source tokens. This idea is from the OpenNMT CopyGenerator.
My implementation is like this:
I never get better results using the copy mechanism. Most of the time, using only the pretrained model without the copy mechanism works slightly better. I’m trying further pretraining of MBartHez along with the copy mechanism to see what happens. Also, there are some weird things:
In my experiments, p_gen is almost always between 0.97 and 0.99, so the final distribution (copy + gen) is very similar to the distribution of the decoder (gen), even in extractive tasks.
During inference, the generate method gives a different output than trainer.predict.
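For reference, the cross-attention-based mixing described above can be sketched roughly as follows, in the style of the OpenNMT CopyGenerator. This is my own reconstruction with assumed shapes and names, not the actual posted code.

```python
import torch

def copy_generator_probs(gen_logits, cross_attn, src_ids, p_gen):
    """Mix generation and copy distributions, pointer-generator style.

    gen_logits: (batch, tgt_len, vocab_size) decoder output logits
    cross_attn: (batch, tgt_len, src_len) cross-attention weights
                (e.g. averaged over heads), used as copy scores
    src_ids:    (batch, src_len) encoder input token ids
    p_gen:      (batch, tgt_len, 1) generate/copy switch in (0, 1)
    """
    gen_probs = torch.softmax(gen_logits, dim=-1)
    # Renormalize in case the attention rows do not sum exactly to 1.
    copy_probs = cross_attn / cross_attn.sum(dim=-1, keepdim=True)
    # Scatter copy probability mass onto the vocabulary ids of the source tokens.
    index = src_ids.unsqueeze(1).expand(-1, cross_attn.size(1), -1)
    final = p_gen * gen_probs
    final = final.scatter_add(-1, index, (1.0 - p_gen) * copy_probs)
    return final
```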
Hey @jogonba2, have you tried to verify the implementation of the copy mechanism? For example, by using only the copy distribution (forcing p_gen to 0) and training and testing the model on the simple task of just copying the complete input to the output?
I’m currently trying to add the copy mechanism to T5, and my model is not able to do this yet.
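A cheap, model-independent version of that check: with p_gen forced to 0 and one-hot cross-attention, the mixed distribution must reproduce the source tokens exactly under greedy decoding. A toy sketch (the mixing formula is the generic pointer-generator one, with shapes I assume):

```python
import torch

# Sanity check: with p_gen = 0 and one-hot cross-attention, the mixed
# copy/generation distribution must be one-hot over the source tokens.
batch, src_len, vocab = 1, 4, 10
src_ids = torch.tensor([[3, 7, 1, 5]])                # (batch, src_len)
cross_attn = torch.eye(src_len).unsqueeze(0)          # step t attends source token t
gen_probs = torch.full((batch, src_len, vocab), 1.0 / vocab)
p_gen = torch.zeros(batch, src_len, 1)                # force pure copying

index = src_ids.unsqueeze(1).expand(-1, src_len, -1)
final = p_gen * gen_probs
final = final.scatter_add(-1, index, (1.0 - p_gen) * cross_attn)

# Greedy decoding should now reproduce the input exactly.
assert torch.equal(final.argmax(-1), src_ids)
```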
Hi @tobigue.
I tested the p_gen=0 and p_gen=1 cases, and the final distribution is the copy or the generation distribution respectively, as expected. But I haven’t tested it on “fully extractive” tasks.
Also, I did a few experiments on my downstream task (keyword extraction), fixing p_gen to the percentage of novel words, and it seems to work better than learning the p_gen value. For some reason p_gen is almost always very close to 1, but I’m not sure whether that is a problem.
I think the implementation could be very similar for T5 models.
I have studied your implementation.
It’s great. Thanks a lot.
After trying to re-implement it with an encoder-decoder model, I found a slicing (or indexing?) problem at line 144:
tensor e gets assigned -100 for almost all of its values.
I am not sure whether this happens with BART models;
I use BERT models instead of BART ones.
I fixed it by:
e = e.permute(0, 2, 1)
e[(encoder_input_ids == self.config.pad_token_id),] = -100
e = e.permute(0, 2, 1)
Because I am not very familiar with slicing methods, it looks a little dirty.
Please let me know if there is a better way to do it.
Besides, it might get better results using next-token prediction.
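For reference, one cleaner alternative to the permute/index/permute dance is masked_fill with a broadcast mask, assuming e has shape (batch, tgt_len, src_len) and encoder_input_ids has shape (batch, src_len); the toy values below are only for illustration:

```python
import torch

batch, tgt_len, src_len = 2, 3, 4
pad_token_id = 0
e = torch.randn(batch, tgt_len, src_len)
encoder_input_ids = torch.tensor([[5, 6, 0, 0],
                                  [7, 8, 9, 0]])

# Mask padded source positions without permuting: broadcast the
# (batch, 1, src_len) pad mask across the tgt_len dimension.
pad_mask = encoder_input_ids.eq(pad_token_id).unsqueeze(1)
e = e.masked_fill(pad_mask, -100.0)
```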