Force decoder to avoid repetition between generated sentences

I would like to fine-tune T5 for diverse paraphrase generation. For each original sentence, I would like to have several different paraphrases generated, but current results contain sentences very similar to each other.

Original Question ::
What is the expected close date of the opportunity
Paraphrased Questions Generated by T5::
0: What will be the expected close date of the opportunity?
1: What is the expected closing date for the opportunity that you are considering?
2: What is the expected close date of the opportunity?
3: What is the expected close date on the opportunity?
4: When would be the expected close date of the opportunity?

I tried adding a diversity measure during training, but was told it would not work.

Thus, I want to directly force the decoder to avoid repeating n-grams between generated sentences at test time. The generate() function has two relevant parameters: repetition_penalty and no_repeat_ngram_size. I checked the paper and the source code and, if I understand correctly, they only avoid repetition within a single beam rather than between sentences. Unsurprisingly, I tried different values of the two parameters and saw no effect.
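
To illustrate why no_repeat_ngram_size has no effect across outputs, here is a minimal pure-Python sketch of the kind of check it performs, as far as I understand it. The function name `banned_next_tokens` and the word-level tokens are assumptions made for illustration; this is not the library's code. The ban is computed from one hypothesis only, so an n-gram used in one output is still allowed in another.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`.

    Illustrative helper (hypothetical name): only the history of this one
    hypothesis is inspected, never the other beams or returned sequences.
    """
    if n < 1 or len(tokens) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


hyp_a = ["the", "close", "date", "of", "the"]
hyp_b = ["what", "is", "the"]

# "the close" already occurs in hyp_a, so "close" is banned there ...
banned_a = banned_next_tokens(hyp_a, 2)
# ... but hyp_b is checked only against itself, so nothing is banned.
banned_b = banned_next_tokens(hyp_b, 2)
```

In this toy case `hyp_b` is free to continue with "close" even though `hyp_a` already contains "the close", which matches the behaviour described above.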

Thus, I was wondering if there is any simple way to penalize repetition between sentences. My thought is, during beam search, to penalize the probabilities of repeated words on different branches at the same or nearby steps. Is there open source code available for this? If not, is there anything I need to pay attention to when modifying the generate() function for this?

Hey @mengyahu - this sounds like a cool use case! You are right: with the current generate() method it is not really possible to avoid repetitions between sentences. It’s quite a special case, so I’d suggest that after this PR is merged: Big `generate()` refactor, you make a fork of the transformers repo and try to tweak the beam_scorer.update() function (or the BeamSearchScorer class in general) to add a penalty as needed.
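
As a rough idea of what such a penalty could look like, here is a toy, library-independent sketch. The function `penalize_cross_beam`, its `window` and `penalty` parameters, and the word-level log-probabilities are all hypothetical; nothing here is part of transformers, and a real version would operate on score tensors inside the beam search loop.

```python
def penalize_cross_beam(step_log_probs, other_hypotheses, step, window=1, penalty=1.0):
    """Lower the score of tokens that other hypotheses used at nearby steps.

    step_log_probs:   dict mapping candidate token -> log-probability for this beam
    other_hypotheses: token lists of the other beams (hypothetical inputs)
    step:             position being decoded in the current beam
    window:           how many positions before/after `step` count as "nearby"
    """
    lo = max(0, step - window)
    hi = step + window + 1
    # Collect tokens the other beams placed at the same or nearby positions.
    seen = {tok for hyp in other_hypotheses for tok in hyp[lo:hi]}
    return {
        tok: (lp - penalty if tok in seen else lp)
        for tok, lp in step_log_probs.items()
    }


other = [["what", "is", "the", "expected", "close", "date"]]
scores = {"close": -0.2, "closing": -1.0, "end": -1.5}

adjusted = penalize_cross_beam(scores, other, step=4, window=1, penalty=1.0)
best = max(adjusted, key=adjusted.get)
```

Without the penalty this beam would pick "close". With it, "close" is pushed below "closing", so the beam diverges from the other hypothesis at that step. The penalty strength and window would need tuning, since too large a value can hurt fluency.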

Hey @mengyahu. I’m facing a similar issue where I’m getting repeated sentences in summaries I’m looking to produce. Did you get a chance to add this penalty for repeated sentences? Happy to help work on it if not.

@kmfoda I have not found a way to add that penalty yet. As I moved on to other projects soon after I posted the question, I did not check whether the PR mentioned by @patrickvonplaten was merged.
So it would be great if you could continue working on this and post a solution if you find one.

Thanks @mengyahu. Actually for my use case I found that no_repeat_ngram_size worked great because I was looking to avoid sentence repetitions in the same single output. I’m guessing you want to avoid repetitions across the multiple outputs produced. Let me have a think about how that might be done and if I make some progress I’ll submit a PR.
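
Until something like that exists in the decoder, one workaround for the multiple-output case is to over-generate and filter afterwards. The sketch below is a post-processing step, not a change to beam search. The helper names, the bigram Jaccard measure, and the 0.5 threshold are assumptions chosen for illustration.

```python
def ngrams(text, n=2):
    """Lowercased word n-grams of a sentence, ignoring a trailing '?'."""
    words = text.lower().rstrip("?").split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_diverse(candidates, max_overlap=0.5, n=2):
    """Greedily keep candidates whose n-gram overlap with every kept one is low."""
    kept = []
    for cand in candidates:
        grams = ngrams(cand, n)
        if all(jaccard(grams, ngrams(k, n)) <= max_overlap for k in kept):
            kept.append(cand)
    return kept


paraphrases = [
    "What will be the expected close date of the opportunity?",
    "What is the expected closing date for the opportunity that you are considering?",
    "What is the expected close date of the opportunity?",
    "What is the expected close date on the opportunity?",
    "When would be the expected close date of the opportunity?",
]
for sentence in filter_diverse(paraphrases):
    print(sentence)
```

The filter is order-dependent: candidates are visited in the order given, so putting the highest-scoring outputs first keeps those and drops their near-duplicates.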