I would like to fine-tune T5 for diverse paraphrase generation. For each original sentence, I would like several different paraphrases to be generated, but the current results contain sentences that are very similar to each other.
Example:
Original Question ::
What is the expected close date of the opportunity
Paraphrased Questions Generated by T5::
0: What will be the expected close date of the opportunity?
1: What is the expected closing date for the opportunity that you are considering?
2: What is the expected close date of the opportunity?
3: What is the expected close date on the opportunity?
4: When would be the expected close date of the opportunity?
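Roughly, the generation looks like this (the checkpoint name, prompt prefix, and decoding values below are just placeholders, not my exact setup):

```python
# Minimal sketch of producing several paraphrases per input with T5;
# checkpoint, prompt prefix, and decoding settings are illustrative only.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-base"  # placeholder for a fine-tuned paraphrasing checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "paraphrase: What is the expected close date of the opportunity"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=5,  # ask for several paraphrases per input
    max_length=64,
)
for i, out in enumerate(outputs):
    print(i, tokenizer.decode(out, skip_special_tokens=True))
```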
I tried to add a diversity measure during training, but was told it wouldn't work.
Thus, I want to directly force the decoder to avoid repeating n-grams across the generated sentences at test time. The `generate` function has two relevant parameters: `repetition_penalty` and `no_repeat_ngram_size`. I checked the paper and the source code, and if I understand correctly, they only avoid repetition within a single beam rather than between sentences. Unsurprisingly, I tried different values for the two parameters and saw no effect (see the call below).
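For concreteness, continuing the sketch above, this is the kind of call I mean; the parameter values are only illustrative:

```python
# Both options act within each individual hypothesis, so they do not make the
# returned sequences differ from one another.
outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=5,
    repetition_penalty=1.5,   # penalizes tokens already used in the same sequence
    no_repeat_ngram_size=3,   # blocks repeated n-grams within the same sequence
    max_length=64,
)
```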
Thus, I was wondering whether there is any simple way to penalize repetition between sentences. My idea is, during beam search, to penalize the probabilities of tokens that already appear on other branches at the same or nearby steps. Is there open-source code available for this? If not, is there anything I should pay attention to when modifying the `generate()` function for this?
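To make the idea concrete, here is a rough sketch of what I have in mind, written as a custom LogitsProcessor passed to `generate()` (recent transformers versions accept a `logits_processor` argument). It is only an approximation of the idea, since it penalizes tokens that sibling beams chose at earlier steps rather than at the exact same step:

```python
# Rough sketch only: down-weight tokens already emitted by the *other* beams of
# the same example, following the divide/multiply convention of repetition_penalty.
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class CrossBeamRepetitionPenalty(LogitsProcessor):
    def __init__(self, num_beams: int, penalty: float = 2.0):
        self.num_beams = num_beams
        self.penalty = penalty  # > 1.0 discourages repeated tokens

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # During beam search, input_ids has shape (batch_size * num_beams, cur_len)
        # and scores has shape (batch_size * num_beams, vocab_size).
        if self.num_beams < 2:
            return scores
        batch_size = input_ids.shape[0] // self.num_beams
        for b in range(batch_size):
            beams = input_ids[b * self.num_beams : (b + 1) * self.num_beams]
            for i in range(self.num_beams):
                # Tokens already chosen by the sibling beams of this example
                others = torch.cat([beams[j] for j in range(self.num_beams) if j != i])
                row = b * self.num_beams + i
                vals = scores[row, others]
                # With beam search these scores are log-probs (<= 0), so this
                # effectively multiplies them by `penalty`, making the tokens less likely.
                scores[row, others] = torch.where(vals > 0, vals / self.penalty, vals * self.penalty)
        return scores

# Continuing the earlier sketch (same `model` and `inputs`):
outputs = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=5,
    max_length=64,
    logits_processor=LogitsProcessorList([CrossBeamRepetitionPenalty(num_beams=10, penalty=2.0)]),
)
```

Is this roughly the right place to hook in, or is there a better way to do it inside `generate()` itself?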