Transformers - repetition_penalty parameter

Hello everyone,

I am currently working on a project in which I need to translate text from Japanese to English. I am using a pretrained MarianMT model. My problem is that the translated text sometimes repeats itself.

I’ve used the repetition_penalty=1.5 parameter to stop this effect, and it seems to work fine for the moment.
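For reference, this is roughly how I’m calling generate() (a minimal sketch; I’m assuming the Helsinki-NLP/opus-mt-ja-en checkpoint here, so adjust the model name to whatever you actually use):

from transformers import MarianMTModel, MarianTokenizer

# Assumed Japanese-to-English checkpoint; swap in your own if it differs.
model_name = "Helsinki-NLP/opus-mt-ja-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# "猫が好きです。" = "I like cats."
inputs = tokenizer(["猫が好きです。"], return_tensors="pt", padding=True)
outputs = model.generate(**inputs, num_beams=4, repetition_penalty=1.5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
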
I haven’t had enough time yet to go through my entire dataset and check whether this parameter causes translation problems for inputs that legitimately contain repetition and that I don’t want to see shortened.

Are there any cons to this method?

It usually works well, but it is a bit of a blunt instrument. For example, it penalizes every repeated token, including tokens in the middle or at the end of a word, stopwords, and punctuation. If the repetition penalty is high, this can result in funky outputs.

For example, if you take the sentence “The United States (U.S.) is the third-largest country in the world and is the largest country in the Americas.”, you get tokens like this (I used the T5Tokenizer, but any tokenizer will produce a similar split):

['▁The', '▁United', '▁States', '▁(', 'U', '.', 'S', '.', ')', '▁is', '▁the', '▁third', '-', 'large', 's', 't', '▁country', '▁in', '▁the', '▁world', '▁and', '▁is', '▁the', '▁largest', '▁country', '▁in', '▁the', '▁America', 's', '.']
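
If you want to reproduce that split yourself, something like this should do it (a quick sketch using the t5-small checkpoint; any SentencePiece-based tokenizer gives a comparable result):

from transformers import T5Tokenizer

# t5-small is just an example checkpoint; any SentencePiece-based tokenizer
# breaks words into similar subword pieces.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
sentence = ("The United States (U.S.) is the third-largest country "
            "in the world and is the largest country in the Americas.")
print(tokenizer.tokenize(sentence))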

In this example, the repetition penalty will penalize the “s” in “the Americas” (because it already saw an “s” token). If the repetition penalty is high, the model could end up writing something weird like “… the largest country in the America”.
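
To make that concrete, here is a tiny sketch of what the penalty does to the next-token scores. The token ids and scores below are made up, but RepetitionPenaltyLogitsProcessor is, as far as I know, the processor generate() applies under the hood when you set repetition_penalty:

import torch
from transformers import RepetitionPenaltyLogitsProcessor

# Toy setup: a 5-token vocabulary in which token id 3 (think of it as the "s"
# piece) has already been generated.
generated_ids = torch.tensor([[3]])
next_token_scores = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0]])

processor = RepetitionPenaltyLogitsProcessor(penalty=1.5)
penalized = processor(generated_ids, next_token_scores)
print(penalized)
# Token id 3 drops from 4.0 to 4.0 / 1.5 ≈ 2.67 (positive scores are divided by
# the penalty, negative ones are multiplied), so the model becomes less likely
# to emit that "s" again, regardless of which word it belongs to.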

In my experience, a rep penalty of 1.5 is high enough that you very well might see stuff like this happen. I’d proofread a bunch of your model’s outputs and lower the rep penalty if you do.


That makes it much clearer, thank you for the details!