Force mBART to generate tokens in target language during backtranslation

Hi there,

When fine-tuning mBART for translation with on-the-fly backtranslation, the paper states that for the first 1000 steps the model is constrained to generate tokens only in the target language (to avoid it simply copying the source text). Specifically, they “mask out the output probability of predicting tokens which appear less than 1% in the target monolingual corpus”.

Any idea how to do this with the Hugging Face transformers library?
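
For context, the closest I've come is a custom `LogitsProcessor` that masks out every token outside an allowed set built from the target monolingual corpus. This is only a rough sketch of what I have in mind, not what the authors did: the frequency threshold, the tiny `target_corpus` placeholder, and the model/language choices are my own assumptions.

```python
import torch
from collections import Counter
from transformers import (
    MBartForConditionalGeneration,
    MBartTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)


class RestrictToTargetVocab(LogitsProcessor):
    """Set the score of every token outside `allowed_ids` to -inf,
    so generate() can only emit tokens seen in the target corpus."""

    def __init__(self, allowed_ids):
        self.allowed_ids = torch.tensor(sorted(allowed_ids))

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0
        return scores + mask


model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="ro_RO"
)

# Placeholder monolingual target-language corpus (my own toy example).
target_corpus = ["O propoziție în limba țintă.", "Altă propoziție monolinguală."]

# Count subword frequencies in the target corpus and keep frequent tokens.
# How exactly the paper's "less than 1%" threshold is meant is the part
# I'm unsure about, so this cutoff is just a guess.
counts = Counter()
for sent in target_corpus:
    counts.update(tokenizer(sent)["input_ids"])
total = sum(counts.values())
allowed_ids = {tok for tok, c in counts.items() if c / total >= 0.01}

# Keep EOS and the target language code allowed so generation can terminate.
allowed_ids.add(tokenizer.eos_token_id)
allowed_ids.add(tokenizer.lang_code_to_id["ro_RO"])

batch = tokenizer(["Some source sentence."], return_tensors="pt")
generated = model.generate(
    **batch,
    decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"],
    logits_processor=LogitsProcessorList([RestrictToTargetVocab(allowed_ids)]),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

What I can't figure out is whether this kind of masking can be hooked into the backtranslation step during training the way the paper describes, hence the question.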

Thank you!
