Issue with MBart50 translation

Hi,

I am having an issue with the new MBart50 - I was wondering if you could help me figure out what I am doing wrong.

I am trying to copy code from here – specifically, I tweaked it to translate a sentence from French into Persian.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_fr = "Paris est toujours une bonne idee"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate French to Persian
tokenizer.src_lang = "fr_XX"
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_fr,
    forced_bos_token_id=tokenizer.lang_code_to_id["fa_IR"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

but it then outputs

['Paris is always a good idea']

(which is obviously in English – not in Persian)

How can I get it to output Persian? I am already passing tokenizer.lang_code_to_id["fa_IR"] as forced_bos_token_id.
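
One way to narrow this down might be to check whether the forced language token actually shows up at the start of the generated ids. The helper below is only a sketch: the name forced_lang_position and the example ids are made up for illustration, and it assumes the language code lands within the first few positions of the output (I believe generate() puts a decoder start token first and the forced token right after it, but I have not verified that for every version).

```python
def forced_lang_position(generated_ids, lang_id, window=3):
    """Return the index of lang_id within the first `window` ids, or -1 if absent."""
    head = list(generated_ids)[:window]
    return head.index(lang_id) if lang_id in head else -1


# Hypothetical ids for illustration only. With the real model they would come from
# generated_tokens[0].tolist() and tokenizer.lang_code_to_id["fa_IR"].
example_ids = [2, 1001, 57, 89, 2]
example_lang_id = 1001

print(forced_lang_position(example_ids, example_lang_id))
```

If this returns -1 on the real output, the target language code was not applied at all, which would point at the generate() arguments rather than at the model's translation quality.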

Thanks

I get the output below. However, longer strings come back in French rather than Persian, and I am not 100% sure why. My guess is a lack of French-Persian training sentence pairs. Good luck!

سلام

It was returned by this snippet:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_fr = "Bonjour"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="fr_XX")

model_inputs = tokenizer(article_fr, return_tensors="pt")

generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fa_IR"]
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
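
To see quickly which inputs come back in Persian and which do not, a rough script check like the one below may help. It is only a sketch: looks_persian and its threshold are names I made up, and it just counts letters in the Arabic Unicode block (U+0600 to U+06FF), so it cannot tell Persian apart from Arabic or Urdu.

```python
def looks_persian(text, threshold=0.5):
    """Return True if at least `threshold` of the letters fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters) >= threshold


# The two outputs reported in this thread, used here as plain strings.
outputs = ["سلام", "Paris is always a good idea"]
for out in outputs:
    print(repr(out), looks_persian(out))
```

Running it over the decoded outputs for a list of test sentences should show where the model stops producing Persian script.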

Yes – thanks! I get the same results for “Bonjour” as well.