Hi all,
I am seeing weird behavior with mBART-50 and Spanish. Please look at the code below:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

text = "http://www.ted.com/talks/stephen_palumbi_following_the_mercury_trail.html"
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")

# Mark the input as Spanish and force English as the target language.
tokenizer.src_lang = "es_XX"
encoded = tokenizer(text, return_tensors="pt")
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
The output is:
['(b) To continue to cooperate closely with the Special Rapporteur on extrajudicial, summary or arbitrary executions, the Special Rapporteur on torture and other cruel, inhuman or degrading treatment or punishment, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on torture and other cruel, inhuman or degrading treatment or punishment, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on violence against women, its causes and consequences, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special Rapporteur on the sale of children, child prostitution and child pornography, the Special']
However, if I change the source language to French (tokenizer.src_lang = "fr_XX") or to any other language, I get the following output (which is what I would expect):
['http://www.ted.com/talks/stephen_palumbi_following_the_mercury_trail.html']
The behavior is similar with other texts as well (e.g., 888). Do you know why this happens only with Spanish? Also, do you have any idea how to correct it?
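In case it helps with reproducing this, here is a minimal sketch of how the comparison across source languages could be run in one go. It assumes a model and tokenizer loaded as in the snippet above; the helper name compare_source_languages and its parameters are my own, not part of transformers.

```python
def compare_source_languages(model, tokenizer, texts, src_langs, tgt_lang="en_XX"):
    """Translate every text under every source-language code and collect the results.

    Hypothetical helper for illustration: `model` and `tokenizer` are assumed to be
    the mBART-50 objects from the snippet above.
    """
    results = {}
    for src_lang in src_langs:
        # Tell the tokenizer which source language code to use when encoding.
        tokenizer.src_lang = src_lang
        for text in texts:
            encoded = tokenizer(text, return_tensors="pt")
            generated = model.generate(
                **encoded,
                # Force the decoder to start with the target language code.
                forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
            )
            decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
            results[(src_lang, text)] = decoded[0]
    return results
```

Calling it with texts such as the TED URL and "888", and src_langs such as ["es_XX", "fr_XX"], returns a dictionary keyed by (source language, text), which makes it easy to see which combinations come back unchanged and which do not.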
Thanks!