How to apply TranslationPipeline from English to Brazilian Portuguese?

I’ve tried the following approach, with no success:

from transformers import pipeline

translator = pipeline(
    model="t5-small", 
    task="translation_en_to_br"
    )

translator("How old are you?", src_lang="en", tgt_lang="br")
# [{'translation_text': '         '}]

Could you give me some directions?

As far as I can tell, T5 has only been trained/finetuned on English, German, French, and Romanian; see Section 3.1.3 of the paper. I am not aware of Brazilian Portuguese models. Also, I don’t think Brazilian Portuguese has an official language code in these models, so “br” is unlikely to work anyway.
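As a sanity check, you can see that the same pipeline pattern works when you stick to one of T5’s supported pairs (a minimal sketch, assuming `t5-small` downloads successfully in your environment; the exact output text may vary by transformers version):

```python
from transformers import pipeline

# English -> French is one of the pairs T5 was actually trained on,
# so this task string is recognized and produces real French output.
translator = pipeline(task="translation_en_to_fr", model="t5-small")

result = translator("How old are you?")
print(result[0]["translation_text"])
```

The failure you saw with `translation_en_to_br` comes from asking the model for a language it never learned, not from the pipeline API itself.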

I’ve checked the following, but it produces garbage:
pipeline('translation_en_to_br', model='Helsinki-NLP/opus-mt-en-mul')('>>br<<How old are you?')
[{'translation_text': '♫ Horatos edad tu?'}]

Maybe you should try Narrativa/mbart-large?

Thank you @Marcin,

I had already tested Narrativa/mbart-large, which produces the following result:

!pip install transformers

from transformers import pipeline

translator = pipeline(
    model="Narrativa/mbart-large-50-finetuned-opus-en-pt-translation",
    task="translation_en_to_pt"
    )


translator("How old are you?", src_lang="en", tgt_lang="pt")
# [{'translation_text': 'pt - - - - - - - - - - - - - - - - - - - - - - - - - - 
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'}]

Please use the code from the model card:

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = 'Narrativa/mbart-large-50-finetuned-opus-en-pt-translation'

tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
model = MBartForConditionalGeneration.from_pretrained(ckpt)

tokenizer.src_lang = 'en_XX'

def translate(text):
    inputs = tokenizer(text, return_tensors='pt')
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    output = model.generate(input_ids, attention_mask=attention_mask, forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "How old are you?"
translation = translate(text)

print(f"text = {text}\ntranslation = {translation}")

It outputs:

text = How old are you?
translation = Quantos anos tens?

@BramVanroy,

I’ve tried different codes (pt, pt_br, pt_BR) based on Helsinki-NLP/opus-mt-en-ROMANCE model card.

Maybe it is a TranslationPipeline-related issue.

The output is correct when using the following approach:

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = 'Narrativa/mbart-large-50-finetuned-opus-en-pt-translation'

tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
model = MBartForConditionalGeneration.from_pretrained(ckpt).to("cuda")

tokenizer.src_lang = 'en_XX'

def translate(text):
    inputs = tokenizer(text, return_tensors='pt')
    input_ids = inputs.input_ids.to('cuda')
    attention_mask = inputs.attention_mask.to('cuda')
    output = model.generate(input_ids, attention_mask=attention_mask, forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
    return tokenizer.decode(output[0], skip_special_tokens=True)

translate("Who are you?")
# Quem és tu?
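For what it’s worth, the pipeline itself may also work with this checkpoint if you pass mBART’s own language codes (`en_XX` / `pt_XX`) instead of bare `"en"`/`"pt"` — a sketch under the assumption that your transformers version forwards `src_lang`/`tgt_lang` through to the mBART-50 tokenizer:

```python
from transformers import pipeline

ckpt = "Narrativa/mbart-large-50-finetuned-opus-en-pt-translation"

# Generic "translation" task; the language pair is supplied per call.
translator = pipeline("translation", model=ckpt)

# mBART-50 uses xx_XX-style codes, so "en"/"pt" would not resolve to
# the tokens the model expects.
result = translator("How old are you?", src_lang="en_XX", tgt_lang="pt_XX")
print(result[0]["translation_text"])
```

If that still produces garbage on your version, the manual `generate` approach above with `forced_bos_token_id` remains the reliable fallback.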