How to apply TranslationPipeline from English to Brazilian Portuguese?
I’ve tried the following approach with no success:
from transformers import pipeline

translator = pipeline(
    model="t5-small",
    task="translation_en_to_br"
)
translator("How old are you?", src_lang="en", tgt_lang="br")
# [{'translation_text': ' '}]
Could you give me some directions?
As far as I can tell, T5 has only been trained/finetuned on English, German, French, and Romanian; see Section 3.1.3 of their paper. I am not aware of Brazilian Portuguese models. Also, I don’t think Brazilian Portuguese has an official language code, so “br” is not likely to work anyway.
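For reference, here is a minimal sketch of what t5-small does support out of the box: the only translation tasks wired into its config are English to German, French, and Romanian (the language pair and example sentence here are just illustrative):

from transformers import pipeline

# t5-small only covers the WMT pairs it was trained on, e.g. English to German.
translator = pipeline(task="translation_en_to_de", model="t5-small")
print(translator("How old are you?"))
# e.g. [{'translation_text': 'Wie alt sind Sie?'}]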
Marcin (August 31, 2021, 7:39pm):
I’ve checked the following, but it produces garbage:
from transformers import pipeline

pipeline('translation_en_to_br', model='Helsinki-NLP/opus-mt-en-mul')('>>br<<How old are you?')
# [{'translation_text': '♫ Horatos edad tu?'}]
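Note that br is the ISO 639-1 code for Breton, not Brazilian Portuguese, which may explain the odd output. As a rough sketch (not verified on this exact checkpoint), the Helsinki-NLP/opus-mt-en-ROMANCE model selects the target variant via a token prepended to the text; the exact tokens (e.g. >>pt<< or >>pt_br<<) are listed on its model card:

from transformers import pipeline

# Illustrative only: the target-language token must match one listed on the model card.
romance = pipeline('translation_en_to_ROMANCE', model='Helsinki-NLP/opus-mt-en-ROMANCE')
print(romance('>>pt_br<< How old are you?'))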
Maybe you should try Narrativa/mbart-large?
Thank you @Marcin,
I had already tested it with Narrativa/mbart-large, which produces the following result:
!pip install transformers
from transformers import pipeline

translator = pipeline(
    model="Narrativa/mbart-large-50-finetuned-opus-en-pt-translation",
    task="translation_en_to_pt"
)
translator("How old are you?", src_lang="en", tgt_lang="pt")
# [{'translation_text': 'pt - - - - - - - - - - - - - - - - - - - - - - - - - -
#  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'}]
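One thing that might be worth checking (a sketch, not verified against this exact checkpoint): mBART-50 models expect their own locale-style language codes such as 'en_XX' and 'pt_XX', not bare 'en'/'pt', so passing those to the pipeline may already help, assuming a transformers version whose translation pipeline forwards src_lang/tgt_lang to the tokenizer:

from transformers import pipeline

translator = pipeline(
    task="translation",
    model="Narrativa/mbart-large-50-finetuned-opus-en-pt-translation"
)
# mBART-50 language codes, not bare ISO codes.
translator("How old are you?", src_lang="en_XX", tgt_lang="pt_XX")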
Marcin (August 31, 2021, 8:37pm):
Please use the code from the model card:
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = 'Narrativa/mbart-large-50-finetuned-opus-en-pt-translation'
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
model = MBartForConditionalGeneration.from_pretrained(ckpt)

# mBART-50 needs the source language set on the tokenizer ('en_XX' for English).
tokenizer.src_lang = 'en_XX'

def translate(text):
    inputs = tokenizer(text, return_tensors='pt')
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    # Force the first generated token to be the Portuguese language code.
    output = model.generate(input_ids, attention_mask=attention_mask,
                            forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "How old are you?"
translation = translate(text)
print(f"text = {text}\ntranslation = {translation}")
It outputs:
text = How old are you?
translation = Quantos anos tens?
@BramVanroy,
I’ve tried different codes (pt, pt_br, pt_BR) based on the Helsinki-NLP/opus-mt-en-ROMANCE model card.
Maybe it is a TranslationPipeline-related issue.
The output is correct when using the following approach:
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

ckpt = 'Narrativa/mbart-large-50-finetuned-opus-en-pt-translation'
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
model = MBartForConditionalGeneration.from_pretrained(ckpt).to("cuda")

tokenizer.src_lang = 'en_XX'

def translate(text):
    inputs = tokenizer(text, return_tensors='pt')
    # Move the inputs to the same device as the model.
    input_ids = inputs.input_ids.to('cuda')
    attention_mask = inputs.attention_mask.to('cuda')
    output = model.generate(input_ids, attention_mask=attention_mask,
                            forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
    return tokenizer.decode(output[0], skip_special_tokens=True)

translate("Who are you?")
# Quem és tu?
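If you need to translate many sentences at once, here is a minimal batched sketch reusing the tokenizer and model from the snippet above (the batch size and padding choices are mine, not from the original post):

def translate_batch(texts, batch_size=8):
    # Translate a list of English sentences to Portuguese in batches.
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True).to('cuda')
        output = model.generate(**inputs,
                                forced_bos_token_id=tokenizer.lang_code_to_id['pt_XX'])
        translations.extend(tokenizer.batch_decode(output, skip_special_tokens=True))
    return translations

print(translate_batch(["Who are you?", "How old are you?"]))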