Fine-tune T5 for English-Vietnamese translation

I fine-tuned the t5-small model following this official example script.
After training, I generated a translation using the following code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
max_input_length = 128
max_target_length = 128
# "models" is the local directory holding the fine-tuned checkpoint
model_name = "models"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = ["translate English to Vietnamese: I am so happy today"]
model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors='pt', padding=True)
translation = model.generate(**model_inputs)
# Decode without special tokens
tokenizer.batch_decode(translation, skip_special_tokens=True)
['Tôi rt hi ngày nay']
# Decode with special tokens kept
tokenizer.batch_decode(translation, skip_special_tokens=False)
"[' Tôi rt hi ngày nay']"

As you can see, many Vietnamese characters (UTF-8 letters with diacritics) are missing from the output. The correct translation would be: “Tôi rất hạnh phúc ngày hôm nay”.

What has gone wrong? Do I need to train from scratch, given that Vietnamese was not used in the pre-training of T5?
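One check that might narrow this down (a sketch, not a confirmed diagnosis): see whether the tokenizer maps the Vietnamese characters to its unknown token. If it does, the model never sees those characters during fine-tuning, and `skip_special_tokens=True` would drop the unknown tokens when decoding, which would match the gaps above. The helper name `unknown_token_positions` and the `demo` function are made up for illustration, and using the base `"t5-small"` tokenizer in `demo` is an assumption; the local `"models"` directory could be used instead.

```python
from typing import List


def unknown_token_positions(tokenizer, text: str) -> List[int]:
    """Return the indices of encoded tokens that equal the tokenizer's unknown-token id."""
    ids = tokenizer(text)["input_ids"]
    return [i for i, tok_id in enumerate(ids) if tok_id == tokenizer.unk_token_id]


def demo() -> None:
    # Needs the `transformers` package and access to the "t5-small" checkpoint
    # (assumption: the fine-tuned model reuses this tokenizer unchanged).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    target = "Tôi rất hạnh phúc ngày hôm nay"
    ids = tokenizer(target)["input_ids"]
    # Show the pieces the tokenizer produced and where unknown tokens appear.
    print(tokenizer.convert_ids_to_tokens(ids))
    print(unknown_token_positions(tokenizer, target))
```

Calling `demo()` prints the token pieces for the expected translation. A non-empty list of positions would suggest the vocabulary does not cover those characters, in which case the tokenizer (not only the model weights) would need attention.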

Since this site does not allow special characters, I’m adding the following figure to illustrate the output when skip_special_tokens = False.


I am facing the same problem. Could someone please help?