I fine-tuned the t5-small model following this official example script.
After training, I generated the translation using the following code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
max_input_length = 128
max_target_length = 128
model_name = "models"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = [âtranslate English to Vietnamese: I am so happy todayâ]
model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors=âptâ, padding=True)
translation = model.generate(**model_inputs)
tokenizer.batch_decode(translation, skip_special_tokens=True)
['Tôi rt hi ngày nay']
tokenizer.batch_decode(translation, skip_special_tokens=False)
[' Tôi rt hi ngày nay']
As you can see, many UTF-8 characters are not recognized. The true translation would be: "Tôi rất hạnh phúc ngày hôm nay".
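If I understand correctly, characters that are missing from the tokenizer's SentencePiece vocabulary get mapped to `<unk>`, and `skip_special_tokens=True` then drops them silently when decoding. Here is a minimal toy sketch of that effect (the character-level `vocab` below is a hypothetical stand-in, not T5's real subword vocabulary):

```python
# Toy character-level "vocab" standing in for T5's SentencePiece vocabulary.
# Vietnamese diacritics like "ấ", "ạ", "ú" are deliberately missing.
vocab = set("Tôi rtnhgàypcma")

def encode(text):
    # Characters not in the vocab map to the special token "<unk>".
    return [c if c in vocab else "<unk>" for c in text]

def decode(tokens, skip_special_tokens=True):
    # With skip_special_tokens=True, "<unk>" tokens are silently dropped.
    if skip_special_tokens:
        tokens = [t for t in tokens if t != "<unk>"]
    return "".join(tokens)

tokens = encode("Tôi rất hạnh phúc ngày hôm nay")
print(decode(tokens))  # prints "Tôi rt hnh phc ngày hôm nay"
```

The output reproduces the symptom above: every character the vocabulary cannot represent simply vanishes from the decoded string.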
What has gone wrong? Do I need to train from scratch, given that Vietnamese was not used in the pre-training of T5?