I fine-tuned the t5-small model following this official example script.
After training, I generated the translation using the following code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
max_input_length = 128
max_target_length = 128
model_name = "models"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = [âtranslate English to Vietnamese: I am so happy todayâ]
model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors=âptâ, padding=True)
translation = model.generate(**model_inputs)
tokenizer.batch_decode(translation, skip_special_tokens=True)
['Tôi rt hi ngày nay']
tokenizer.batch_decode(translation, skip_special_tokens=False)
[' Tôi rt hi ngày nay']
As you can see, many UTF-8 characters are not recognized. The true translation would be: "Tôi rất hạnh phúc ngày hôm nay".
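If I understand correctly, characters that are missing from the tokenizer's SentencePiece vocabulary get mapped to `<unk>`, and `skip_special_tokens=True` then drops them silently when decoding. Here is a minimal toy sketch of that effect (the character-level `vocab` below is a hypothetical stand-in, not T5's real subword vocabulary):

```python
# Toy character-level "vocab" standing in for T5's SentencePiece vocabulary.
# Vietnamese diacritics like "ấ", "ạ", "ú" are deliberately missing.
vocab = set("Tôi rtnhgàypcma")

def encode(text):
    # Characters not in the vocab map to the special token "<unk>".
    return [c if c in vocab else "<unk>" for c in text]

def decode(tokens, skip_special_tokens=True):
    # With skip_special_tokens=True, "<unk>" tokens are silently dropped.
    if skip_special_tokens:
        tokens = [t for t in tokens if t != "<unk>"]
    return "".join(tokens)

tokens = encode("Tôi rất hạnh phúc ngày hôm nay")
print(decode(tokens))  # prints "Tôi rt hnh phc ngày hôm nay"
```

The output reproduces the symptom above: every character the vocabulary cannot represent simply vanishes from the decoded string.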
What has gone wrong? Do I need to train from scratch, given that Vietnamese was not used in the pre-training of T5?