I am fine-tuning a T5 model to translate between a custom language pair, Classical Nahuatl and English; T5 was, of course, not pretrained on Classical Nahuatl. I trained my own tokenizers for both the input language (Classical Nahuatl) and the target language (English).
I'm following the Hugging Face tutorial for translation. From my own implementation, it seems that the target texts are tokenized with the input language's tokenizer in the following line of the preprocess_function:

```python
model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
```

That is, the target language's tokenizer is not used at all; the input language's tokenizer is applied to both the input and target sentences.
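For context, here is a minimal sketch of my preprocess_function, adapted from the tutorial (the column names `nah` and `en`, and the tokenizer path, are placeholders for my actual setup):

```python
from transformers import AutoTokenizer

# Tokenizer trained on the input language (Classical Nahuatl);
# following the tutorial, it currently handles the targets too.
tokenizer = AutoTokenizer.from_pretrained("path/to/nahuatl_tokenizer")  # placeholder path

def preprocess_function(examples):
    # "nah" and "en" are placeholder column names for my dataset
    inputs = [ex["nah"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    # text_target= tokenizes the targets with this same tokenizer
    # and stores the resulting ids under "labels"
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
```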
How do I apply the target language's tokenizer to the target sentences? Do I need to tokenize the target sentences separately, before the call above?
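Is something like the sketch below what I should be doing? Here `source_tokenizer` and `target_tokenizer` are the two tokenizers I trained myself (the names are mine, not from the tutorial):

```python
def preprocess_function(examples):
    inputs = [ex["nah"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    # Tokenize the inputs with the Nahuatl tokenizer
    model_inputs = source_tokenizer(inputs, max_length=128, truncation=True)
    # Tokenize the targets with the English tokenizer...
    labels = target_tokenizer(targets, max_length=128, truncation=True)
    # ...and attach their token ids as the labels the model trains against
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```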