I am fine-tuning a T5 model to translate between a custom pair of languages (Classical Nahuatl and English); T5 was not pretrained on Classical Nahuatl, of course. I trained my own tokenizers for both the input language (Classical Nahuatl) and the target language (English).
I'm following the tutorial for translation. From my own implementation, it seems that the target texts are tokenized with the input language's tokenizer in the `preprocess_function`, on this line:

`model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)`

That is, the target language's tokenizer is not used at all; the input language's tokenizer is applied to both the input and the target sentences.
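For reference, here is roughly what my `preprocess_function` looks like, following the tutorial. The tokenizer path, the dataset column names, and the `"nah"`/`"en"` language keys are just placeholders for how my setup happens to be arranged:

```python
from transformers import AutoTokenizer

# My Classical Nahuatl tokenizer; the path is a placeholder for my setup.
tokenizer = AutoTokenizer.from_pretrained("path/to/nahuatl-tokenizer")

max_length = 128

def preprocess_function(examples):
    inputs = [ex["nah"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    # Both the source and the target sentences go through the same `tokenizer` here;
    # text_target= stores the tokenized targets under "labels".
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
```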
How do I apply the target language's tokenizer to the target sentences? Do I need to tokenize the target sentences separately before the call above?
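Concretely, is something like the following what I should be doing instead? This is just my guess: `target_tokenizer` would be my separately trained English tokenizer, and I would fill in the labels myself rather than passing `text_target`:

```python
from transformers import AutoTokenizer

# My two separately trained tokenizers; the paths are placeholders for my setup.
tokenizer = AutoTokenizer.from_pretrained("path/to/nahuatl-tokenizer")
target_tokenizer = AutoTokenizer.from_pretrained("path/to/english-tokenizer")

max_length = 128

def preprocess_function(examples):
    inputs = [ex["nah"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]

    # Source sentences with the Nahuatl tokenizer.
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)

    # Target sentences with the English tokenizer, attached as labels manually.
    labels = target_tokenizer(targets, max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```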