I am fine-tuning a T5 model to translate between a custom pair of languages (Classical Nahuatl and English); T5 was not pretrained on Classical Nahuatl, of course. I trained my own tokenizers for both the input language (Classical Nahuatl) and the target language (English).
I'm following the tutorial for translation. From my own implementation, it seems that the target texts are tokenized with the input language's tokenizer in the `preprocess_function`, on this line:

`model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)`

That is, the target language's tokenizer is not used at all; the input language's tokenizer is applied to both the input and the target sentences.
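For reference, here is roughly what my `preprocess_function` looks like, following the tutorial. The tokenizer path, the dataset column names, and the `"nah"`/`"en"` language keys are just placeholders for how my setup happens to be arranged:

```python
from transformers import AutoTokenizer

# My Classical Nahuatl tokenizer; the path is a placeholder for my setup.
tokenizer = AutoTokenizer.from_pretrained("path/to/nahuatl-tokenizer")

max_length = 128

def preprocess_function(examples):
    inputs = [ex["nah"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]
    # Both the source and the target sentences go through the same `tokenizer` here;
    # text_target= stores the tokenized targets under "labels".
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
```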
How do I apply the target language's tokenizer to the target sentences? Do I need to tokenize the target sentences separately before the call above?
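Concretely, is something like the following what I should be doing instead? This is just my guess: `target_tokenizer` would be my separately trained English tokenizer, and I would fill in the labels myself rather than passing `text_target`:

```python
from transformers import AutoTokenizer

# My two separately trained tokenizers; the paths are placeholders for my setup.
tokenizer = AutoTokenizer.from_pretrained("path/to/nahuatl-tokenizer")
target_tokenizer = AutoTokenizer.from_pretrained("path/to/english-tokenizer")

max_length = 128

def preprocess_function(examples):
    inputs = [ex["nah"] for ex in examples["translation"]]
    targets = [ex["en"] for ex in examples["translation"]]

    # Source sentences with the Nahuatl tokenizer.
    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)

    # Target sentences with the English tokenizer, attached as labels manually.
    labels = target_tokenizer(targets, max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```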