Trainer for MT with separate source and target tokenizers

I’m looking for examples of using HuggingFace’s Trainer with different source and target tokenizers for NMT. I see there is an as_target_tokenizer context manager available, but I could not find how to specify two separate tokenizers in the configuration.
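
For reference, this is the pattern I have seen so far, as a minimal sketch assuming a single shared tokenizer from a pretrained checkpoint (the checkpoint and strings below are just placeholders, not my actual setup):

```python
# Minimal sketch of the as_target_tokenizer pattern with one shared
# tokenizer; "t5-small" and the example strings are placeholders only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

sources = ["Hello world."]
targets = ["Bonjour le monde."]

model_inputs = tokenizer(sources, max_length=128, truncation=True)
with tokenizer.as_target_tokenizer():
    # Inside this context the tokenizer applies its target-side settings.
    labels = tokenizer(targets, max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]
```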

To be specific, I have an EN-JP translation task that I want to train with HuggingFace’s Trainer. I have HuggingFace tokenizers trained separately for EN and JP, and I want to use them to train a vanilla seq2seq transformer model from scratch. What is the best way to set up this configuration?
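
To make the question concrete, this is roughly what I am trying to do; en_tokenizer.json and ja_tokenizer.json are my own trained tokenizer files, and the preprocessing below is only my guess at how to wire the two tokenizers together:

```python
# My guess at preprocessing with two separate tokenizers: EN tokenizer for
# the inputs, JP tokenizer for the labels. The file names and the tiny
# example dataset are placeholders for my own data.
from datasets import Dataset
from transformers import PreTrainedTokenizerFast

en_tokenizer = PreTrainedTokenizerFast(tokenizer_file="en_tokenizer.json")
ja_tokenizer = PreTrainedTokenizerFast(tokenizer_file="ja_tokenizer.json")

def preprocess(batch):
    model_inputs = en_tokenizer(batch["en"], max_length=128, truncation=True)
    labels = ja_tokenizer(batch["ja"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw = Dataset.from_dict({"en": ["Hello world."], "ja": ["こんにちは世界。"]})
tokenized = raw.map(preprocess, batched=True, remove_columns=["en", "ja"])
```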

I was able to find the T5 fine-tuning example for EN-FR (Translation), which uses a single PreTrainedTokenizer that handles both the source and the target side, but I could not find a good way to do this from scratch, where I provide a non-pretrained model, the two tokenizers, and the data.
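
For the from-scratch part, the closest I have come up with is something like the sketch below, using EncoderDecoderModel so the encoder and decoder can have different vocabulary sizes. All of the config values are arbitrary, and handing only the target tokenizer to the data collator and trainer is a guess on my part, which is really the core of my question.

```python
# My attempt at a from-scratch setup; continues from the preprocessing
# sketch above (en_tokenizer, ja_tokenizer, tokenized). The sizes are
# arbitrary and the tokenizer passed to the collator/trainer is a guess.
from transformers import (
    BertConfig, EncoderDecoderConfig, EncoderDecoderModel,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

encoder_config = BertConfig(vocab_size=en_tokenizer.vocab_size,
                            hidden_size=512, num_hidden_layers=6, num_attention_heads=8)
decoder_config = BertConfig(vocab_size=ja_tokenizer.vocab_size,
                            hidden_size=512, num_hidden_layers=6, num_attention_heads=8)
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)

model = EncoderDecoderModel(config=config)  # randomly initialized, no pretrained weights
model.config.decoder_start_token_id = ja_tokenizer.bos_token_id  # assumes my tokenizer defines one
model.config.pad_token_id = ja_tokenizer.pad_token_id

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="enja-scratch", per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(ja_tokenizer, model=model),  # or en_tokenizer?
    tokenizer=ja_tokenizer,
)
# trainer.train()
```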

Any help would be appreciated.