I’m trying to retrain t5-small on a Japanese-to-Spanish dataset, and I want to retrain the tokenizer so it handles the vocabulary of both languages.
So far I’ve done this:
from transformers import AutoTokenizer

# `dataset` is the already loaded translation dataset, with "ja" and "es" columns
def get_training_corpus(lang: str):
    ds = dataset["train"]
    # yield batches of 1000 texts so the trainer can stream the corpus
    for start_idx in range(0, len(ds), 1000):
        samples = ds[start_idx : start_idx + 1000]
        yield samples[lang]

tokenizer = AutoTokenizer.from_pretrained("t5-small")
new_tokenizer = tokenizer.train_new_from_iterator(
    get_training_corpus("ja"),
    52000,  # target vocabulary size
)
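
From what I’ve read, T5 uses a single shared SentencePiece vocabulary for both the encoder and the decoder, so maybe I could train one tokenizer over both corpora at once by chaining the two iterators. A rough sketch of what I mean (same "ja"/"es" columns as above, `shared_tokenizer` is just my name for it):

from itertools import chain

# hypothetical: one shared tokenizer trained on both languages at once
shared_tokenizer = tokenizer.train_new_from_iterator(
    chain(get_training_corpus("ja"), get_training_corpus("es")),
    52000,
)
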
But even then I don’t know how to handle the target side of the tokenizer. I’d like to be able to tokenize Japanese like this:
model_inputs = tokenizer(examples["ja"], max_length=max_input_length, truncation=True)
and Spanish like this:
with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples["es"], max_length=max_target_length, truncation=True)
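
For reference, this is how I’d expect the full preprocessing step to look if a single retrained tokenizer really does cover both sides (a sketch I haven’t verified; `max_input_length` and `max_target_length` are defined elsewhere in my script):

def preprocess(examples):
    # source side: tokenize Japanese inputs
    model_inputs = shared_tokenizer(
        examples["ja"], max_length=max_input_length, truncation=True
    )
    # target side: tokenize Spanish labels
    with shared_tokenizer.as_target_tokenizer():
        labels = shared_tokenizer(
            examples["es"], max_length=max_target_length, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

Is training one shared tokenizer over both corpora the right approach here, or does the target side need its own vocabulary?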