I’m trying to retrain t5-small on a Japanese-to-Spanish dataset, and I want to retrain the tokenizer so it handles the vocabulary of both languages.
So far I’ve done this:
from transformers import AutoTokenizer

# `dataset` is the already loaded translation dataset, with "ja" and "es" columns
def get_training_corpus(lang: str):
    ds = dataset["train"]
    # yield batches of 1000 texts so the trainer can stream the corpus
    for start_idx in range(0, len(ds), 1000):
        samples = ds[start_idx : start_idx + 1000]
        yield samples[lang]

tokenizer = AutoTokenizer.from_pretrained("t5-small")
new_tokenizer = tokenizer.train_new_from_iterator(
    get_training_corpus("ja"),
    52000,  # target vocabulary size
)
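
From what I’ve read, T5 uses a single shared SentencePiece vocabulary for both the encoder and the decoder, so maybe I could train one tokenizer over both corpora at once by chaining the two iterators. A rough sketch of what I mean (same "ja"/"es" columns as above, `shared_tokenizer` is just my name for it):

from itertools import chain

# hypothetical: one shared tokenizer trained on both languages at once
shared_tokenizer = tokenizer.train_new_from_iterator(
    chain(get_training_corpus("ja"), get_training_corpus("es")),
    52000,
)
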
But even then I don’t know how to handle the target side of the tokenizer. I’d like to be able to tokenize Japanese like this:
model_inputs = tokenizer(examples["ja"], max_length=max_input_length, truncation=True)
and Spanish like this:
with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples["es"], max_length=max_target_length, truncation=True)
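
For reference, this is how I’d expect the full preprocessing step to look if a single retrained tokenizer really does cover both sides (a sketch I haven’t verified; `max_input_length` and `max_target_length` are defined elsewhere in my script):

def preprocess(examples):
    # source side: tokenize Japanese inputs
    model_inputs = shared_tokenizer(
        examples["ja"], max_length=max_input_length, truncation=True
    )
    # target side: tokenize Spanish labels
    with shared_tokenizer.as_target_tokenizer():
        labels = shared_tokenizer(
            examples["es"], max_length=max_target_length, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

Is training one shared tokenizer over both corpora the right approach here, or does the target side need its own vocabulary?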