How can I re-tokenize the training set in each epoch?

I have a special tokenizer that tokenizes a sentence according to a probability distribution.
For example, 'I like green apple' -> '[I], [like], [green], [apple]' (30%) or '[I], [like], [green apple]' (70%).
Now, during training, I want the Trainer to re-tokenize the dataset in each epoch. How can I do this?

You can define a function that does your special tokenization and set it as the transform for the dataset with set_transform. The transform is applied on the fly, so your function is called every time a data sample is loaded, which means each epoch sees a freshly re-tokenized dataset.

import random

# Example kwargs; hypothetical values, adjust to your setup
tokenizer_kwargs = {'truncation': True, 'padding': 'max_length'}

def tokenize_function(examples):
    # Choose a tokenization scheme at random on every call, so the
    # same text can be tokenized differently on each access
    if random.random() < 0.3:
        return tokenizer_1(
            examples['text'],
            **tokenizer_kwargs
        )
    else:
        return tokenizer_2(
            examples['text'],
            **tokenizer_kwargs
        )

train_dataset = train_dataset.shuffle(seed=training_args.seed)
train_dataset.set_transform(tokenize_function)
# You can then pass this train_dataset to your Trainer
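
As a quick sanity check, reading the same sample repeatedly should show the transform re-running on every access. Below is a minimal sketch of the Trainer wiring; the model variable, output directory, and epoch count are placeholders, not from this thread. Setting remove_unused_columns=False keeps the raw 'text' column available so the transform can still read it:

from transformers import Trainer, TrainingArguments

# The transform runs on every __getitem__, so repeated reads of the
# same index may produce different tokenizations
for _ in range(3):
    print(train_dataset[0]['input_ids'])

# Placeholder wiring: `model` and the argument values are illustrative
training_args = TrainingArguments(
    output_dir='out',
    num_train_epochs=3,
    remove_unused_columns=False,  # keep the 'text' column for the transform
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()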
