I have a special tokenizer that tokenizes a sentence according to a probability distribution.
For example, ‘I like green apple’ -> ‘[I], [like], [green], [apple]’ (30%) or ‘[I], [like], [green apple]’ (70%).
Now, during training, I want the Trainer to retokenize the dataset in each epoch. How can I do that?
You can define a function that performs your special tokenization and set it as the dataset's transform with set_transform. The function is then called every time a data sample is loaded, so each epoch sees a freshly sampled tokenization:
import random

def tokenize_function(examples):
    # Use the first tokenization 30% of the time and the second 70%
    # of the time, matching the distribution in your example.
    if random.random() < 0.3:
        return tokenizer_1(
            examples['text'],
            **tokenizer_kwargs
        )
    else:
        return tokenizer_2(
            examples['text'],
            **tokenizer_kwargs
        )

train_dataset = train_dataset.shuffle(seed=training_args.seed)
# Unlike map(), which tokenizes once up front, set_transform applies
# the function lazily on every __getitem__ call.
train_dataset.set_transform(tokenize_function)
# You can then pass this train_dataset to your Trainer
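As a quick sanity check, here is a minimal, self-contained sketch (the tiny in-memory dataset and the coin_flip_transform stand-in are placeholders for your real tokenizers) that verifies a set_transform function really does run again on every access, which is what makes per-epoch retokenization work:

import random
from datasets import Dataset

ds = Dataset.from_dict({"text": ["I like green apple"]})

def coin_flip_transform(examples):
    # Stand-in for the probabilistic tokenizer: record which branch
    # would have been taken for each example in the batch.
    return {"branch": [0 if random.random() < 0.3 else 1 for _ in examples["text"]]}

ds.set_transform(coin_flip_transform)

# The transform runs on every access, so the same index can come back
# with a different value each time; that is exactly the per-epoch
# behavior you want during training.
print(ds[0]["branch"], ds[0]["branch"])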