How can I re-tokenize the training set in each epoch?

I have a special tokenizer that tokenizes a sentence according to a probability distribution.
For example, 'I like green apple' -> '[I], [like], [green], [apple]' (30%) or '[I], [like], [green apple]' (70%).
Now, during training, I want the Trainer to re-tokenize the dataset in each epoch. How can I do this?

You can define a function that does your special tokenization and set it as the transform for the dataset with set_transform. The transform is applied on the fly, so your function is called every time a data sample is loaded, which means each epoch sees a freshly re-tokenized dataset.

import random

# Example kwargs; hypothetical values, adjust to your setup
tokenizer_kwargs = {'truncation': True, 'padding': 'max_length'}

def tokenize_function(examples):
    # Choose a tokenization scheme at random on every call, so the
    # same text can be tokenized differently on each access
    if random.random() < 0.3:
        return tokenizer_1(
            examples['text'],
            **tokenizer_kwargs
        )
    else:
        return tokenizer_2(
            examples['text'],
            **tokenizer_kwargs
        )

train_dataset = train_dataset.shuffle(seed=training_args.seed)
train_dataset.set_transform(tokenize_function)
# You can then pass this train_dataset to your Trainer
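
As a quick sanity check, reading the same sample repeatedly should show the transform re-running on every access. Below is a minimal sketch of the Trainer wiring; the model variable, output directory, and epoch count are placeholders, not from this thread. Setting remove_unused_columns=False keeps the raw 'text' column available so the transform can still read it:

from transformers import Trainer, TrainingArguments

# The transform runs on every __getitem__, so repeated reads of the
# same index may produce different tokenizations
for _ in range(3):
    print(train_dataset[0]['input_ids'])

# Placeholder wiring: `model` and the argument values are illustrative
training_args = TrainingArguments(
    output_dir='out',
    num_train_epochs=3,
    remove_unused_columns=False,  # keep the 'text' column for the transform
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()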
