DataCollator for selecting a random subset and permutation

moritzwilke · July 20, 2023, 11:23am

Hi,
I want to train a Seq2Seq model on a custom dataset. My input sequence consists of a list of keywords (which I have concatenated into a single string), and the output sequence is a text. The model is supposed to learn to generate a text from these keywords. Since each input contains more keywords than I actually want to train and use the model on, I want to use a data augmentation strategy where I randomly select some of these keywords and shuffle their order. As far as I understand, I could do this in two ways:

1. On the level of data, by just inflating the dataset with samples of shuffled subsets
2. On the fly while training

I think approach 2 would be a little bit more elegant since I don’t need to store redundant data.
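To make the augmentation itself concrete, here is a minimal sketch of the subset-and-shuffle step on a single input string. The function name `sample_keywords`, the `", "` separator, and the example keywords are assumptions for illustration only; my real data may use a different separator.

```python
import random


def sample_keywords(keyword_string, k, sep=", ", rng=random):
    """Return k randomly chosen keywords in random order, re-joined into one string."""
    # Split the concatenated string back into individual keywords (separator is assumed).
    keywords = [kw.strip() for kw in keyword_string.split(sep) if kw.strip()]
    # Never ask for more keywords than are available.
    k = min(k, len(keywords))
    # random.sample draws without replacement and returns the picks in random order,
    # so one call covers both the subset selection and the shuffle.
    return sep.join(rng.sample(keywords, k))


# Each call can yield a different subset and order.
print(sample_keywords("castle, dragon, river, winter, king", k=3))
```

Approach 1 would call this several times per row and store the results; approach 2 would call it once per example each time a batch is built.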

My question is where to inject this into the pipeline. With PyTorch, for example, I could add transformations to a dataset class. But the approach in the Hugging Face library seems to be to use a custom DataCollator? Maybe the DataCollatorForPermutationLanguageModeling is already sufficient for this? Could someone point me in the right direction here?
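For reference, this is roughly what I imagine the collator route would look like: a wrapper that augments the raw keyword string, tokenizes it, and then hands the features to a regular padding collator. Everything here is a hypothetical sketch, not an existing library class: `KeywordAugmentingCollator`, the column names `keywords` and `text`, and the toy tokenizer and collator are stand-ins so the snippet runs without downloading a model.

```python
import random


class KeywordAugmentingCollator:
    """Hypothetical wrapper: augment raw keyword strings, then tokenize and collate."""

    def __init__(self, tokenize_fn, base_collator, augment_fn,
                 source_column="keywords", target_column="text"):
        self.tokenize_fn = tokenize_fn      # (source, target) -> feature dict
        self.base_collator = base_collator  # list of feature dicts -> batch
        self.augment_fn = augment_fn        # keyword string -> augmented keyword string
        self.source_column = source_column  # assumed column names
        self.target_column = target_column

    def __call__(self, examples):
        features = []
        for ex in examples:
            # A fresh subset/permutation is drawn every time an example is batched,
            # so each epoch sees a different variant without storing extra rows.
            source = self.augment_fn(ex[self.source_column])
            features.append(self.tokenize_fn(source, ex[self.target_column]))
        return self.base_collator(features)


def pick_three(keyword_string):
    """Keep three random keywords in random order (separator ', ' is assumed)."""
    keywords = keyword_string.split(", ")
    return ", ".join(random.sample(keywords, min(3, len(keywords))))


# Toy stand-ins for a real tokenizer and padding collator.
def toy_tokenize(source, target):
    return {"input_ids": source.split(", "), "labels": target.split()}


def toy_collate(features):
    return {key: [f[key] for f in features] for key in features[0]}


collator = KeywordAugmentingCollator(toy_tokenize, toy_collate, pick_three)
batch = collator([{"keywords": "castle, dragon, river, winter, king",
                   "text": "A short story."}])
print(batch["input_ids"])
```

With the real library, I would expect `tokenize_fn` to be something like `lambda s, t: tokenizer(s, text_target=t, truncation=True)`, `base_collator` to be a `DataCollatorForSeq2Seq`, and the wrapper to be passed as `data_collator` to `Seq2SeqTrainer`, with `remove_unused_columns=False` in the training arguments so the raw string columns still reach the collator. As far as I can tell, `DataCollatorForPermutationLanguageModeling` is meant for XLNet-style permutation language modeling over tokens, so it probably does not cover keyword-level shuffling.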
