I want to train a Seq2Seq model on a custom dataset. My input sequence consists of a list of keywords (which I have concatenated into a single string), and the output sequence is a text. The model is supposed to learn to generate a text from these keywords. Since the number of keywords per sample in my dataset is larger than the number of keywords I actually want to train and use the model with, I want to use a data augmentation strategy where I randomly select a subset of these keywords and shuffle their order. As far as I understand, I could do this in two ways:
- On the level of data by just inflating the dataset with samples of shuffled subsets
- On the fly while training
I think approach 2 would be a little bit more elegant since I don’t need to store redundant data.
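To make the augmentation concrete, here is a minimal sketch of what I have in mind: a function that samples a random subset of the keywords and shuffles them. The function name, the `min_keep` parameter, and the separator are my own illustrative choices, not part of any library:

```python
import random

def augment_keywords(keyword_string, min_keep=2, sep=" "):
    """Randomly drop some keywords and shuffle the rest.

    `min_keep` (illustrative parameter) bounds how many keywords
    survive, so the input never degenerates to an empty string.
    """
    keywords = keyword_string.split(sep)
    # pick a random subset size between min_keep and all keywords
    k = random.randint(min(min_keep, len(keywords)), len(keywords))
    # random.sample draws without replacement in random order,
    # so it subsets and shuffles in one step
    return sep.join(random.sample(keywords, k))
```

Calling this on the same sample in every epoch would give the model a different keyword subset and ordering each time, which is exactly the on-the-fly variant of approach 2.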
My question is where to inject this into the pipeline. With PyTorch, for example, I could add transformations to a `Dataset` class, but the approach in the Hugging Face library seems to be to use a custom `DataCollator`. Maybe `DataCollatorForPermutationLanguageModeling` is already sufficient for this? Could someone point me in the right direction here?
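In case it helps to see what I mean by the collator route, here is a rough sketch of a custom collate callable that augments the keyword string per batch and then tokenizes. The class name, field names (`"keywords"`, `"text"`), and the `tokenize_fn` stand-in for an actual Hugging Face tokenizer are all my own assumptions for illustration:

```python
import random

class AugmentingCollator:
    """Sketch: augment keyword inputs on the fly, then tokenize.

    `tokenize_fn` is a stand-in for a real tokenizer call; because the
    augmentation runs inside the collator, every epoch sees a fresh
    subset/ordering of each sample's keywords without inflating the
    stored dataset.
    """

    def __init__(self, tokenize_fn, min_keep=2, sep=" "):
        self.tokenize_fn = tokenize_fn
        self.min_keep = min_keep
        self.sep = sep

    def _augment(self, keyword_string):
        keywords = keyword_string.split(self.sep)
        k = random.randint(min(self.min_keep, len(keywords)), len(keywords))
        return self.sep.join(random.sample(keywords, k))

    def __call__(self, batch):
        # batch: list of dicts with "keywords" (input) and "text" (target)
        inputs = [self._augment(ex["keywords"]) for ex in batch]
        targets = [ex["text"] for ex in batch]
        return self.tokenize_fn(inputs, targets)
```

Something like this could presumably be passed as the `data_collator` to a `Trainer`, but I am not sure whether that is the intended extension point or whether the dataset-level `set_transform` route is preferred.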