DataCollator for selecting a random subset and permutation

Hi,
I want to train a Seq2Seq model on a custom dataset. My input sequence consists of a list of keywords (which I have concatenated into a single string), and the output sequence is a text. The model is supposed to learn to generate a text from these keywords. Since the samples in my dataset contain more keywords than I actually want to train and use the model with, I want to use a data augmentation strategy where I randomly select a subset of these keywords and shuffle their order. As far as I understand, I could do this in two ways:

  1. At the level of the data, by inflating the dataset up front with additional samples containing shuffled subsets
  2. On the fly during training

I think approach 2 would be a bit more elegant, since I wouldn't need to store redundant data.
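
For concreteness, the per-sample transform I have in mind is nothing more than this sketch (the upper bound on the number of sampled keywords is a placeholder from my setup):

```python
import random

def sample_and_shuffle(keywords, max_keywords=5):
    # Draw a random number of keywords (up to a placeholder limit) without
    # replacement; random.sample already returns them in random order,
    # so the subset selection and the shuffling come for free.
    k = random.randint(1, min(max_keywords, len(keywords)))
    return " ".join(random.sample(keywords, k))
```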

My question is where to inject this into the pipeline. With PyTorch, for example, I could add transformations to a dataset class, but the approach in the Hugging Face library seems to be to use a custom DataCollator? Maybe DataCollatorForPermutationLanguageModeling is already sufficient for this? Could someone point me in the right direction here?
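
In case it helps to make the question concrete, this is roughly the kind of collator I was picturing for option 2. The column names ("keywords" as a list of strings, "text" as the target), the keyword limit, and the max lengths are placeholders from my setup, not anything prescribed by the library:

```python
import random
from dataclasses import dataclass

from transformers import PreTrainedTokenizerBase


@dataclass
class KeywordShuffleCollator:
    tokenizer: PreTrainedTokenizerBase
    max_keywords: int = 5          # placeholder: upper bound on sampled keywords
    max_input_length: int = 64     # placeholder sequence lengths
    max_target_length: int = 256

    def __call__(self, examples):
        # `examples` is a list of raw dataset rows, assumed here to have a
        # "keywords" field (list of strings) and a "text" field (target text).
        inputs = []
        for ex in examples:
            k = random.randint(1, min(self.max_keywords, len(ex["keywords"])))
            # random.sample draws k keywords without replacement in random order,
            # covering both the subset selection and the shuffling.
            inputs.append(" ".join(random.sample(ex["keywords"], k)))

        batch = self.tokenizer(
            inputs,
            padding=True,
            truncation=True,
            max_length=self.max_input_length,
            return_tensors="pt",
        )
        labels = self.tokenizer(
            [ex["text"] for ex in examples],
            padding=True,
            truncation=True,
            max_length=self.max_target_length,
            return_tensors="pt",
        ).input_ids
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        batch["labels"] = labels
        return batch
```

An instance of this would then be passed to the Trainer via the `data_collator` argument, with the dataset left un-tokenized, so that a fresh subset and order are drawn every time a sample is batched.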