Add a data augmentation process during training, every epoch


I’d like to process my training dataset every epoch.
I want to add random processing as data augmentation, and I want it to happen during training rather than as a one-time preprocessing step.

I think I can do it with Trainer, DataCollator, or __getitem__ of datasets.arrow_dataset.Dataset, but where should I do it?

For the evaluation set and test set, I plan to preprocess in advance using

Thank you in advance.

A DataCollator can help if you put the randomized processing inside its `__call__`, which builds and returns each batch. Implementing `__getitem__` in your Dataset can also work; it all depends on what exactly you are trying to do.
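For the DataCollator route, a minimal sketch might look like the following. The class name `AugmentingCollator` and the random token-masking scheme are illustrations of my own, not part of the transformers API; the point is that because the randomness is drawn inside `__call__`, each epoch sees a freshly augmented version of the same examples:

```python
import torch


class AugmentingCollator:
    """Hypothetical collator sketch: randomly masks tokens as augmentation.

    The names (mask_token_id, mask_prob) are illustrative assumptions,
    not a transformers API.
    """

    def __init__(self, mask_token_id=0, mask_prob=0.1):
        self.mask_token_id = mask_token_id
        self.mask_prob = mask_prob

    def __call__(self, features):
        # Stack per-example lists/tensors into batch tensors.
        input_ids = torch.stack(
            [torch.as_tensor(f["input_ids"]) for f in features]
        )
        labels = torch.stack([torch.as_tensor(f["labels"]) for f in features])
        # Randomness is re-drawn every time a batch is built, so the
        # augmentation differs on every epoch.
        mask = torch.rand(input_ids.shape) < self.mask_prob
        input_ids = input_ids.masked_fill(mask, self.mask_token_id)
        return {"input_ids": input_ids, "labels": labels}
```

You would then pass an instance of this class as the `data_collator` argument of `Trainer`.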

The Trainer itself has nothing implemented for data augmentation, so it won’t help you here.
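The `__getitem__` route is similar: a PyTorch map-style dataset only needs `__len__` and `__getitem__`, so a plain wrapper class is enough. A hypothetical sketch, where the random token-dropout augmentation and the class name `AugmentedDataset` are just examples:

```python
import random


class AugmentedDataset:
    """Illustrative map-style dataset wrapper; not a library API.

    Because __getitem__ runs once per example per epoch, the random
    dropout below produces a different augmentation every epoch.
    """

    def __init__(self, examples, drop_prob=0.1):
        self.examples = examples  # list of dicts with an "input_ids" key
        self.drop_prob = drop_prob

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        item = dict(self.examples[idx])  # copy so the original stays intact
        # Randomly drop tokens; the draw happens at access time.
        item["input_ids"] = [
            t for t in item["input_ids"] if random.random() >= self.drop_prob
        ]
        return item
```

An instance of this can be passed as `train_dataset` to `Trainer`, while the evaluation and test sets keep their fixed preprocessing.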


Hi @sgugger

Thank you for the answer!
I now understand that for my purpose I should work with a DataCollator or with `__getitem__` in a Dataset, depending on what I am trying to do, and that the Trainer is not the place where data augmentation should happen.

Referring to your answer in another question of mine, How to use Seq2SeqTrainer (Seq2SeqDataCollator) in v4.2.1 - #5 by sgugger,
I’ll consider the best way to implement to get something randomized in my case.

Thank you so much!
