Add a data augmentation process during training, every epoch


I’d like to process my training dataset every epoch.
I want to add random processing as data augmentation, and I want it to happen during training rather than as a one-time preprocessing step.

I think I can do it with Trainer, DataCollator, or __getitem__ of datasets.arrow_dataset.Dataset, but where should I do it?

For the evaluation set and test set, I plan to preprocess in advance using

Thank you in advance.

A DataCollator can help if you put the randomized processing inside its `__call__`, which builds and returns each batch. Implementing `__getitem__` in your Dataset can also work; it all depends on what exactly you are trying to do.
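For the DataCollator route, a minimal sketch might look like the following. The class name `AugmentingCollator` and the random token-masking scheme are illustrations of my own, not part of the transformers API; the point is that because the randomness is drawn inside `__call__`, each epoch sees a freshly augmented version of the same examples:

```python
import torch


class AugmentingCollator:
    """Hypothetical collator sketch: randomly masks tokens as augmentation.

    The names (mask_token_id, mask_prob) are illustrative assumptions,
    not a transformers API.
    """

    def __init__(self, mask_token_id=0, mask_prob=0.1):
        self.mask_token_id = mask_token_id
        self.mask_prob = mask_prob

    def __call__(self, features):
        # Stack per-example lists/tensors into batch tensors.
        input_ids = torch.stack(
            [torch.as_tensor(f["input_ids"]) for f in features]
        )
        labels = torch.stack([torch.as_tensor(f["labels"]) for f in features])
        # Randomness is re-drawn every time a batch is built, so the
        # augmentation differs on every epoch.
        mask = torch.rand(input_ids.shape) < self.mask_prob
        input_ids = input_ids.masked_fill(mask, self.mask_token_id)
        return {"input_ids": input_ids, "labels": labels}
```

You would then pass an instance of this class as the `data_collator` argument of `Trainer`.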

The Trainer itself has nothing implemented for data augmentation, so it won’t help you here.
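The `__getitem__` route is similar: a PyTorch map-style dataset only needs `__len__` and `__getitem__`, so a plain wrapper class is enough. A hypothetical sketch, where the random token-dropout augmentation and the class name `AugmentedDataset` are just examples:

```python
import random


class AugmentedDataset:
    """Illustrative map-style dataset wrapper; not a library API.

    Because __getitem__ runs once per example per epoch, the random
    dropout below produces a different augmentation every epoch.
    """

    def __init__(self, examples, drop_prob=0.1):
        self.examples = examples  # list of dicts with an "input_ids" key
        self.drop_prob = drop_prob

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        item = dict(self.examples[idx])  # copy so the original stays intact
        # Randomly drop tokens; the draw happens at access time.
        item["input_ids"] = [
            t for t in item["input_ids"] if random.random() >= self.drop_prob
        ]
        return item
```

An instance of this can be passed as `train_dataset` to `Trainer`, while the evaluation and test sets keep their fixed preprocessing.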


Hi @sgugger

Thank you for the answer!
I now understand that for my purpose I should work with a DataCollator or with `__getitem__` in a Dataset, depending on what I am trying to do, and that the Trainer is not the place where data augmentation should happen.

Referring to your answer in another question of mine, How to use Seq2SeqTrainer (Seq2SeqDataCollator) in v4.2.1 - #5 by sgugger,
I’ll consider the best way to implement to get something randomized in my case.

Thank you so much!
