How to ensure the dataset is shuffled for each epoch using Trainer and Datasets?

I am using the Seq2SeqTrainer and pass a datasets.arrow_dataset.Dataset as train_dataset when instantiating the object. Is the dataset shuffled for each epoch by default? If not, how can I make it shuffled?

For reference, here is the official example script: transformers/run_seq2seq.py at master · huggingface/transformers · GitHub

Thanks!

Still needs help…

The Seq2SeqTrainer (as well as the standard Trainer) uses a PyTorch Sampler to shuffle the dataset. It reshuffles the dataset at each epoch, and it also groups samples of roughly the same length together. You can find the Sampler definition here.
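
For intuition, here is a minimal PyTorch sketch (not the Trainer's actual code) of the mechanism: a RandomSampler draws a fresh permutation every time a new iterator is created, i.e. at the start of every epoch.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# A toy dataset stands in for the datasets.arrow_dataset.Dataset you pass as train_dataset.
dataset = TensorDataset(torch.arange(8))

# RandomSampler produces a new permutation each time the DataLoader is iterated,
# so every epoch sees the data in a different order.
loader = DataLoader(dataset, batch_size=4, sampler=RandomSampler(dataset))

for epoch in range(2):
    order = [int(x) for batch in loader for x in batch[0]]
    print(f"epoch {epoch}: {order}")  # a different permutation each epoch
```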


Hi, is there a parameter that controls whether or not the data gets reshuffled before each epoch? And whether or not it is grouped by length? Thanks!

Additionally, if training is aborted and I restart from a checkpoint, does the checkpoint contain information about the shuffling order for the current epoch and which data points have not yet been seen in that epoch? Thanks!

No, that would be very bad practice, so we don't offer an option to disable the reshuffling.

Grouping by length is controlled by the group_by_length argument in TrainingArguments.
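
For example, a minimal sketch of enabling it (the output_dir and batch size here are just placeholders):

```python
from transformers import Seq2SeqTrainingArguments

# group_by_length is a standard TrainingArguments flag; when True, the Trainer uses a
# length-grouped sampler so each batch contains samples of roughly the same length
# (reducing padding), while the data is still reshuffled between epochs.
training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs",          # hypothetical output path
    per_device_train_batch_size=8,
    group_by_length=True,
)
```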

Yes, training will resume with the same shuffle, at the same point you were at when the checkpoint was saved.
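
In practice, you resume by passing resume_from_checkpoint to train(); the sketch below assumes a trainer built as in the example above, and the checkpoint path is hypothetical:

```python
# Resuming replays the same per-epoch shuffle and continues from the step
# at which the checkpoint was saved.
trainer.train(resume_from_checkpoint="./outputs/checkpoint-500")

# Alternatively, pass True to resume from the most recent checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```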

thank you!