I am using the Seq2SeqTrainer and pass a datasets.arrow_dataset.Dataset as train_dataset when instantiating the object. Is the dataset shuffled at every epoch by default? If not, how can I make it shuffle?
The Seq2SeqTrainer (as well as the standard Trainer) uses a PyTorch Sampler to shuffle the dataset. It reshuffles the dataset at each epoch, and it also groups samples of roughly the same length together. You can find the Sampler definition here.
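To make the behaviour described above concrete, here is a minimal plain-Python sketch of per-epoch shuffling followed by length grouping. It is an illustration only, not the Trainer's actual sampler code: the function names `epoch_order` and `group_by_length`, the `chunk_factor` parameter, and the sample lengths are all hypothetical. In the transformers versions I am aware of, length grouping is opt-in through the `group_by_length` training argument, so treat the grouping step as optional.

```python
import random


def epoch_order(num_samples, seed, epoch):
    """Return a reproducible random permutation of indices for one epoch."""
    # Seeding with seed + epoch gives a different order per epoch
    # while keeping each epoch's order reproducible.
    rng = random.Random(seed + epoch)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices


def group_by_length(indices, lengths, batch_size, chunk_factor=2):
    """Sort by length inside chunks of the shuffled order.

    Samples stay random at the chunk level, but each batch taken from
    the result holds samples of similar length (less padding).
    """
    chunk = batch_size * chunk_factor
    grouped = []
    for start in range(0, len(indices), chunk):
        block = indices[start:start + chunk]
        grouped.extend(sorted(block, key=lambda i: lengths[i], reverse=True))
    return grouped


# Hypothetical token counts for eight training samples.
lengths = [5, 40, 7, 38, 6, 41, 8, 39]

for epoch in range(2):
    order = epoch_order(len(lengths), seed=42, epoch=epoch)
    print(epoch, group_by_length(order, lengths, batch_size=2))
```

Each epoch visits every sample exactly once; only the order, and therefore the batch composition, changes.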
Hi, is there a parameter that controls whether the data gets reshuffled before each epoch, and whether it is grouped by length? Thanks!
Additionally, if training is aborted and I restart from a checkpoint, does the checkpoint store the shuffling order for the current epoch and which data points have not yet been seen in that epoch? Thanks!
Hi Sgugger, why is it bad practice to reshuffle the dataset at every epoch?
I thought reshuffling the dataset at every epoch could reduce overfitting and improve the model's generalization. Shuffling exposes the model to a different sequence of samples in each epoch, which can help prevent it from memorizing the order of the training data and overfitting to specific patterns.
Shuffling the dataset also increases the diversity of the mini-batches during training, which can make the model more robust to outliers or noise in the data.