Non-shuffled training

Hi there,

In order to debug something, I need the training data to be non-shuffled. Can you please tell me how to turn off shuffling?

I am using Trainer from transformers for training and load_dataset from datasets for data loading, both with default arguments.


There is no option to do this natively in the Trainer. You can either make a source install and change the line that creates the training dataloader, or subclass Trainer and override the get_train_dataloader method.


Refer to the thread here: How to ensure the dataset is shuffled for each epoch using Trainer and Datasets? - #3 by lhoestq

Since non-shuffled training is only a minor demand from users, the Trainer class doesn't provide this option, which keeps careless users from disabling shuffling by mistake.

Where should I make the change to get non-shuffled data?

For reference, the way to go is to switch the train sampler. For example, this worked for me with SFTTrainer, and it should be the same for the vanilla Trainer: you inherit from the original class and then use your redefined one.

from torch.utils.data import SequentialSampler
from trl import SFTTrainer

class SFTTrainer2(SFTTrainer):
    # Return a sampler that walks the dataset in its original order
    # instead of the random permutation the default sampler produces.
    def _get_train_sampler(self):
        return SequentialSampler(self.train_dataset)
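
To make the effect concrete, here is a small self-contained PyTorch sketch, separate from the trainer code above, using a toy TensorDataset: a SequentialSampler visits indices 0, 1, 2, … in order, which is exactly what the override swaps in for the default random sampler.

import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Toy dataset holding the values 0..9 so the iteration order is visible.
dataset = TensorDataset(torch.arange(10))

# With a SequentialSampler, the loader visits indices 0, 1, 2, ... in order.
loader = DataLoader(dataset, batch_size=4, sampler=SequentialSampler(dataset))
for (batch,) in loader:
    print(batch)  # tensor([0, 1, 2, 3]), then tensor([4, 5, 6, 7]), then tensor([8, 9])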

Hi,

It would look like so:

from transformers import Trainer
from torch.utils.data import DataLoader

class MyCustomTrainer(Trainer):
    # Build the training dataloader by hand with shuffle=False instead of
    # letting Trainer attach its default random sampler.
    def get_train_dataloader(self) -> DataLoader:
        return DataLoader(
            self.train_dataset,
            shuffle=False,
            batch_size=self.args.train_batch_size,  # from your TrainingArguments
            collate_fn=self.data_collator,
        )

Note that this is just a minimal version for demo purposes; refer to the source code if you want to tweak it further.
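
As a usage sketch, assuming you already have model, training_args, and train_dataset in scope (those three names are placeholders for your own objects, not anything defined above), you can check the order without launching a full run:

# Hypothetical sanity check; model, training_args, and train_dataset are
# placeholders for whatever you would pass to the stock Trainer.
trainer = MyCustomTrainer(model=model, args=training_args,
                          train_dataset=train_dataset)

# Peek at the first batch of the training dataloader: it should line up
# with train_dataset[0], train_dataset[1], ... since shuffling is off.
first_batch = next(iter(trainer.get_train_dataloader()))
print(first_batch)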