Does the masked language model training script randomly shuffle the dataset?

I haven’t seen this explicitly defined in the code, although shuffling is defined in the “no trainer” script. Is the shuffling done internally in datasets, or is it missing from the MLM script that uses the Trainer?
If it’s missing, I can contribute a PR :)

Shuffle is not enabled in the default dataloaders in the trainer. If you want to add a PR, you need to add an argument to the TrainingArguments and update the Trainer dataloaders.

See comment by @sgugger below.

Hi @Emanuel,

I’m not sure about your setup, so could you run the following after the trainer is initialized and copy-and-paste the value here:

trainer.get_train_dataloader().sampler

That is incorrect. The training dataloader is always defined with shuffle=True (more precisely, with a random sampler, because we have to handle distributed training, but that’s the same as not passing a sampler and passing shuffle=True).
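To illustrate that equivalence, here is a minimal sketch in plain PyTorch. It is not the Trainer's actual code; the toy dataset and the variable names are made up for the example, and only the single-process case is shown.

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Toy map-style dataset (hypothetical): anything with __len__ and __getitem__ works.
dataset = list(range(10))

# Option 1: let DataLoader build the sampler itself from shuffle=True.
loader_shuffle = DataLoader(dataset, batch_size=4, shuffle=True)

# Option 2: pass a random sampler explicitly and leave shuffle unset.
loader_sampler = DataLoader(dataset, batch_size=4, sampler=RandomSampler(dataset))

# For comparison: with neither argument, the data is read in order.
loader_default = DataLoader(dataset, batch_size=4)

print(type(loader_shuffle.sampler).__name__)  # RandomSampler
print(type(loader_sampler.sampler).__name__)  # RandomSampler
print(type(loader_default.sampler).__name__)  # SequentialSampler
```

So checking only for shuffle=True in a DataLoader call can be misleading; inspecting the sampler attribute of the resulting dataloader shows whether the data is actually shuffled.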

My bad, I only checked the dataloader definition but missed the sampler.
