- When training a model on the `lvwerra/codeparrot-clean-train` dataset, we don't get these spikes even without shuffling the sequences inside the batches, but we're using a different training setup that is supposed to be more stable and implements some of the ideas you mentioned. You can find the code here: transformers/codeparrot_training.py at main · huggingface/transformers · GitHub. We also use file concatenation and sequence splitting without padding (see the first sketch below).
- As for shuffling a torch `IterableDataset`, you can create a `ShuffledDataset` class to which you pass your `IterableDataset`, as in How to shuffle an iterable dataset - #6 by sharvil - PyTorch Forums. Or use `combinatorics.ShufflerIterDataPipe(IterableDataset, buffer_size)` from `torch.utils.data.datapipes.iter`, which I think is supposed to do the same thing (second sketch below).
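
As a rough illustration of the concatenation-and-splitting idea (not the exact codeparrot_training.py implementation; the helper name and the GPT-2 tokenizer are just placeholders): tokenized files are joined with an EOS separator and the resulting token stream is cut into fixed-length chunks, so no padding is ever needed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def make_sequences(texts, seq_length=1024):
    """Concatenate tokenized files and split them into fixed-length sequences."""
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # mark the file boundary
        # Emit full sequences; leftover tokens carry over to the next file.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

# Every yielded chunk is exactly seq_length tokens, so no padding is required.
files = ["def foo():\n    return 1\n", "print('hello world')\n"]
for seq in make_sequences(files, seq_length=8):
    print(seq)
```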
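And a minimal sketch of the buffer-based `ShuffledDataset` approach from the linked forum post (the class and parameter names here are mine): it fills a fixed-size buffer, then yields a random buffered element each time a new sample arrives:

```python
import random
from torch.utils.data import IterableDataset

class ShuffledDataset(IterableDataset):
    """Approximately shuffle an IterableDataset using a fixed-size buffer."""

    def __init__(self, dataset, buffer_size=10000):
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self):
        buffer = []
        for sample in self.dataset:
            if len(buffer) < self.buffer_size:
                buffer.append(sample)  # fill the buffer first
            else:
                idx = random.randrange(self.buffer_size)
                yield buffer[idx]      # yield a random buffered sample
                buffer[idx] = sample   # and replace it with the new one
        random.shuffle(buffer)         # drain whatever is left at the end
        yield from buffer
```

You'd then wrap it in a `DataLoader` as usual; a larger `buffer_size` gives better shuffling at the cost of memory.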