Streaming Dataset of Sequence Length 2048

  • When training a model on the lvwerra/codeparrot-clean-train dataset, we don’t get these spikes even without shuffling the sequences inside the batches. However, we’re using a different training setup that is supposed to be more stable and implements some of the ideas you mentioned; you can find the code here: transformers/codeparrot_training.py at main · huggingface/transformers · GitHub. We also use file concatenation and sequence splitting without padding (a minimal sketch of that idea follows this list).
  • As for shuffling a torch IterableDataset, you can create a ShuffledDataset class that wraps your IterableDataset, as in How to shuffle an iterable dataset - #6 by sharvil - PyTorch Forums (see the second sketch below). Alternatively, you can use combinatorics.ShufflerIterDataPipe(IterableDataset, buffer_size) from torch.utils.data.datapipes.iter, which I believe does the same thing.
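
Here is a minimal sketch of the concatenation-plus-splitting idea: tokenized documents are appended to a running buffer with an EOS separator, and fixed-length chunks are yielded so no padding is ever needed. The class and field names (ConcatChunkDataset, "content") are illustrative, not the exact code in codeparrot_training.py:

```python
import torch
from torch.utils.data import IterableDataset


class ConcatChunkDataset(IterableDataset):
    """Concatenate tokenized documents and emit fixed-length chunks, no padding."""

    def __init__(self, dataset, tokenizer, seq_length=2048):
        self.dataset = dataset          # iterable of dicts with a "content" field (assumed)
        self.tokenizer = tokenizer
        self.seq_length = seq_length
        self.eos_id = tokenizer.eos_token_id

    def __iter__(self):
        buffer = []
        for example in self.dataset:
            # Append the document's tokens plus an EOS separator to the running buffer.
            buffer.extend(self.tokenizer(example["content"])["input_ids"])
            buffer.append(self.eos_id)
            # Emit full-length chunks; the remainder stays in the buffer,
            # so no sequence ever needs padding.
            while len(buffer) >= self.seq_length:
                chunk, buffer = buffer[: self.seq_length], buffer[self.seq_length :]
                yield torch.tensor(chunk)
```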
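
And a sketch of the buffer-based ShuffledDataset wrapper, roughly following the idea in the linked forums post (names and the default buffer size are illustrative): keep a fixed-size buffer of items and yield a randomly chosen one each step, replacing it with the next incoming item.

```python
import random
from torch.utils.data import IterableDataset


class ShuffledDataset(IterableDataset):
    """Approximate shuffling of an IterableDataset via a fixed-size buffer."""

    def __init__(self, dataset, buffer_size=1000):
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self):
        buffer = []
        for item in self.dataset:
            if len(buffer) < self.buffer_size:
                buffer.append(item)
            else:
                # Yield a random buffered item and stash the new one in its place.
                idx = random.randrange(self.buffer_size)
                yield buffer[idx]
                buffer[idx] = item
        # Drain whatever is left once the source is exhausted.
        random.shuffle(buffer)
        yield from buffer
```

A larger buffer_size gives shuffling closer to a true random permutation at the cost of memory, which is the same trade-off the ShufflerIterDataPipe buffer_size argument controls.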