- When training a model on the `lvwerra/codeparrot-clean-train` dataset, we don't get these spikes even without shuffling the sequences inside the batches, but we're using a different training setup that is supposed to be more stable and implements some of the ideas you mentioned. You can find the code here: transformers/codeparrot_training.py at main · huggingface/transformers · GitHub. We also use file concatenation and sequence splitting without padding (see the first sketch below).
- As for shuffling a torch `IterableDataset`, you can create a `ShuffledDataset` class to which you pass your `IterableDataset`, as in How to shuffle an iterable dataset - #6 by sharvil - PyTorch Forums. Or use `combinatorics.ShufflerIterDataPipe(IterableDataset, buffer_size)` from `torch.utils.data.datapipes.iter`, which I think is supposed to do the same thing (second sketch below).
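
As a rough illustration of the concatenation-and-splitting idea (not the exact codeparrot_training.py implementation; the helper name and the GPT-2 tokenizer are just placeholders): tokenized files are joined with an EOS separator and the resulting token stream is cut into fixed-length chunks, so no padding is ever needed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def make_sequences(texts, seq_length=1024):
    """Concatenate tokenized files and split them into fixed-length sequences."""
    buffer = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # mark the file boundary
        # Emit full sequences; leftover tokens carry over to the next file.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

# Every yielded chunk is exactly seq_length tokens, so no padding is required.
files = ["def foo():\n    return 1\n", "print('hello world')\n"]
for seq in make_sequences(files, seq_length=8):
    print(seq)
```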
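And a minimal sketch of the buffer-based `ShuffledDataset` approach from the linked forum post (the class and parameter names here are mine): it fills a fixed-size buffer, then yields a random buffered element each time a new sample arrives:

```python
import random
from torch.utils.data import IterableDataset

class ShuffledDataset(IterableDataset):
    """Approximately shuffle an IterableDataset using a fixed-size buffer."""

    def __init__(self, dataset, buffer_size=10000):
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self):
        buffer = []
        for sample in self.dataset:
            if len(buffer) < self.buffer_size:
                buffer.append(sample)  # fill the buffer first
            else:
                idx = random.randrange(self.buffer_size)
                yield buffer[idx]      # yield a random buffered sample
                buffer[idx] = sample   # and replace it with the new one
        random.shuffle(buffer)         # drain whatever is left at the end
        yield from buffer
```

You'd then wrap it in a `DataLoader` as usual; a larger `buffer_size` gives better shuffling at the cost of memory.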