Limitations of iterable datasets

Hi @adrienchaton,

I noticed something similar as well: spiky convergence when training on streamed data with an IterableDataset, compared to a non-streamed, map-style local dataset.

It may be worth checking whether using ShufflerIterDataPipe to shuffle the samples fed to the iterable DataLoader helps resolve your issue.

For example something like this:

from torch.utils.data import DataLoader
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe

# Wrap the dataset in a shuffle buffer; by default ShufflerIterDataPipe
# holds up to 10000 samples in memory (tunable via its buffer_size argument).
shuffled_dataset = ShufflerIterDataPipe(your_torch_dataset)

train_dataloader = DataLoader(shuffled_dataset, shuffle=True, batch_size=8)

I have been working through it with the Hugging Face team and documenting my results in this thread: Streaming Dataset of Sequence Length 2048 - #7 by loubnabnl

Hope this helps.

Best.