Limitations of iterable datasets

Hi @mariosasko

I actually have an idea of why the loss behaves differently in streaming and non-streaming mode; it would be great if you could confirm it.
When I train with streaming (i.e. an iterable dataset), the Trainer only sees a single epoch, whose length is the chosen number of training steps.
If that is the case, I am afraid there is no reshuffling of the dataset between epochs during training… am I right?
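
For reference, here is how I understand per-epoch shuffling is supposed to work for an `IterableDataset` (the dataset name and `buffer_size` here are just placeholders):

```python
from datasets import load_dataset

# Streaming returns an IterableDataset with no __len__.
dataset = load_dataset("c4", "en", split="train", streaming=True)

# shuffle() uses a shuffled shard order plus a fixed-size buffer,
# both derived deterministically from the seed.
dataset = dataset.shuffle(seed=42, buffer_size=10_000)

# set_epoch(n) is combined with the seed, so each pass over the
# data can yield a different order.
for epoch in range(3):
    dataset.set_epoch(epoch)
    for example in dataset:
        ...  # training step
```

With the Trainer and `max_steps`, though, there is only one such pass, so as far as I can tell `set_epoch` never gets called with a new value.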

My question is: what is the best way to fix this?
Is there a place where I should configure the length of the dataset, which is known in advance in my case?
Or should I add a callback that fires every `len(dataset) / batch_size` steps to manually reshuffle the dataset? (See the sketch below.)
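
In case it helps, this is roughly the callback I had in mind. `ReshuffleCallback` and `steps_per_epoch` are names I made up, and I am not sure calling `set_epoch` mid-training actually affects the dataloader iterator that is already running:

```python
from transformers import TrainerCallback

class ReshuffleCallback(TrainerCallback):
    """Hypothetical sketch: bump the streaming dataset's epoch every
    `steps_per_epoch` steps, since the Trainer treats the whole run
    as a single epoch when max_steps is set."""

    def __init__(self, dataset, steps_per_epoch):
        self.dataset = dataset                  # streaming IterableDataset
        self.steps_per_epoch = steps_per_epoch  # known length // batch size

    def on_step_begin(self, args, state, control, **kwargs):
        # At each "epoch boundary", advance the dataset's epoch counter
        # so the shard order and shuffle buffer are reseeded.
        if state.global_step > 0 and state.global_step % self.steps_per_epoch == 0:
            self.dataset.set_epoch(state.global_step // self.steps_per_epoch)
```

I would then register it with `trainer.add_callback(ReshuffleCallback(train_dataset, steps_per_epoch))`. Does that sound like the right approach, or is there something built in for this?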