Limitations of iterable datasets

Hi Mario, and thanks for your reply!

I think I am all set with the first two points: I did not observe the code slowing down much whether I pass iterable datasets or datasets loaded with streaming=False.

About the 3rd point, I think I will go with the option of replicating examples as a pre-processing step, which is the easiest. But to clarify, my question was about handling the case where I have a dataset (x1, x2, …, xN) and I would like to train without seeing each x exactly once per epoch. Imagine some samples are harder than others, or belong to under- or over-represented clusters: if I provide weights (p1, p2, …, pN), I could over-sample them by drawing mini-batches according to these probabilities, which would be increased e.g. for harder or less represented examples.
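To make the replication option concrete, here is a minimal sketch of what I have in mind, assuming an in-memory datasets.Dataset and per-example weights; the toy data, the weights and the scaling factor of 10 are only illustrative:

```python
import numpy as np
from datasets import Dataset

# toy dataset and per-example weights (higher weight = sample more often)
ds = Dataset.from_dict({"text": ["a", "b", "c"], "label": [0, 1, 0]})
p = np.array([0.2, 0.6, 0.2])

# turn the weights into integer repeat counts, keeping every example at least once
repeats = np.maximum(1, np.round(p * 10).astype(int))

# replicate indices proportionally to the weights and materialize with .select()
indices = np.repeat(np.arange(len(ds)), repeats)
oversampled = ds.select(indices).shuffle(seed=42)

print(len(ds), "->", len(oversampled))  # 3 -> 10
```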

Still, right now I am having some issues getting equivalent results with and without streaming datasets.

For others who may see this thread: I had issues running the HF Trainer with iterable datasets because at first I had not noticed that HF iterable datasets (returned by load_dataset(…, streaming=True)) are not directly usable by PyTorch, and I need to call dataset = dataset.with_format("torch") after applying map and before passing the dataset to the Trainer.
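For reference, the pattern now looks roughly like this for me (the dataset name, tokenizer and map arguments are just placeholders for my actual setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# stream the dataset instead of downloading and caching it fully
train_ds = load_dataset("imdb", split="train", streaming=True)

# with streaming, map is applied lazily while iterating
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# make the examples PyTorch tensors so the Trainer's DataLoader accepts them
train_ds = train_ds.with_format("torch")
```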

Here are the points I am currently having issues with, in case you have some hints for me:

_ training loss curves decrease smoothly with streaming=False, but with iterable datasets the losses do not converge smoothly and even tend to diverge … I am still debugging and have not identified all possible causes; as far as I can tell, the only difference between streaming=False and streaming=True is that with streaming I cannot use the group_by_length training option, and apart from that I did not notice any other differences … am I missing something specific that I should take care of manually when using iterable datasets with the HF Trainer, e.g. shuffling? (first sketch at the end of this post)

_ to evaluate the model, either during training with e.g. evaluation_strategy="epoch" or at the end of training with e.g. metrics = trainer.evaluate(): I read that there are issues because the length of the evaluation/test datasets should be known in advance … is there a standard way to perform evaluation on iterable datasets, such as callbacks, or should I e.g. use a streaming dataset for training and keep the eval/test splits in memory? (second sketch at the end of this post)

_ our servers have a rather large RAM of 1.5 TB, so I could actually load my datasets in memory, but I observed that parallel runs on very large datasets (e.g. 500M training examples) take a lot of time to initialise training, i.e. when calling trainer.train(); this is actually longer than the dataset preprocessing itself, e.g. tokenization. Again, take this with a pinch of salt: maybe some debugging work on my side will solve it, and any hint would be very appreciated!
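First sketch, about the shuffling question: this is the kind of manual shuffling I am wondering about; buffer_size, seed and the dataset are only example values, and I am not sure yet whether this explains the difference I see:

```python
from datasets import load_dataset

train_ds = load_dataset("imdb", split="train", streaming=True)

# iterable datasets are not globally shuffled: shuffle() only shuffles the
# shard order and samples from a buffer of examples of size buffer_size
train_ds = train_ds.shuffle(seed=42, buffer_size=10_000)

# when iterating over several epochs, set_epoch changes the effective seed
# so that each epoch sees the data in a different order
for epoch in range(3):
    train_ds.set_epoch(epoch)
    for example in train_ds:
        pass  # training step would go here
```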
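Second sketch, about evaluation: this is the setup I had in mind with a streamed train split and an in-memory eval split; as far as I understand, max_steps has to be set because the Trainer cannot infer the length of an iterable train dataset (model name, dataset and argument values are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# train split is streamed, eval split is fully loaded in memory
train_ds = (load_dataset("imdb", split="train", streaming=True)
            .map(tokenize, batched=True, remove_columns=["text"])
            .with_format("torch"))
eval_ds = (load_dataset("imdb", split="test")
           .map(tokenize, batched=True, remove_columns=["text"]))

args = TrainingArguments(
    output_dir="out",
    max_steps=10_000,  # required: the streamed train split has no known length
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
metrics = trainer.evaluate()
```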

best,
A