Limitations of iterable datasets

Hi Mario, and thanks for your reply!

I think I am all set with the first two points: I did not observe the code slowing down much whether I pass iterable datasets or datasets loaded with streaming=False.

About the 3rd point, I think I will go with the option of replicating examples as a pre-processing step, which is the easiest. But to clarify, my question was about handling the case where I have a dataset (x1, x2, …, xN) and I would like to train without seeing each x exactly once per epoch. Imagine some samples are harder than others, or belong to under- or over-represented clusters: if I provide weights (p1, p2, …, pN), I could over-sample them by drawing mini-batches according to these probabilities, which would be increased e.g. for harder or less represented examples.
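To make the replication option concrete, here is a minimal sketch of what I have in mind, assuming an in-memory datasets.Dataset and per-example weights; the toy data, the weights and the scaling factor of 10 are only illustrative:

```python
import numpy as np
from datasets import Dataset

# toy dataset and per-example weights (higher weight = sample more often)
ds = Dataset.from_dict({"text": ["a", "b", "c"], "label": [0, 1, 0]})
p = np.array([0.2, 0.6, 0.2])

# turn the weights into integer repeat counts, keeping every example at least once
repeats = np.maximum(1, np.round(p * 10).astype(int))

# replicate indices proportionally to the weights and materialize with .select()
indices = np.repeat(np.arange(len(ds)), repeats)
oversampled = ds.select(indices).shuffle(seed=42)

print(len(ds), "->", len(oversampled))  # 3 -> 10
```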

Still, right now I am having some issues getting equivalent results with and without streaming datasets.

For others who may see this thread: I had issues running the HF Trainer with iterable datasets because at first I had not noticed that HF iterable datasets (returned by load_dataset(…, streaming=True)) are not directly usable by PyTorch, and I need to call dataset = dataset.with_format("torch") after applying map and before passing the dataset to the Trainer.
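For reference, the pattern now looks roughly like this for me (the dataset name, tokenizer and map arguments are just placeholders for my actual setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# stream the dataset instead of downloading and caching it fully
train_ds = load_dataset("imdb", split="train", streaming=True)

# with streaming, map is applied lazily while iterating
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# make the examples PyTorch tensors so the Trainer's DataLoader accepts them
train_ds = train_ds.with_format("torch")
```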

Here are the points I am currently having issues with, in case you have some hints for me:

_ training loss curves decrease smoothly with streaming=False, but with iterable datasets the losses do not converge smoothly and even tend to diverge … I am still debugging and have not identified all possible causes; as far as I can tell, the only difference between streaming=False and streaming=True is that with streaming I cannot use the group_by_length training option, and apart from that I did not notice any other differences … am I missing something specific that I should take care of manually when using iterable datasets with the HF Trainer, e.g. shuffling? (first sketch at the end of this post)

_ to evaluate the model, either during training with e.g. evaluation_strategy="epoch" or at the end of training with e.g. metrics = trainer.evaluate(): I read that there are issues because the length of the evaluation/test datasets should be known in advance … is there a standard way to perform evaluation on iterable datasets, such as callbacks, or should I e.g. use a streaming dataset for training and keep the eval/test splits in memory? (second sketch at the end of this post)

_ our servers have a rather large RAM of 1.5 TB, so I could actually load my datasets in memory, but I observed that parallel runs on very large datasets (e.g. 500M training examples) take a lot of time to initialise training, i.e. when calling trainer.train(); this is actually longer than the dataset preprocessing itself, e.g. tokenization. Again, take this with a pinch of salt: maybe some debugging work on my side will solve it, and any hint would be very appreciated!
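First sketch, about the shuffling question: this is the kind of manual shuffling I am wondering about; buffer_size, seed and the dataset are only example values, and I am not sure yet whether this explains the difference I see:

```python
from datasets import load_dataset

train_ds = load_dataset("imdb", split="train", streaming=True)

# iterable datasets are not globally shuffled: shuffle() only shuffles the
# shard order and samples from a buffer of examples of size buffer_size
train_ds = train_ds.shuffle(seed=42, buffer_size=10_000)

# when iterating over several epochs, set_epoch changes the effective seed
# so that each epoch sees the data in a different order
for epoch in range(3):
    train_ds.set_epoch(epoch)
    for example in train_ds:
        pass  # training step would go here
```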
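Second sketch, about evaluation: this is the setup I had in mind with a streamed train split and an in-memory eval split; as far as I understand, max_steps has to be set because the Trainer cannot infer the length of an iterable train dataset (model name, dataset and argument values are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# train split is streamed, eval split is fully loaded in memory
train_ds = (load_dataset("imdb", split="train", streaming=True)
            .map(tokenize, batched=True, remove_columns=["text"])
            .with_format("torch"))
eval_ds = (load_dataset("imdb", split="test")
           .map(tokenize, batched=True, remove_columns=["text"]))

args = TrainingArguments(
    output_dir="out",
    max_steps=10_000,  # required: the streamed train split has no known length
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
metrics = trainer.evaluate()
```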

best,
A