Limitations of iterable datasets

Hi @conceptofmind

Thanks for pointing me to your experiments and to some tools that could help.
In the meantime, I wrote custom datasets and data collators for HF/PyTorch that use memory-mapped Arrow tables and tokenize on the fly. This has fixed most of my issues: I now get good convergence and moderate RAM use.

I will try HF’s streaming datasets with ShufflerIterDataPipe and see whether they behave well while reducing RAM use even further!
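
For context, the general technique behind shuffling a stream is a bounded shuffle buffer. The sketch below is a pure-Python illustration of that idea, not the torchdata implementation; `shuffle_buffer` and its parameters are hypothetical names chosen for this example. Memory use is bounded by `buffer_size`, at the cost of only approximate shuffling.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10))
```

A small buffer keeps RAM low but leaves examples close to their original position, so a stream that is sorted or grouped on disk may still need a larger buffer for good convergence.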

Best.