Limitations of iterable datasets

adrienchaton · May 13, 2022, 5:26am

Thanks for pointing me to your experiments and to some tools that could help.
In the meantime I wrote custom datasets and data collators for HF/PyTorch that read from memory-mapped Arrow tables and tokenize on the fly. This has fixed most of my issues: convergence is good and RAM use is moderate.
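
The post does not include the code, so the following is only a rough sketch of the general pattern (lazy reads from memory-mapped storage, tokenization deferred to the collator), not the author's implementation. To keep it dependency-free it swaps in Python's standard `mmap` module over a plain text file as a stand-in for memory-mapped Arrow tables, and a whitespace split as a stand-in for a real tokenizer. The names `MmapLineDataset`, `collate`, and the toy `vocab` are hypothetical.

```python
import mmap
import tempfile


class MmapLineDataset:
    """Map-style dataset over a memory-mapped text file, one example per line.

    Stand-in for a memory-mapped Arrow table: the OS pages data in on demand,
    so only the line offsets are held in Python memory.
    Assumes a non-empty file and non-negative indices.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Record the byte offset at which every line starts.
        self._offsets = [0]
        pos = self._mm.find(b"\n")
        while pos != -1:
            self._offsets.append(pos + 1)
            pos = self._mm.find(b"\n", pos + 1)
        if self._offsets[-1] >= len(self._mm):
            self._offsets.pop()  # the file ended with a newline

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, idx):
        start = self._offsets[idx]
        if idx + 1 < len(self._offsets):
            end = self._offsets[idx + 1]
        else:
            end = len(self._mm)
        # Return raw text; tokenization happens later, in the collator.
        return self._mm[start:end].decode("utf-8").rstrip("\n")

    def close(self):
        self._mm.close()
        self._file.close()


def collate(batch, vocab, pad_id=0, unk_id=1, max_len=8):
    """Tokenize raw strings at batch time and pad to the longest sequence."""
    ids = [[vocab.get(tok, unk_id) for tok in text.split()][:max_len]
           for text in batch]
    width = max(len(seq) for seq in ids)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Tiny demo on a temporary file (toy data, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("the cat sat\nthe dog\n")

dataset = MmapLineDataset(tmp.name)
vocab = {"the": 2, "cat": 3, "sat": 4}
batch = collate([dataset[0], dataset[1]], vocab)
print(batch["input_ids"])  # [[2, 3, 4], [2, 1, 0]]
```

Because nothing is tokenized ahead of time, no second tokenized copy of the corpus has to be stored; the cost is that tokenization runs again on every epoch.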

I will try using HF’s streaming datasets with ShufflerIterDataPipe and see whether that behaves well while reducing RAM use even further!
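
For context, shufflers for streamed data generally work with a bounded buffer, since a stream cannot be indexed at random. The sketch below illustrates that idea in plain Python; it is not the code of `ShufflerIterDataPipe` or of 🤗 Datasets, and the name `shuffle_buffer` is hypothetical.

```python
import random


def shuffle_buffer(iterable, buffer_size, seed=None):
    """Approximately shuffle a stream while holding at most `buffer_size` items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


stream = range(20)
print(list(shuffle_buffer(stream, buffer_size=5, seed=0)))
```

Memory use is bounded by `buffer_size`, but so is the quality of the shuffle: an item can never be emitted before the item `buffer_size` positions ahead of it has been read. A small buffer over sorted or grouped data therefore yields poorly mixed batches, which is one way streaming can affect convergence.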

Best.
