For training, we load multiple datasets (each saved as 1000+ Arrow files of 1 GB+ each, written with the datasets library) with load_dataset(…, streaming=True) and then interleave them together with interleave_datasets.
Even with streaming datasets we are seeing OOM at the interleave_datasets step. Note that no data has been read from the datasets yet. Sharding the initial datasets per GPU before interleaving alleviated the memory pressure.
Why should there be such memory issues with streaming datasets?
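For context, a minimal sketch of the pipeline (the dataset names, rank/world_size, and the use of split_dataset_by_node for the per-GPU sharding are assumptions, not our exact code):

```python
from datasets import load_dataset, interleave_datasets
from datasets.distributed import split_dataset_by_node

# Hypothetical names standing in for the 15 corpora described above.
sources = [f"my_org/corpus_{i}" for i in range(15)]

# streaming=True returns IterableDataset objects, so nothing is downloaded or
# decoded here; records are only read when the dataset is iterated.
streamed = [load_dataset(src, split="train", streaming=True) for src in sources]

# This is the step where the OOM was observed, even though no data had been read.
mixed = interleave_datasets(streamed)

# Workaround mentioned above: shard each dataset per GPU rank *before* interleaving.
# split_dataset_by_node is one way to do this; rank/world_size would normally come
# from torch.distributed.
rank, world_size = 0, 8
sharded = [split_dataset_by_node(ds, rank=rank, world_size=world_size) for ds in streamed]
mixed_per_rank = interleave_datasets(sharded)
```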
Is the dataset file itself too large…?
If it seems like a bug, let's ping lhoestq.
With each load_dataset() call we load ~1000 Arrow files of ~1 GB each; 15 such datasets are interleaved together.
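For reference, loading one such dataset directly from its Arrow shards in streaming mode might look roughly like this (the "arrow" builder and the directory/glob layout are assumptions about how the files were saved):

```python
from datasets import load_dataset

# Hypothetical layout: one directory of ~1000 Arrow shards (~1 GB each) per dataset.
# The "arrow" packaged builder reads .arrow files directly; with streaming=True the
# shards are only opened lazily as they are iterated.
ds = load_dataset(
    "arrow",
    data_files="/data/corpus_0/*.arrow",
    split="train",
    streaming=True,
)
```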
If the iterable datasets have unknown feature types, interleave_datasets streams the first rows of each dataset to check the types and see whether the datasets can be interleaved without issues. But this still uses little memory (at most one Arrow record batch per dataset), so I can't tell whether it is the source of your OOM.
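One way to rule this step out (sketch only; the schema and file layout below are hypothetical) is to pass explicit features to load_dataset, so the feature types are already known before interleaving:

```python
from datasets import load_dataset, interleave_datasets, Features, Value

# Hypothetical schema; with explicit features the datasets already have known
# feature types, so interleave_datasets should not need to stream the first rows
# of each dataset to infer them.
features = Features({"text": Value("string"), "source": Value("string")})

streamed = [
    load_dataset(
        "arrow",
        data_files=f"/data/corpus_{i}/*.arrow",
        split="train",
        streaming=True,
        features=features,
    )
    for i in range(15)
]

mixed = interleave_datasets(streamed)
```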
Is HF datasets recommended (streaming or otherwise) for training on datasets at this scale?