For training, we load multiple datasets (each saved as 1000+ Arrow files of 1 GB+ each, written with the datasets library) with load_dataset(…, streaming=True) and then interleave them together with interleave_datasets.
Even with streaming datasets we are seeing OOM at the interleave_datasets step. Note that no data has been read from the datasets yet. Sharding the initial datasets per GPU before interleaving alleviated the memory pressure.
Why should there be such memory issues with streaming datasets?
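For context, a minimal sketch of the pipeline (the dataset names, rank/world_size, and the use of split_dataset_by_node for the per-GPU sharding are assumptions, not our exact code):

```python
from datasets import load_dataset, interleave_datasets
from datasets.distributed import split_dataset_by_node

# Hypothetical names standing in for the 15 corpora described above.
sources = [f"my_org/corpus_{i}" for i in range(15)]

# streaming=True returns IterableDataset objects, so nothing is downloaded or
# decoded here; records are only read when the dataset is iterated.
streamed = [load_dataset(src, split="train", streaming=True) for src in sources]

# This is the step where the OOM was observed, even though no data had been read.
mixed = interleave_datasets(streamed)

# Workaround mentioned above: shard each dataset per GPU rank *before* interleaving.
# split_dataset_by_node is one way to do this; rank/world_size would normally come
# from torch.distributed.
rank, world_size = 0, 8
sharded = [split_dataset_by_node(ds, rank=rank, world_size=world_size) for ds in streamed]
mixed_per_rank = interleave_datasets(sharded)
```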
Is the dataset file itself too large…?
If it seems like a bug, let's ping lhoestq.
With each load_dataset() call we load ~1000 Arrow files of ~1 GB each; 15 such datasets are interleaved together.
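For reference, loading one such dataset directly from its Arrow shards in streaming mode might look roughly like this (the "arrow" builder and the directory/glob layout are assumptions about how the files were saved):

```python
from datasets import load_dataset

# Hypothetical layout: one directory of ~1000 Arrow shards (~1 GB each) per dataset.
# The "arrow" packaged builder reads .arrow files directly; with streaming=True the
# shards are only opened lazily as they are iterated.
ds = load_dataset(
    "arrow",
    data_files="/data/corpus_0/*.arrow",
    split="train",
    streaming=True,
)
```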
If the iterable datasets have unknown feature types, interleave_datasets streams the first rows of each dataset to check the types and see whether the datasets can be interleaved without issues. But this still uses little memory (at most one Arrow record batch per dataset), so I can't tell whether it is the source of your OOM.
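One way to rule this step out (sketch only; the schema and file layout below are hypothetical) is to pass explicit features to load_dataset, so the feature types are already known before interleaving:

```python
from datasets import load_dataset, interleave_datasets, Features, Value

# Hypothetical schema; with explicit features the datasets already have known
# feature types, so interleave_datasets should not need to stream the first rows
# of each dataset to infer them.
features = Features({"text": Value("string"), "source": Value("string")})

streamed = [
    load_dataset(
        "arrow",
        data_files=f"/data/corpus_{i}/*.arrow",
        split="train",
        streaming=True,
        features=features,
    )
    for i in range(15)
]

mixed = interleave_datasets(streamed)
```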
Is HF datasets recommended (streaming or otherwise) for training on datasets at this scale?