Hi all, I have been using IterableDataset to load a very large collection of .arrow shards (about 80k of them, each around 500 MB). I load them with d = load_dataset("arrow", data_files=xxx, streaming=True), and then shuffle with a buffer size of 10_000.
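For reference, here is a minimal sketch of my setup (the glob pattern and seed are just illustrative; my actual data_files expands to ~80k shards):

```python
from datasets import load_dataset

# Illustrative path; in reality this resolves to ~80k .arrow shards (~500 MB each)
data_files = "path/to/shards/*.arrow"

# Stream the shards instead of materializing the whole dataset in memory
d = load_dataset("arrow", data_files=data_files, split="train", streaming=True)

# Approximate shuffling using a fixed-size buffer (seed chosen arbitrarily)
d = d.shuffle(buffer_size=10_000, seed=42)
```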
However, this causes a "too many open files" OS error when I use it for training. How is that possible? My understanding was that streaming reads the shards lazily rather than opening them all at once, which should avoid the too-many-open-files issue.