β€œtoo many open files” despite streaming with IterableDataset

Hi all, I have been using `IterableDataset` to load a very large collection of .arrow shards (~8k files per GPU across 24 GPUs, each file about 1 GB). I load them with `d = load_dataset("arrow", data_files=xxx, streaming=True)`.
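For context, the loading setup looks roughly like this (a minimal sketch; the shard paths below are placeholders standing in for my real per-GPU file lists):

```python
from datasets import load_dataset

# Placeholder paths standing in for ~8k real .arrow shards per GPU.
shard_paths = [f"/data/shards/shard_{i:05d}.arrow" for i in range(8000)]

d = load_dataset(
    "arrow",
    data_files=shard_paths,
    split="train",
    streaming=True,  # yields an IterableDataset; shards are read lazily
)
```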

However, this raises a "too many open files" OS error during training. How can that be? My understanding was that streaming reads shards lazily instead of opening all files at once, which should avoid the "too many open files" issue in the first place.
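For what it's worth, raising the soft file-descriptor limit only postpones the error rather than explaining it. A minimal sketch using the standard-library `resource` module (the 65536 target is just an example value):

```python
import resource

# Check the current per-process file-descriptor limits and raise the
# soft limit toward the hard limit (65536 is an illustrative target).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))
```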

(Apologies for the near-duplicate post. I thought I had fixed the problem, but it turns out I was only testing on a smaller subset of the data, which is why it appeared solved. Since the original thread is now locked, I'm reposting the question here. Thanks for understanding!)


One possible cause is that a very large number of parallel processes (e.g., DataLoader workers) are each opening their own file handles on the shards?
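If that's the cause, capping the number of workers should cap the number of simultaneously open handles. A minimal sketch, assuming the training loop uses a PyTorch `DataLoader` over the streaming dataset `d` from above (`batch_size` and `num_workers` are illustrative values):

```python
from torch.utils.data import DataLoader

# With num_workers > 0, each worker process opens its own handles on
# the shards it reads, so fewer workers means fewer concurrent open files.
loader = DataLoader(d, batch_size=32, num_workers=2)

for batch in loader:
    ...  # training step
```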
