Hi all, I have been using IterableDataset to load a very large collection of .arrow shards (about 80k of them, each around 500 MB). I load them with d = load_dataset("arrow", data_files=xxx, streaming=True), and then shuffle with a buffer size of 10_000.
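For reference, here is a minimal sketch of my setup (the glob pattern and seed are just illustrative; my actual data_files expands to ~80k shards):

```python
from datasets import load_dataset

# Illustrative path; in reality this resolves to ~80k .arrow shards (~500 MB each)
data_files = "path/to/shards/*.arrow"

# Stream the shards instead of materializing the whole dataset in memory
d = load_dataset("arrow", data_files=data_files, split="train", streaming=True)

# Approximate shuffling using a fixed-size buffer (seed chosen arbitrarily)
d = d.shuffle(buffer_size=10_000, seed=42)
```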
However, this causes a "too many open files" OS error when I use it for training. How is that possible? My understanding was that streaming reads the shards lazily rather than opening them all at once, which should avoid the too-many-open-files issue.