Too many open files on big datasets

Hi all, I have a pretty large dataset that I split into many partitions using datasets, with each partition consisting of about 200 files of roughly 1 GB each. The issue I run into is that when I try to load_from_disk all of these partitions together, I get a "too many open files" error, and I can't increase my file limit.

My question is: other than regenerating my entire dataset and storing it in bigger chunks of around 50 GB (which would take a long time), what would be a good way to solve this problem?


It seems to be a Linux-specific problem. I’m a Windows user, so I’m not sure, but there seem to be a couple of workarounds.
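One of them, if I understand correctly, is that a process can raise its own soft limit on open file descriptors up to the hard limit without root, as long as the hard limit is actually higher (which may not be the case on a managed cluster). A minimal sketch using Python's standard resource module (Linux only):

import resource

# Check the current soft and hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current limits: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit; no root needed
# as long as we don't try to go above the hard limit itself
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))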


This is helpful, but unfortunately I can’t use the solutions there since I don’t have root access. After much thought, I guess the only two solutions are:

  1. Remake my dataset
  2. Load random subsets of my dataset for each epoch (see the sketch after this list)
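For option 2, I'm thinking of something along these lines: keep each partition as its own saved dataset and only memory-map a random sample of them per epoch, so only that many Arrow files are open at once. This is just a rough sketch, and the partitions/part_i paths are placeholders for however the partition directories are actually named:

import random
from datasets import concatenate_datasets, load_from_disk

# Hypothetical layout: one saved dataset per partition directory
partition_dirs = [f"partitions/part_{i}" for i in range(100)]

# Each epoch, load only a random subset of the partitions
sampled = random.sample(partition_dirs, k=10)
epoch_ds = concatenate_datasets([load_from_disk(d) for d in sampled])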

I see. Even in that case, there seems to be an easier way. Not sure if it would work, though…

from datasets import load_from_disk

ds = load_from_disk('path/to/dataset/directory', keep_in_memory=True)  # RAM consumption looks terrible... but won't disk access decrease?