Too many open files on big datasets

Hi all, I have a pretty large dataset that I split into many partitions using datasets, with each partition consisting of about 200 files of roughly 1 GB each. The issue I run into is that when I try to load_from_disk all of these partitions together, I get a "too many open files" error, and I can't increase my file limit.

My question is: other than regenerating my entire dataset and storing it in bigger chunks of around 50 GB (which would take a long time), what would be a good way to solve this problem?


It seems to be a Linux-specific problem. I’m a Windows user, so I’m not sure, but there seem to be a couple of workarounds.
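One of them, if I understand correctly, is that a process can raise its own soft limit on open file descriptors up to the hard limit without root, as long as the hard limit is actually higher (which may not be the case on a managed cluster). A minimal sketch using Python's standard resource module (Linux only):

import resource

# Check the current soft and hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current limits: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit; no root needed
# as long as we don't try to go above the hard limit itself
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))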


This is helpful, but unfortunately I can’t use the solutions there since I don’t have root access. After much thought, I guess the only two solutions are:

  1. Remake my dataset
  2. Load random subsets of my dataset for each epoch (see the sketch after this list)
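For option 2, I'm thinking of something along these lines: keep each partition as its own saved dataset and only memory-map a random sample of them per epoch, so only that many Arrow files are open at once. This is just a rough sketch, and the partitions/part_i paths are placeholders for however the partition directories are actually named:

import random
from datasets import concatenate_datasets, load_from_disk

# Hypothetical layout: one saved dataset per partition directory
partition_dirs = [f"partitions/part_{i}" for i in range(100)]

# Each epoch, load only a random subset of the partitions
sampled = random.sample(partition_dirs, k=10)
epoch_ds = concatenate_datasets([load_from_disk(d) for d in sampled])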

I see. Even in that case, there seems to be an easier way. Not sure if it would work, though…

from datasets import load_from_disk

ds = load_from_disk('path/to/dataset/directory', keep_in_memory=True)  # RAM consumption looks terrible... but won't disk access decrease?