Hi all, I have a pretty large dataset which I split into many partitions, each partition consisting of about 200 files of roughly 1G each, saved with the datasets library. The issue I run into is that when I try to load_from_disk all these partitions together, I get a “too many open files” error, and I can’t increase my file limit.
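For reference, this is roughly the loading pattern that triggers the error (the paths, partition count, and file layout here are just placeholders, not my actual setup):

```python
# Minimal sketch of the loading pattern that hits "too many open files".
# Paths and partition count are hypothetical placeholders.
from datasets import load_from_disk, concatenate_datasets

partition_dirs = [f"/data/my_dataset/partition_{i}" for i in range(100)]

# Each load_from_disk memory-maps every Arrow file in the partition,
# so the number of open file descriptors grows with (partitions x files per partition).
parts = [load_from_disk(d) for d in partition_dirs]
full = concatenate_datasets(parts)  # the OSError shows up around here
```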
My question is: other than reprocessing my entire dataset and storing it in bigger chunks, e.g. 50G each (which would take a long time), what would be a good way to solve this problem?
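Just to be concrete, this is roughly what the “bigger chunks” re-save I’d like to avoid would look like, done one partition at a time (assuming a datasets version whose save_to_disk accepts max_shard_size; paths are placeholders):

```python
# Hedged sketch of re-saving a partition into fewer, larger Arrow shards,
# so far fewer files need to be open at load time.
# Assumes a recent datasets version where save_to_disk supports max_shard_size.
from datasets import load_from_disk

src = "/data/my_dataset/partition_0"            # hypothetical source partition
dst = "/data/my_dataset_rechunked/partition_0"  # hypothetical destination

ds = load_from_disk(src)                 # loading a single partition still works fine
ds.save_to_disk(dst, max_shard_size="50GB")  # ~one 50G shard instead of ~200 x 1G files
```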
This is helpful, but unfortunately I can’t use the solutions there since I don’t have root access. After much thought, I guess the only two solutions are: