I'm trying to load the Common Voice dataset and I'm running into OSError: [Errno 24] Too many open files.
There's only one line of code: ds = datasets.load_dataset("common_voice", "en", split="train+validation", version="6.1.0", cache_dir="gcs-data/common-voice"), but it might be worth mentioning that cache_dir is a mounted cloud storage path.
The error occurs when the dataset build finalizes and the temporary storage folder containing the Arrow tables is renamed.
I'm running Ubuntu with 32 GB of RAM. ulimit -S and ulimit -H are both unlimited.
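For completeness, here's the sanity check I also ran from inside Python using the standard resource module (just a sketch of how I checked the per-process open-file limit, which is the limit Errno 24 usually refers to):

import resource

# Errno 24 typically points at the per-process open-file limit (ulimit -n),
# which can still be low (often 1024) even when other ulimits are unlimited.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit for this process, if needed.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))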
Hi! Can you post the full stack trace please?
You can also try with a smaller configuration of Common Voice, like "ab" instead of "en", to investigate.
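For example, something like this (same call as yours with the mounted cache_dir, just swapping in the much smaller "ab" config to see whether the error still reproduces):

import datasets

# Tiny config, useful only to check whether the rename step fails on the mount.
ds_ab = datasets.load_dataset(
    "common_voice",
    "ab",
    split="train+validation",
    cache_dir="gcs-data/common-voice",
)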
I haven't seen this kind of issue before; it might happen because this uses a mounted GCS bucket.
I think you might need to load the dataset on your local disk first, and then move it to your mounted GCS bucket with my_dataset.save_to_disk("path/to/gcs/bucket"). Later you can reload it with:
from datasets import load_from_disk
my_dataset = load_from_disk("path/to/gcs/bucket")
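Putting it together, the workaround could look roughly like this (a sketch; the local cache path is a placeholder, and gcs-data/common-voice is the mount point from your snippet):

import datasets
from datasets import load_from_disk

# 1. Prepare the dataset on local disk first, so the temporary Arrow files
#    are written and renamed on a regular filesystem.
ds = datasets.load_dataset(
    "common_voice",
    "en",
    split="train+validation",
    cache_dir="/local/cache/common-voice",  # placeholder local path
)

# 2. Copy the prepared dataset onto the mounted GCS bucket.
ds.save_to_disk("gcs-data/common-voice/en")

# 3. In later runs, reload it straight from the mount.
my_dataset = load_from_disk("gcs-data/common-voice/en")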