I'm trying to load the Common Voice dataset and I'm coming across OSError: [Errno 24] Too many open files.
There's only one line of code:
ds = datasets.load_dataset("common_voice", "en", split="train+validation", version="6.1.0", cache_dir="gcs-data/common-voice")
but it might be worth mentioning that cache_dir is a mounted cloud storage path.
The error occurs when the dataset finalizes and the temporary storage folder containing the arrow tables is renamed.
I'm running Ubuntu with 32 GB of RAM. ulimit -S and ulimit -H are both unlimited.
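One caveat on those numbers: in bash, ulimit -S and ulimit -H with no resource flag report the file-size limit (-f), not the open-file limit (-n), so "unlimited" there may not say anything about file descriptors. Below is a minimal sketch for checking the descriptor limit from inside the Python process with the standard-library resource module (Unix only). The helper names open_file_limits and raise_soft_limit are made up for illustration, and whether raising the limit actually helps with a mounted bucket is an assumption, not something confirmed here.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft open-file limit to `target`, capped at the hard limit.

    Hypothetical helper: it only ever raises the soft limit and never
    touches the hard limit. Returns the (soft, hard) pair afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Cap the request at the hard limit unless the hard limit is unlimited.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only change anything if the soft limit is finite and below the target.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open-file limits (soft, hard):", open_file_limits())
```

Calling raise_soft_limit(4096) before datasets.load_dataset(...) would be one way to test whether the descriptor limit is the bottleneck; the value 4096 is only an example.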
Thanks in advance!