"Too many open files" when loading Common Voice

I’m trying to load the Common Voice dataset and I’m running into OSError: [Errno 24] Too many open files.

It’s a single line of code:

ds = datasets.load_dataset("common_voice", "en", split="train+validation", version="6.1.0", cache_dir="gcs-data/common-voice")

It’s probably worth mentioning that cache_dir points to a mounted cloud storage path.

The error occurs during finalization, when the temporary folder holding the Arrow tables is renamed.

I’m running Ubuntu with 32 GB of RAM. Both ulimit -S and ulimit -H report unlimited.
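
For what it’s worth, here’s how to check the limit the Python process itself sees (just the standard resource module; the shell’s ulimit doesn’t always match):

import resource

# (soft, hard) per-process open-file limits; the soft limit is what
# trips Errno 24, and it can differ from what the shell reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# The soft limit can be raised up to the hard limit for this process:
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))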

Thanks in advance!

Hi! Could you post the full stack trace, please?
You can also try a smaller configuration of Common Voice, like “ab” instead of “en”, to investigate.

Sorry for the late reply!

This is the traceback I’m getting:

Downloading: 168kB [00:00, 73.5MB/s]                                                                                                                                                                   
Downloading and preparing dataset common_voice/ab (download: 39.14 MiB, generated: 40.14 MiB, post-processed: Unknown size, total: 79.28 MiB) to gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd...
Downloading: 100%|██████████| 41.0M/41.0M [00:04<00:00, 10.2MB/s]
Traceback (most recent call last):      
  File "download_common_voice.py", line 2, in <module>
    ds = datasets.load_dataset("common_voice", "ab", split="train+validation", version="6.1.0", cache_dir="gcs-data/dummy-cv-folder")
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1699, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 603, in download_and_prepare
    self._save_info()
  File "/opt/conda/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 557, in incomplete_dir
    os.rename(tmp_dir, dirname)
OSError: [Errno 24] Too many open files: 'gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd.incomplete' -> 'gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd'

My datasets version is 1.18.1.

I’ve not seen this kind of issue before; it might happen because this uses a mounted GCS bucket.
I think you need to load the dataset on your local disk first, and then move it to your mounted GCS bucket with my_dataset.save_to_disk("path/to/gcs/bucket"). Later you can reload it with

from datasets import load_from_disk

my_dataset = load_from_disk("path/to/gcs/bucket")
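
Putting the whole workaround together (a sketch; the local cache path is just a placeholder):

from datasets import load_dataset, load_from_disk

# 1. Download and prepare on local disk, where the final rename
#    and the many small Arrow files are handled natively.
ds = load_dataset(
    "common_voice",
    "ab",
    split="train+validation",
    cache_dir="/local/hf-cache",  # placeholder local path
)

# 2. Write the prepared dataset to the mounted GCS bucket.
ds.save_to_disk("path/to/gcs/bucket")

# 3. Later, reload it directly from the bucket.
ds = load_from_disk("path/to/gcs/bucket")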

Yip, that worked. Copying (cp) the temporary Arrow files to a different folder before the cleanup step (and the exception) worked too.
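
For reference, the manual copy was roughly equivalent to this in Python (the source path is the one from the traceback above; the destination is just a placeholder):

import shutil

# Copy the .incomplete temp folder off the mounted bucket before the
# cleanup/rename step raises; the Arrow files inside are already
# fully written at that point.
shutil.copytree(
    "gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/"
    "5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd.incomplete",
    "local-copy",  # placeholder destination on local disk
)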