"Too many open files" when loading Common Voice

Ollie · January 27, 2022, 10:37pm

I’m trying to load the Common Voice dataset and I’m coming across OSError: [Errno 24] Too many open files.

The script is a single line: ds = datasets.load_dataset("common_voice", "en", split="train+validation", version="6.1.0", cache_dir="gcs-data/common-voice"). It may be relevant that cache_dir points to a mounted cloud storage path.

The error occurs when the dataset finalizes and the temporary storage folder containing the arrow tables is renamed.

I’m running Ubuntu with 32GB of RAM. ulimit -S and ulimit -H are both unlimited.
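One thing worth double-checking here: in bash, ulimit -S and ulimit -H without a resource flag report the file-size limit (-f), while the limit on open files is ulimit -n. A minimal sketch for reading the open-file limit from inside Python, assuming a Unix system (the helper names are illustrative, not part of datasets):

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to the word 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = open_file_limits()
print(f"open files: soft={describe(soft)}, hard={describe(hard)}")
```

If the soft value is small, the limit that matters for Errno 24 may be lower than the ulimit output above suggests. That said, the error here could also come from the mounted filesystem itself rather than from the process limit.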

Thanks in advance!

lhoestq · January 31, 2022, 4:04pm

Hi! Can you post the full stack trace, please?
You can also try a smaller configuration of Common Voice, like "ab" instead of "en", to investigate.

Ollie · February 6, 2022, 6:47am

Sorry for the late reply!

This is the traceback I’m getting:

Downloading: 168kB [00:00, 73.5MB/s]                                                                                                                                                                   
Downloading and preparing dataset common_voice/ab (download: 39.14 MiB, generated: 40.14 MiB, post-processed: Unknown size, total: 79.28 MiB) to gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd...
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41.0M/41.0M [00:04<00:00, 10.2MB/s]
Traceback (most recent call last):      
  File "download_common_voice.py", line 2, in <module>
    ds = datasets.load_dataset("common_voice", "ab", split="train+validation", version="6.1.0", cache_dir="gcs-data/dummy-cv-folder")
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1699, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 603, in download_and_prepare
    self._save_info()
  File "/opt/conda/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 557, in incomplete_dir
    os.rename(tmp_dir, dirname)
OSError: [Errno 24] Too many open files: 'gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd.incomplete' -> 'gcs-data/dummy-cv-folder/common_voice/ab/6.1.0/5693bfc0feeade582a78c2fb250bc88f52bd86f0a7f1bb22bfee67e715de30fd'

My datasets version is 1.18.1.

lhoestq · February 7, 2022, 8:50pm

I haven’t seen this kind of issue before; it might happen because the cache directory is on a mounted GCS bucket.
I think you need to load the dataset on your local disk first, and then move it to your mounted GCS bucket with my_dataset.save_to_disk("path/to/gcs/bucket"). Later you can reload it with:

from datasets import load_from_disk

my_dataset = load_from_disk("path/to/gcs/bucket")
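Put together, the whole workaround might look like the sketch below. It assumes datasets 1.18.1 as used in this thread (newer releases may no longer load the common_voice script), and the function name and directory arguments are hypothetical:

```python
def prepare_locally_then_export(local_cache_dir, bucket_dir, config="ab"):
    """Build Common Voice on local disk, then write the result to the mount."""
    # Imported inside the function so the import cost is only paid on use.
    from datasets import load_dataset, load_from_disk

    # Step 1: download and prepare under a local cache directory, so the
    # final rename of the ".incomplete" folder happens on a local filesystem.
    ds = load_dataset(
        "common_voice",
        config,
        split="train+validation",
        version="6.1.0",
        cache_dir=local_cache_dir,
    )

    # Step 2: write the prepared dataset to the mounted bucket.
    ds.save_to_disk(bucket_dir)

    # Step 3: reload from the bucket to confirm the export is readable.
    return load_from_disk(bucket_dir)
```

For example, prepare_locally_then_export("/tmp/cv-cache", "path/to/gcs/bucket") would keep the cache on local disk and only touch the bucket in steps 2 and 3.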

Ollie · February 8, 2022, 11:56am

Yip, that worked. Copying the temporary Arrow files to a different folder before the cleanup (and the exception) also worked.
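The post does not say how the copy was triggered, so the following is only a rough sketch of the copying step itself, using the standard library. The function name and both paths are hypothetical; the source folder would be the one ending in .incomplete from the traceback:

```python
import shutil
from pathlib import Path


def rescue_incomplete_dir(incomplete_dir, rescue_dir):
    """Copy the temporary ".incomplete" folder to rescue_dir, file by file.

    rescue_dir must not exist yet; shutil.copytree creates it.
    Returns the relative paths of the copied files, sorted.
    """
    src = Path(incomplete_dir)
    dst = Path(rescue_dir)
    # copytree copies the contents and never renames the source folder,
    # so the source stays in place for the library's own cleanup.
    shutil.copytree(src, dst)
    return sorted(str(p.relative_to(dst)) for p in dst.rglob("*") if p.is_file())
```

Because this copies instead of renaming, it avoids the os.rename call that raised the error, at the cost of temporarily storing the files twice.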
