β€œtoo many open files” despite streaming with IterableDataset

Hi all, I have been using `IterableDataset` to load a very large collection of .arrow shards (~8k files per GPU across 24 GPUs, each file about 1 GB). I load them with `d = load_dataset("arrow", data_files=xxx, streaming=True)`.
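For context, the loading setup looks roughly like this (a minimal sketch; the shard paths below are placeholders standing in for my real per-GPU file lists):

```python
from datasets import load_dataset

# Placeholder paths standing in for ~8k real .arrow shards per GPU.
shard_paths = [f"/data/shards/shard_{i:05d}.arrow" for i in range(8000)]

d = load_dataset(
    "arrow",
    data_files=shard_paths,
    split="train",
    streaming=True,  # yields an IterableDataset; shards are read lazily
)
```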

However, this raises a "too many open files" OS error during training. How can that be? My understanding was that streaming reads shards lazily instead of opening all files at once, which should avoid the "too many open files" issue in the first place.
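For what it's worth, raising the soft file-descriptor limit only postpones the error rather than explaining it. A minimal sketch using the standard-library `resource` module (the 65536 target is just an example value):

```python
import resource

# Check the current per-process file-descriptor limits and raise the
# soft limit toward the hard limit (65536 is an illustrative target).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))
```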

(Apologies for the near-duplicate post. I thought I had fixed the problem, but it turns out I was only testing on a smaller subset of the data, which is why it appeared solved. Since the original thread is now locked, I'm reposting the question here. Thanks for understanding!)


One possible cause is that a very large number of parallel processes (e.g., DataLoader workers) are each opening their own file handles on the shards?
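If that's the cause, capping the number of workers should cap the number of simultaneously open handles. A minimal sketch, assuming the training loop uses a PyTorch `DataLoader` over the streaming dataset `d` from above (`batch_size` and `num_workers` are illustrative values):

```python
from torch.utils.data import DataLoader

# With num_workers > 0, each worker process opens its own handles on
# the shards it reads, so fewer workers means fewer concurrent open files.
loader = DataLoader(d, batch_size=32, num_workers=2)

for batch in loader:
    ...  # training step
```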
