load_dataset hangs with local files

I’m trying to load a local dataset with load_dataset, but the call never returns; it hangs forever. Here are the details:

Environment:

Python 3.9.12 (main, Apr  5 2022, 01:53:17)
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin

conda 22.9.0

datasets==2.7.0

Local data files

The contents of the data folder are:

./dataset/disaster (relative directory)
    train.csv
    validation.csv
    test.csv

Python code

Using the following Python code in test_local_load.py:

from datasets import load_dataset

try:
    disaster = load_dataset("./dataset/disaster/")
    print(f"the type {type(disaster)}\n{disaster}")
except Exception as e1:
    # Print the exception itself so failures are visible, not just a generic message
    print(f"First approach failed: {e1}")

Output

I get the following output:

python test_local_load.py
Using custom data configuration disaster-9428d3f8c9e1b41b
Downloading and preparing dataset csv/disaster to /Users/fordaz/.cache/huggingface/datasets/csv/disaster-9428d3f8c9e1b41b/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...
Downloading data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 7772.03it/s]
Using custom data configuration disaster-9428d3f8c9e1b41b
Using custom data configuration disaster-9428d3f8c9e1b41b
Using custom data configuration disaster-9428d3f8c9e1b41b

The problem is that after these messages, it hangs forever.

Any hints will be appreciated, thanks in advance!

Hi! How big are those files? Can you also try to kill the process with Ctrl+C and paste the KeyboardInterrupt traceback? That should show where in the code it was hanging.

Good idea. Here is the stack trace after stopping the process.

^CTraceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "/Users/***/workspace/Kaggle/DisasterTweets/test_local_load.py", line 4, in <module>
    disaster = load_dataset("./dataset/disaster/")
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/builder.py", line 891, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 139, in _split_generators
    data_files = dl_manager.download_and_extract(self.config.data_files)
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/download/download_manager.py", line 447, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/download/download_manager.py", line 419, in extract
    extracted_paths = map_nested(
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 472, in map_nested
    mapped = pool.map(_single_map_nested, split_kwds)
  File "/Users/***/anaconda3/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/***/anaconda3/lib/python3.9/multiprocessing/pool.py", line 765, in get
    self.wait(timeout)
  File "/Users/***/anaconda3/lib/python3.9/multiprocessing/pool.py", line 762, in wait
    self._event.wait(timeout)
  File "/Users/***/anaconda3/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "/Users/***/anaconda3/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt

(huggingf01) user@MacBook-Pro DisasterTweets % /Users/***/anaconda3/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The files are small; each one is under 700K:

du -k dataset/disaster/*.csv
412	dataset/disaster/test.csv
688	dataset/disaster/train.csv
280	dataset/disaster/validation.csv

The hang happens when multiprocessing kicks in. I suspect it’s because your script doesn’t guard its entry point with

if __name__ == "__main__":

Without that guard, each subprocess re-runs your whole script on import and calls load_dataset again inside the subprocess.
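To illustrate the failure mode (a minimal runnable sketch, not the actual datasets internals), this shows the guard pattern. The same fix applied to test_local_load.py would mean moving the load_dataset call under the guard:

```python
import multiprocessing

def square(x):
    return x * x

# On "spawn" platforms (the macOS default since Python 3.8), every
# worker process re-imports this module. Any unguarded top-level code,
# such as a bare load_dataset() call, therefore runs again in each
# worker, which can recursively spawn more workers and deadlock.
# With the guard, only the original process creates the Pool.
if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```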

Thanks for looking into this. To be honest, after recreating my conda environment the issue no longer occurs, but I’ll keep your point in mind in case it happens again.

Cool! FYI, this is the kind of thing that can happen with Python >= 3.9 due to changes in how multiprocessing works.
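For context: the relevant change is that on macOS the default multiprocessing start method became "spawn" in Python 3.8, while "fork" (the Linux default) does not re-import the main module. You can inspect which method your interpreter uses (a quick check, not a fix):

```python
import multiprocessing

# "spawn" starts each worker as a fresh interpreter that re-imports
# the main module; "fork" clones the parent process and does not.
print(multiprocessing.get_start_method())  # e.g. "spawn" on macOS, "fork" on Linux
```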

Good to know, thanks for your help.