load_dataset hangs with local files

I’m trying to load a local dataset with load_dataset, but the call never returns; it hangs forever. Here are the details:

Environment:

Python 3.9.12 (main, Apr  5 2022, 01:53:17)
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin

conda 22.9.0

datasets==2.7.0

Local data files

The contents of the data folder are:

./dataset/disaster (relative directory)
    train.csv
    validation.csv
    test.csv

Python code

Using the following Python code in test_local_load.py:

from datasets import load_dataset

try:
    disaster = load_dataset("./dataset/disaster/")
    print(f"the type {type(disaster)}\n{disaster}")
except Exception as e1:
    # Print the exception itself so failures are visible, not just a generic message
    print(f"First approach failed: {e1}")

Output

I get the following output:

python test_local_load.py
Using custom data configuration disaster-9428d3f8c9e1b41b
Downloading and preparing dataset csv/disaster to /Users/fordaz/.cache/huggingface/datasets/csv/disaster-9428d3f8c9e1b41b/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...
Downloading data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 7772.03it/s]
Using custom data configuration disaster-9428d3f8c9e1b41b
Using custom data configuration disaster-9428d3f8c9e1b41b
Using custom data configuration disaster-9428d3f8c9e1b41b

The problem is that after these messages, it hangs forever.

Any hints will be appreciated, thanks in advance!

Hi! How big are those files? Can you also try to kill the process with Ctrl+C and paste the KeyboardInterrupt traceback? That should show where in the code it was hanging.

Good idea. Here is the stack trace after stopping the process.

^CTraceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
Traceback (most recent call last):
  File "/Users/***/workspace/Kaggle/DisasterTweets/test_local_load.py", line 4, in <module>
    disaster = load_dataset("./dataset/disaster/")
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 1741, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/builder.py", line 822, in download_and_prepare
    self._download_and_prepare(
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/builder.py", line 891, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/csv/csv.py", line 139, in _split_generators
    data_files = dl_manager.download_and_extract(self.config.data_files)
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/download/download_manager.py", line 447, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/download/download_manager.py", line 419, in extract
    extracted_paths = map_nested(
  File "/Users/***/anaconda3/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 472, in map_nested
    mapped = pool.map(_single_map_nested, split_kwds)
  File "/Users/***/anaconda3/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/***/anaconda3/lib/python3.9/multiprocessing/pool.py", line 765, in get
    self.wait(timeout)
  File "/Users/***/anaconda3/lib/python3.9/multiprocessing/pool.py", line 762, in wait
    self._event.wait(timeout)
  File "/Users/***/anaconda3/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "/Users/***/anaconda3/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt

(huggingf01) user@MacBook-Pro DisasterTweets % /Users/***/anaconda3/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The files are small; each one is under 700K:

du -k dataset/disaster/*.csv
412	dataset/disaster/test.csv
688	dataset/disaster/train.csv
280	dataset/disaster/validation.csv

The hang happens when multiprocessing kicks in. I suspect it’s because your script doesn’t guard its entry point with

if __name__ == "__main__":

Without that guard, each subprocess re-runs your whole script on import and calls load_dataset again inside the subprocess.
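To illustrate the failure mode (a minimal runnable sketch, not the actual datasets internals), this shows the guard pattern. The same fix applied to test_local_load.py would mean moving the load_dataset call under the guard:

```python
import multiprocessing

def square(x):
    return x * x

# On "spawn" platforms (the macOS default since Python 3.8), every
# worker process re-imports this module. Any unguarded top-level code,
# such as a bare load_dataset() call, therefore runs again in each
# worker, which can recursively spawn more workers and deadlock.
# With the guard, only the original process creates the Pool.
if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```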

Thanks for looking into this. To be honest, after recreating my conda environment the issue no longer occurs, but I’ll keep your point in mind in case it happens again.

Cool! FYI, this is the kind of thing that can happen with Python >= 3.9 due to changes in how multiprocessing works.
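For context: the relevant change is that on macOS the default multiprocessing start method became "spawn" in Python 3.8, while "fork" (the Linux default) does not re-import the main module. You can inspect which method your interpreter uses (a quick check, not a fix):

```python
import multiprocessing

# "spawn" starts each worker as a fresh interpreter that re-imports
# the main module; "fork" clones the parent process and does not.
print(multiprocessing.get_start_method())  # e.g. "spawn" on macOS, "fork" on Linux
```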

Good to know, thanks for your help.