Segmentation fault (core dumped) with datasets

While trying to download a large dataset (~100 GB) without streaming mode, like this:

from datasets import load_dataset
mc4_dataset = load_dataset("mc4", "hi")

I first got an error:

multiprocessing.pool.RemoteTraceback: 
ConnectionError: Couldn't reach https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/multilingual/c4-hi.tfrecord-00709-of-01024.json.gz

On running the same two-line script again, the downloads resumed but then crashed with the single-line message Segmentation fault (core dumped).

Rerunning the same script once more gives the following message:

Downloading and preparing dataset mc4/hi (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/user/.cache/huggingface/datasets/mc4/hi/0.0.0/a2bc8f2c4d913b8b16fac4d1a63d673fa6cb22859520dcac7f193feec1f00cae...
Segmentation fault (core dumped)

Any suggestions on how to debug this error?

  • There’s a lock file in ~/.cache/huggingface/datasets/
  • ~/.cache/huggingface/datasets/mc4/hi/ contains a <hash>.incomplete directory, which is empty
  • ~/.cache/huggingface/datasets/downloads/ contains many hash-named files and .lock files
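The leftover state described above can be enumerated with a short stdlib sketch (assuming the default cache location; `find_stale_cache_files` is an illustrative helper, not a `datasets` API):

```python
from pathlib import Path

def find_stale_cache_files(cache_dir="~/.cache/huggingface/datasets"):
    """List leftover .lock files and .incomplete entries under cache_dir."""
    cache = Path(cache_dir).expanduser()
    locks = sorted(cache.rglob("*.lock"))
    incompletes = sorted(cache.rglob("*.incomplete"))  # matches files and dirs
    return locks, incompletes

locks, incompletes = find_stale_cache_files()
for p in locks + incompletes:
    print(p)
```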

In this state, is there anything we can do to repair it and continue without having to re-download the entire dataset from scratch? Also, in the worst case, is there a datasets alternative to an rm -r purge? Not only mc4 but, I think, the lock files and the downloads directory contents will need to go away as well.
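For the worst case, the rm -r purge asked about above can also be done from Python with the stdlib. A sketch assuming the default cache layout (note that removing the shared downloads directory deletes cached files for other datasets too; `purge_dataset_cache` is a hypothetical helper, not a `datasets` API):

```python
import shutil
from pathlib import Path

def purge_dataset_cache(cache_dir="~/.cache/huggingface/datasets",
                        subdirs=("mc4", "downloads")):
    """Remove the partially built dataset dir and the shared downloads dir."""
    removed = []
    for sub in subdirs:
        target = Path(cache_dir).expanduser() / sub
        if target.exists():
            shutil.rmtree(target)
            removed.append(str(target))
    return removed
```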

@lhoestq have you seen this before by any chance?

If some files are corrupted, I think you will have to re-download the dataset (pass the parameter download_mode="force_redownload" to load_dataset), unfortunately.

Another solution would be to re-download only the files that you think are corrupted. To do so, you can use this code to find out which file in the cache directory corresponds to which URL:

from datasets import cached_path

url = "https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/multilingual/c4-hi.tfrecord-00709-of-01024.json.gz"
cache_dir = "~/.cache/huggingface/datasets/downloads/"
path = cached_path(url, cache_dir=cache_dir)
print(path)
# /.cache/huggingface/datasets/downloads/d487f651e3538b42eb6c8e8b70b347d3df3d3655bcde0c2c0e99601af5fb5542.ead117e326992266f523ac19fb4348a433bcb6541c511bf0a2de7fc6641a6876

You can delete the files that you think are corrupted and then load your dataset again; it will re-download the missing files.
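Note that cached_path above would re-download a file that is already missing, so to delete a suspect download without fetching it first, you can compute its cache filename directly. A sketch assuming the naming scheme the downloads cache appears to use, sha256(url) optionally suffixed with a second hash derived from the ETag (consistent with the two-part filename in the example output above):

```python
import hashlib
from pathlib import Path

def delete_cached_download(url, cache_dir="~/.cache/huggingface/datasets/downloads"):
    """Delete the cached file(s) whose name starts with sha256(url),
    including any .json metadata and .lock sidecar files."""
    prefix = hashlib.sha256(url.encode("utf-8")).hexdigest()
    removed = []
    for p in Path(cache_dir).expanduser().glob(prefix + "*"):
        p.unlink()
        removed.append(p.name)
    return removed
```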
