Segmentation fault (core dumped) with datasets

While trying to download a large dataset (~100 GB) without streaming mode, like this:

from datasets import load_dataset
mc4_dataset = load_dataset("mc4", "hi")

I first got an error:

multiprocessing.pool.RemoteTraceback: 
ConnectionError: Couldn't reach https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/multilingual/c4-hi.tfrecord-00709-of-01024.json.gz

On running the same two-line script again, the downloads resumed but then crashed with the single-line message Segmentation fault (core dumped).

Rerunning the same script once more gives the following message:

Downloading and preparing dataset mc4/hi (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/user/.cache/huggingface/datasets/mc4/hi/0.0.0/a2bc8f2c4d913b8b16fac4d1a63d673fa6cb22859520dcac7f193feec1f00cae...
Segmentation fault (core dumped)

Any suggestions on how to debug this error?

  • There’s a lock file in ~/.cache/huggingface/datasets/
  • ~/.cache/huggingface/datasets/mc4/hi/ contains a <hash>.incomplete directory, which is empty
  • ~/.cache/huggingface/datasets/downloads/ contains many hash-named files and .lock files
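The leftover state described above can be enumerated with a short stdlib sketch (assuming the default cache location; `find_stale_cache_files` is an illustrative helper, not a `datasets` API):

```python
from pathlib import Path

def find_stale_cache_files(cache_dir="~/.cache/huggingface/datasets"):
    """List leftover .lock files and .incomplete entries under cache_dir."""
    cache = Path(cache_dir).expanduser()
    locks = sorted(cache.rglob("*.lock"))
    incompletes = sorted(cache.rglob("*.incomplete"))  # matches files and dirs
    return locks, incompletes

locks, incompletes = find_stale_cache_files()
for p in locks + incompletes:
    print(p)
```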

In this state, is there anything we can do to repair it and continue without having to re-download the entire dataset from scratch? Also, in the worst case, is there a datasets alternative to an rm -r purge? Not only mc4 but, I think, the lock files and the downloads directory contents will need to go away as well.
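For the worst case, the rm -r purge asked about above can also be done from Python with the stdlib. A sketch assuming the default cache layout (note that removing the shared downloads directory deletes cached files for other datasets too; `purge_dataset_cache` is a hypothetical helper, not a `datasets` API):

```python
import shutil
from pathlib import Path

def purge_dataset_cache(cache_dir="~/.cache/huggingface/datasets",
                        subdirs=("mc4", "downloads")):
    """Remove the partially built dataset dir and the shared downloads dir."""
    removed = []
    for sub in subdirs:
        target = Path(cache_dir).expanduser() / sub
        if target.exists():
            shutil.rmtree(target)
            removed.append(str(target))
    return removed
```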

@lhoestq have you seen this before by any chance?

If some files are corrupted, I think you will have to re-download the dataset (pass the parameter download_mode="force_redownload" to load_dataset), unfortunately.

Another solution would be to re-download only the files that you think are corrupted. To do so, you can use this code to find out which file in the cache directory corresponds to which URL:

from datasets import cached_path

url = "https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/multilingual/c4-hi.tfrecord-00709-of-01024.json.gz"
cache_dir = "~/.cache/huggingface/datasets/downloads/"
path = cached_path(url, cache_dir=cache_dir)
print(path)
# /.cache/huggingface/datasets/downloads/d487f651e3538b42eb6c8e8b70b347d3df3d3655bcde0c2c0e99601af5fb5542.ead117e326992266f523ac19fb4348a433bcb6541c511bf0a2de7fc6641a6876

You can delete the files that you think are corrupted and then load your dataset again; it will re-download the missing files.
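Note that cached_path above would re-download a file that is already missing, so to delete a suspect download without fetching it first, you can compute its cache filename directly. A sketch assuming the naming scheme the downloads cache appears to use, sha256(url) optionally suffixed with a second hash derived from the ETag (consistent with the two-part filename in the example output above):

```python
import hashlib
from pathlib import Path

def delete_cached_download(url, cache_dir="~/.cache/huggingface/datasets/downloads"):
    """Delete the cached file(s) whose name starts with sha256(url),
    including any .json metadata and .lock sidecar files."""
    prefix = hashlib.sha256(url.encode("utf-8")).hexdigest()
    removed = []
    for p in Path(cache_dir).expanduser().glob(prefix + "*"):
        p.unlink()
        removed.append(p.name)
    return removed
```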
