While trying to download a large dataset (~100 GB) without streaming mode, like this:

```python
from datasets import load_dataset

mc4_dataset = load_dataset("mc4", "hi")
```
I first got an error:
```
multiprocessing.pool.RemoteTraceback: ConnectionError: Couldn't reach https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/multilingual/c4-hi.tfrecord-00709-of-01024.json.gz
```
On running the same two-line script again, the downloads resumed, but the script then crashed with a single-line message:

```
Segmentation fault (core dumped)
```
Rerunning the same script once more gives the following message:
```
Downloading and preparing dataset mc4/hi (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/user/.cache/huggingface/datasets/mc4/hi/0.0.0/a2bc8f2c4d913b8b16fac4d1a63d673fa6cb22859520dcac7f193feec1f00cae...
Segmentation fault (core dumped)
```
Any suggestions on how to debug this error?
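For context, one generic way to get more signal out of a silent "Segmentation fault (core dumped)" is Python's built-in `faulthandler` module, which prints the Python-level traceback when a fatal signal arrives. This is a hedged debugging sketch, not something from the failing run itself:

```python
# Sketch: enable faulthandler so a future SIGSEGV prints the Python
# traceback that was executing when the native crash happened.
import faulthandler

faulthandler.enable()  # installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS

# ...then run the failing call, e.g.:
# from datasets import load_dataset
# mc4_dataset = load_dataset("mc4", "hi")

print(faulthandler.is_enabled())  # → True once enabled
```

The same effect is available without code changes via `python -X faulthandler script.py`.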
Current cache state:

- There's a lock file in `~/.cache/huggingface/datasets/mc4/hi/`.
- `~/.cache/huggingface/datasets/mc4/hi/` also contains a hash-named ".incomplete" directory, which is empty.
- `~/.cache/huggingface/datasets/downloads/` contains a lot of hash-id files and locks.
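To enumerate that leftover state, here is a minimal inspection sketch (the cache path is the one mentioned above; the existence guard is an assumption in case the cache lives elsewhere):

```python
# Sketch: list leftover .lock files and *.incomplete dirs under the
# datasets cache to see what a resumed run would pick up.
from pathlib import Path

cache = Path.home() / ".cache" / "huggingface" / "datasets"
if cache.exists():
    for lock in sorted(cache.rglob("*.lock")):
        print("lock:", lock)
    for part in sorted(cache.rglob("*.incomplete")):
        print("incomplete:", part)
```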
In this state, is there anything we could do to repair it and continue without having to re-download the entire dataset from scratch? Also, in the worst case, is there a `datasets` equivalent of an `rm -r` purge? Not only `mc4` but, I think, the lock files and the `downloads` dir contents will need to go away.
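If a manual purge does turn out to be the only option, a hedged shell sketch of the targeted `rm -r` (directory names taken from the cache layout above; double-check the paths before running):

```shell
# Sketch: purge only the partially built mc4 dataset and the shared
# downloads dir, leaving other cached datasets intact.
CACHE="${CACHE:-$HOME/.cache/huggingface/datasets}"

rm -rf "$CACHE/mc4"        # partially generated Arrow files, lock, *.incomplete dir
rm -rf "$CACHE/downloads"  # raw downloaded shards and their .lock files
```

Alternatively, `load_dataset` accepts `download_mode="force_redownload"`, though that re-downloads everything rather than repairing the existing cache.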