While trying to download a large dataset (~100 GB) without streaming mode, like this:

```python
from datasets import load_dataset

mc4_dataset = load_dataset("mc4", "hi")
```
I first got an error:
```
multiprocessing.pool.RemoteTraceback: ConnectionError: Couldn't reach https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/multilingual/c4-hi.tfrecord-00709-of-01024.json.gz
```
On running the same two-line script again, the downloads resumed, but the script then crashed with a single-line message:

```
Segmentation fault (core dumped)
```
Rerunning the same script once more gives the following message:
```
Downloading and preparing dataset mc4/hi (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/user/.cache/huggingface/datasets/mc4/hi/0.0.0/a2bc8f2c4d913b8b16fac4d1a63d673fa6cb22859520dcac7f193feec1f00cae...
Segmentation fault (core dumped)
```
Any suggestions on how to debug this error?
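For context, one generic way to get more signal out of a silent "Segmentation fault (core dumped)" is Python's built-in `faulthandler` module, which prints the Python-level traceback when a fatal signal arrives. This is a hedged debugging sketch, not something from the failing run itself:

```python
# Sketch: enable faulthandler so a future SIGSEGV prints the Python
# traceback that was executing when the native crash happened.
import faulthandler

faulthandler.enable()  # installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS

# ...then run the failing call, e.g.:
# from datasets import load_dataset
# mc4_dataset = load_dataset("mc4", "hi")

print(faulthandler.is_enabled())  # → True once enabled
```

The same effect is available without code changes via `python -X faulthandler script.py`.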
Current cache state:

- There's a lock file in `~/.cache/huggingface/datasets/mc4/hi/`.
- `~/.cache/huggingface/datasets/mc4/hi/` also contains a hash-named ".incomplete" directory, which is empty.
- `~/.cache/huggingface/datasets/downloads/` contains a lot of hash-id files and locks.
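To enumerate that leftover state, here is a minimal inspection sketch (the cache path is the one mentioned above; the existence guard is an assumption in case the cache lives elsewhere):

```python
# Sketch: list leftover .lock files and *.incomplete dirs under the
# datasets cache to see what a resumed run would pick up.
from pathlib import Path

cache = Path.home() / ".cache" / "huggingface" / "datasets"
if cache.exists():
    for lock in sorted(cache.rglob("*.lock")):
        print("lock:", lock)
    for part in sorted(cache.rglob("*.incomplete")):
        print("incomplete:", part)
```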
In this state, is there anything we could do to repair it and continue without having to re-download the entire dataset from scratch? Also, in the worst case, is there a `datasets` equivalent of an `rm -r` purge? Not only `mc4` but, I think, the lock files and the `downloads` dir contents will need to go away.
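If a manual purge does turn out to be the only option, a hedged shell sketch of the targeted `rm -r` (directory names taken from the cache layout above; double-check the paths before running):

```shell
# Sketch: purge only the partially built mc4 dataset and the shared
# downloads dir, leaving other cached datasets intact.
CACHE="${CACHE:-$HOME/.cache/huggingface/datasets}"

rm -rf "$CACHE/mc4"        # partially generated Arrow files, lock, *.incomplete dir
rm -rf "$CACHE/downloads"  # raw downloaded shards and their .lock files
```

Alternatively, `load_dataset` accepts `download_mode="force_redownload"`, though that re-downloads everything rather than repairing the existing cache.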